Hadoop: using the JSON SerDe in a Hive table
I am trying out a JSON SerDe with Hive. I added the JSON SerDe jar with:
ADD JAR /path-to/hive-json-serde.jar;
and loaded the data with:
LOAD DATA LOCAL INPATH '/home/hduser/pradi/Test.json' INTO TABLE my_table;
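The data loaded successfully.
But when I query the data with
select * from my_table;
I get only one row from the table:
data1 100 more data1 123.001
The JSON file contains: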
{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
{"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
{"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}
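What is the problem? Why does querying the table return only one row instead of four? The file under /user/hive/warehouse/my_table contains all four rows.
I have posted the contents of the Test.json file above, so you can see that the query yields only one row: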
data1 100 more data1 123.001
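I then changed the JSON file to employees.json, which contains { "firstName" : "Mike", "lastName" : "Chepesky", "employeeNumber" : 1840192 }, and changed the table to match, but querying the table now shows NULL values: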
hive> add jar /home/hduser/pradi/hive-json-serde-0.2.jar;
Added /home/hduser/pradi/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradi/hive-json-serde-0.2.jar
hive> create EXTERNAL table employees_json (firstName string, lastName string, employeeNumber int )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
OK
Time taken: 0.297 seconds
hive> load data local inpath '/home/hduser/pradi/employees.json' into table employees_json;
Copying data from file:/home/hduser/pradi/employees.json
Copying file: file:/home/hduser/pradi/employees.json
Loading data to table default.employees_json
OK
Time taken: 0.293 seconds
hive>select * from employees_json;
OK
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
Time taken: 0.194 seconds
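One possible cause of the NULL rows, assuming employees.json is pretty-printed with the object spread over several lines: a Hive table stored as plain text hands the SerDe one line at a time, so each line has to be a complete JSON object. A minimal Python sketch that rewrites such content with one object per line (the function name `to_json_lines` is hypothetical and not part of Hive or the SerDe):

```python
import json


def to_json_lines(text):
    """Return the JSON values found in `text`, one compact value per line.

    Accepts objects that are pretty-printed over several lines and/or
    concatenated one after another.
    """
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        records.append(json.dumps(obj, separators=(",", ":")))
    return "\n".join(records)


if __name__ == "__main__":
    pretty = """{
      "firstName" : "Mike",
      "lastName" : "Chepesky",
      "employeeNumber" : 1840192
    }"""
    print(to_json_lines(pretty))
```

The rewritten file can then be loaded with the same LOAD DATA statement as before.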
Without the logs it is hard to tell what is going on. Just a quick idea: could you check whether it works with SERDEPROPERTIES, like this:
CREATE EXTERNAL TABLE my_table (field1 string, field2 int,
field3 string, field4 double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES (
"field1"="$.field1",
"field2"="$.field2",
"field3"="$.field3",
"field4"="$.field4"
);
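ThinkBigAnalytics also offers a solution you may want to try.
Update: the input in Test.json is not valid JSON, which is why the records get collapsed.
See the answer for more details.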
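Assuming the table is stored as plain text, so that every line of the input file becomes one record, a quick sanity check is to parse each line on its own before loading. A minimal sketch using Python's standard `json` module; `check_json_lines` is a hypothetical helper name, not something Hive provides:

```python
import json


def check_json_lines(lines):
    """Return (line_number, message) pairs for lines that are not a single
    valid JSON object. Blank lines are ignored."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, exc.msg))
            continue
        if not isinstance(value, dict):
            problems.append((number, "not a JSON object"))
    return problems


if __name__ == "__main__":
    sample = [
        '{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}',
        '{"field1":"data2","field2":200',  # truncated on purpose
    ]
    for number, message in check_json_lines(sample):
        print(f"line {number}: {message}")
```

An empty result only means that each line parses; it says nothing about whether the keys match the table's columns.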
I solved a similar problem as follows:
LOAD DATA LOCAL INPATH '${env:HOME}/path to json'
OVERWRITE INTO TABLE messages;
select * from messages;
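Hi Michael, the same problem occurs even after creating the table with SERDEPROPERTIES as you suggested. I checked the logs but could not make anything out of them.
Yes, it is the same as what you wrote above. But what is the difference between CREATE TABLE and CREATE EXTERNAL TABLE? Also, when I drop this table (created as external), its data is not deleted from HDFS.
EXTERNAL means Hive does not own the data; only the table's metadata is removed on drop. Could you update your post with the full output of the query?
Hi Michael, I have posted the full output of the query.
OMG, I was so focused on Hive that I only now noticed your input is invalid. You cannot repeat keys in a JSON object, which is why the records get collapsed during the load phase. Please check it with a linter first to make sure it is valid JSON.
... table, rather than one created by code. It throws a null pointer exception.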
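The repeated-key problem mentioned in the comments is easy to miss, because some parsers accept it silently; Python's `json` module, for example, keeps only the last value for a repeated key. A sketch of a check built on the `object_pairs_hook` parameter of `json.loads`; the helper name `find_duplicate_keys` is made up for illustration:

```python
import json


def find_duplicate_keys(text):
    """Return the keys that occur more than once within any single object
    of the JSON document `text`, in the order they are detected."""
    duplicates = []

    def record_pairs(pairs):
        seen = set()
        for key, _ in pairs:
            if key in seen and key not in duplicates:
                duplicates.append(key)
            seen.add(key)
        # Build the dict as usual so the parsed result stays normal.
        return dict(pairs)

    json.loads(text, object_pairs_hook=record_pairs)
    return duplicates


if __name__ == "__main__":
    print(find_duplicate_keys('{"field1": "data1", "field1": "data2"}'))
```

The hook is called once per object, including nested ones, so repeated keys inside a nested struct are reported as well.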
create table messages (
id int,
creation_date string,
text string,
loggedInUser STRUCT<id:INT, name: STRING>
)
row format serde "org.openx.data.jsonserde.JsonSerDe";
1 2020-03-01 I am on cotroller {"id":1,"name":"API"}
2 2020-04-01 I am on service {"id":1,"name":"API"}