Hadoop: XML parsing with Hive


I am using a Hive SerDe to parse XML and load it into Hive. Sample XML content:

<records>
  <record customer_id="0000-JTALA">
    <income>200000</income>
    <address type="M">
      <Flatno>345</Flatno>
      <Street>ABS</Street>
      <city>QWW</city>
      <country>US</country>
      <pin>3235</pin>
    </address>
    <address type="B">
      <Street>ABS</Street>
      <city>QWW</city>
      <country>US</country>
      <pin>3235</pin>
    </address>
  </record>

  <record customer_id="0001-JTALA">
    <income>200000</income>
    <address type="M">
      <Flatno>45</Flatno>
      <Street>fgBS</Street>
      <city>QWW</city>
      <country>US</country>
      <pin>3235</pin>
    </address>
    <address type="B">
      <Street>ABS</Street>
      <city>QWW</city>
      <country>US</country>
      <pin>325</pin>
    </address>
    <address type="P">
      <Street>ABS</Street>
      <city>QWW</city>
      <country>UK</country>
      <pin>325</pin>
    </address>
  </record>
</records>
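One row should be created per address. For the sample above, the first customer should produce 2 rows and the second customer 3 rows, 5 rows in total. With my current code only two rows are created, one per customer, and all addresses are concatenated in the address columns, so for the first customer the Street column holds the first address's street followed by the second address's street. Sample query: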

CREATE EXTERNAL TABLE msg_details(customer_id STRING, income BIGINT, address_type STRING, Flatno STRING, Street STRING, city STRING, country STRING, pin STRING)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id",
"column.xpath.income"="/record/income/text()",
"column.xpath.address_type"="/record/address/@type",
"column.xpath.Flatno"="/record/address/Flatno/text()",
"column.xpath.Street"="/record/address/Street/text()",
"column.xpath.city"="/record/address/city/text()",
"column.xpath.country"="/record/address/country/text()",
"column.xpath.pin"="/record/address/pin/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location '/user/root/serdeinput'
TBLPROPERTIES (
"xmlinput.start"="<record customer",
"xmlinput.end"="</record>"
);

One approach is to write a custom SerDe for the XML parsing. Another is to write a UDF that splits the array values held in a single column into rows.

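The SerDe you are using is generic. It is roughly equivalent to the XPath support that Hive itself provides, and both offer only limited features for extracting records.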

I tried three other approaches, using LATERAL VIEW among other things, but none of them worked for all of the columns under address.
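One plausible reason the column-by-column approaches break down is that each XPath is evaluated independently, so an address without a Flatno contributes nothing to the Flatno values and positions stop lining up across columns. The sketch below illustrates that effect using Python's standard-library xml.etree.ElementTree, not the SerDe itself. The record is the first sample customer with the closing-tag case corrected, and the helper name column_values is hypothetical.

```python
import xml.etree.ElementTree as ET

# First sample record from the question, with closing tags corrected.
RECORD = """
<record customer_id="0000-JTALA">
  <income>200000</income>
  <address type="M">
    <Flatno>345</Flatno>
    <Street>ABS</Street>
    <city>QWW</city>
    <country>US</country>
    <pin>3235</pin>
  </address>
  <address type="B">
    <Street>ABS</Street>
    <city>QWW</city>
    <country>US</country>
    <pin>3235</pin>
  </address>
</record>
""".strip()


def column_values(record, path):
    """Collect the text of every node matching `path`, one column at a time."""
    return [node.text for node in record.findall(path)]


record = ET.fromstring(RECORD)

# Each "column" is extracted on its own, as with one XPath per column.
types = [address.get("type") for address in record.findall("address")]
flatnos = column_values(record, "address/Flatno")
streets = column_values(record, "address/Street")

print(types)    # one entry per address
print(flatnos)  # shorter: the "B" address has no Flatno element
print(streets)  # one entry per address
```

Because the Flatno list is shorter than the others, pairing values by position would attach a flat number to the wrong address whenever an earlier address lacks one.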

The only solution is to go with a custom SerDe that parses the XML according to your requirements.

create external table msg_details3(customer_id string, income bigint, address_type Array<string>,Flatno Array<string>, Street ARRAY<string>,city ARRAY<string>,country ARRAY<string>,pin ARRAY<string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id",
"column.xpath.income"="/record/income/text()",
"column.xpath.address_type"="/record/address/@type",
"column.xpath.Flatno"="/record/address/Flatno/text()",
"column.xpath.Street"="/record/address/Street/text()",
"column.xpath.city"="/record/address/city/text()",
"column.xpath.country"="/record/address/country/text()",
"column.xpath.pin"="/record/address/pin/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 
location '/user/cloudera/data'
TBLPROPERTIES (
"xmlinput.start"="<record ",
"xmlinput.end"="</record>"
);
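For comparison, here is a minimal sketch of the row-per-address flattening that a custom SerDe, or a preprocessing step run before loading, would have to perform. It is written in standard-library Python rather than as a Hive SerDe, so it is a stand-in for the idea and not the implementation suggested above. The function name flatten_records and the choice to emit an empty field for a missing Flatno are assumptions for illustration. The XML is the question's sample with closing tags corrected and each address condensed to one line.

```python
import xml.etree.ElementTree as ET

# The question's sample data, closing tags corrected, addresses on one line each.
SAMPLE = """<records>
<record customer_id="0000-JTALA">
  <income>200000</income>
  <address type="M"><Flatno>345</Flatno><Street>ABS</Street><city>QWW</city><country>US</country><pin>3235</pin></address>
  <address type="B"><Street>ABS</Street><city>QWW</city><country>US</country><pin>3235</pin></address>
</record>
<record customer_id="0001-JTALA">
  <income>200000</income>
  <address type="M"><Flatno>45</Flatno><Street>fgBS</Street><city>QWW</city><country>US</country><pin>3235</pin></address>
  <address type="B"><Street>ABS</Street><city>QWW</city><country>US</country><pin>325</pin></address>
  <address type="P"><Street>ABS</Street><city>QWW</city><country>UK</country><pin>325</pin></address>
</record>
</records>"""

ADDRESS_FIELDS = ["Flatno", "Street", "city", "country", "pin"]


def flatten_records(xml_text):
    """Yield one tuple per <address>, repeating the customer-level fields."""
    root = ET.fromstring(xml_text)
    for record in root.findall("record"):
        customer_id = record.get("customer_id")
        income = int(record.findtext("income"))
        for address in record.findall("address"):
            # findtext returns None when the child element is absent,
            # so a missing Flatno stays attached to its own address.
            fields = [address.findtext(name) for name in ADDRESS_FIELDS]
            yield (customer_id, income, address.get("type"), *fields)


rows = list(flatten_records(SAMPLE))
for row in rows:
    # Tab-delimited output; None becomes an empty field.
    print("\t".join("" if value is None else str(value) for value in row))
```

Walking the tree address by address keeps each address's fields together, which is what the per-column XPath mapping cannot do. The tab-delimited output could then be loaded into an ordinary delimited-text table, assuming a preprocessing step is acceptable in your pipeline.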

Can anyone help me?