Reading XML in Spark and Scala
I have the XML below, and I am reading it with Scala and Spark:
<TABLES>
<TABLE attrname="Red">
<ROWDATA>
<ROW Type="solid" track="0" Unit="0"/>
</ROWDATA>
</TABLE>
<TABLE attrname="Blue">
<ROWDATA>
<ROW Type="light" track="0" Unit="0"/>
<ROW Type="solid" track="0" Unit="0"/>
<ROW Type="solid" track="0" Unit="0"/>
</ROWDATA>
</TABLE>
</TABLES>
Check the code below:
import com.databricks.spark.xml._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import session.implicits._

val xmlDFF = session.read
  .option("rowTag", "TABLES")
  .xml(filePath)
  .withColumn("TABLE", explode_outer($"TABLE"))
  .select(
    row_number().over(Window.partitionBy(lit(1)).orderBy(lit(1))).as("obj_id"),
    $"TABLE.*",
    explode_outer($"TABLE.ROWDATA.ROW").as("row")
  )
  .select($"obj_id", $"_attrname", explode_outer(array(
    struct(
      lit("Type").as("Name"),
      $"row._Type".as("Value")
    ),
    struct(
      lit("track").as("Name"),
      $"row._track".as("Value")
    ),
    struct(
      lit("Unit").as("Name"),
      $"row._Unit".as("Value")
    )
  )).as("row"))
  .select(
    $"obj_id",
    $"_attrname".as("Type"),
    $"row.*"
  )
  .orderBy($"obj_id")
  .show(false)
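As context for the column names used above: spark-xml maps each XML attribute to a column whose name carries an attribute prefix, which defaults to "_" (that is why the query refers to `_attrname`, `row._Type`, and so on). A minimal sketch, assuming the spark-xml package is on the classpath and a SparkSession named `session`:

```scala
// Sketch only: inspect the schema spark-xml infers for this file.
// Attributes such as attrname="Red" and Type="solid" surface as the
// underscore-prefixed columns _attrname and _Type; repeated elements
// such as <TABLE> and <ROW> are inferred as arrays, which is why the
// main query needs explode_outer before it can read them row by row.
import com.databricks.spark.xml._

val raw = session.read
  .option("rowTag", "TABLES")
  .xml(filePath)

// Prints a nested schema resembling (abridged):
// TABLE: array of struct(_attrname, ROWDATA: struct(ROW: array of
// struct(_Type, _track, _Unit)))
raw.printSchema()
```

The prefix is configurable via the `attributePrefix` read option if "_" clashes with element names in your data.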
Output:
+------+----+-----+-----+
|obj_id|Type|Name |Value|
+------+----+-----+-----+
|1 |Red |track|0 |
|1 |Red |Type |solid|
|1 |Red |Unit |0 |
|2 |Blue|Type |light|
|2 |Blue|Unit |0 |
|2 |Blue|track|0 |
|3 |Blue|Unit |0 |
|3 |Blue|Type |solid|
|3 |Blue|track|0 |
|4 |Blue|Type |solid|
|4 |Blue|track|0 |
|4 |Blue|Unit |0 |
+------+----+-----+-----+
How is the obj_id column value derived? It appears to be assigned per row, as if each row gets a unique number.
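The obj_id comes from `row_number().over(Window.partitionBy(lit(1)).orderBy(lit(1)))`: partitioning by the constant `lit(1)` puts every row into a single window partition, and ordering by a constant makes row_number() hand out 1, 2, 3, ... in an arbitrary but sequential order. The output (obj_id 1 for Red's single ROW, 2-4 for Blue's three) shows that the numbering is applied after `explode_outer($"TABLE.ROWDATA.ROW")` in the same select, so each exploded `<ROW>` element receives its own number; the later attribute explode then copies that number onto its three Name/Value rows. A minimal sketch of the numbering step alone, assuming a SparkSession named `session`:

```scala
// Sketch only: row_number over a constant partition and a constant order.
// All rows land in one window partition, so the function simply counts
// them off sequentially.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import session.implicits._

val rows = Seq("solid", "light", "solid", "solid").toDF("Type")
val numbered = rows.withColumn(
  "obj_id",
  row_number().over(Window.partitionBy(lit(1)).orderBy(lit(1)))
)
numbered.show()
// obj_id takes the values 1..4, one per input row. Which Type gets which
// number is NOT guaranteed: ordering by a constant is non-deterministic.
```

Two caveats with this pattern: the ordering is non-deterministic (if you need stable ids, order by a real column), and a single-partition window forces all data through one task, which does not scale to large inputs.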