XML: split a WrappedArray into multiple rows and columns
I'm new to Scala. I'm trying to explode a WrappedArray, without success. I have a DataFrame containing a single row of data that I converted from XML.

If I run df.printSchema I get:
root
|-- WrappedArray: struct (nullable = true)
| |-- Response: struct (nullable = true)
| | |-- Result: struct (nullable = true)
| | | |-- Cols: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- col1: string (nullable = true)
| | | | | |-- col2: string (nullable = true)
| | | | | |-- col3: string (nullable = true)
| | | | | |-- col4: string (nullable = true)
| | | | | |-- col5: long (nullable = true)
| | |-- _xmlns: string (nullable = true)
If I run df.head() I get:
[[[[WrappedArray([1,2019-11-29T00:00:00,06:00,1 Center1,55]
, [2,2020-03-28T00:00:00,06:00,2 Center2,57]
, [3,2020-07-01T00:00:00,06:00,3 Center3,58])],https://centers.net/]]]
I'd like to end up with a DataFrame of 5 columns, like this:
col1 col2 col3 col4 col5
1 2019-11-29T00:00:00 06:00 1 Center1 55
2 2020-03-28T00:00:00 06:00 2 Center2 57
3 2020-07-01T00:00:00 06:00 3 Center3 58
I've seen many posts on StackOverflow similar to mine, but the cases are slightly different because the WrappedArrays there had already been split into multiple rows. I've tried adapting those solutions (i.e. collection.mutable.WrappedArray) to my situation, but I'm new to Scala and it has been very difficult for me.
Could you help me?

You can do this using the Spark DataFrame DSL:
import org.apache.spark.sql.functions.{col, explode}

df.withColumn("exploded", explode(col("WrappedArray.Response.Result.Cols")))
  .select(
    col("exploded.col1").as("col1"),
    col("exploded.col2").as("col2"),
    col("exploded.col3").as("col3"),
    col("exploded.col4").as("col4"),
    col("exploded.col5").as("col5")
  )
This explodes the array in your schema, creating one row per array element, and then selects each col field of the struct elements into its own top-level column.
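If it helps to see the semantics without a Spark cluster, explode behaves like flatMap over the nested array: a single input row holding an array of n elements becomes n output rows. Here is a plain-Scala sketch of that idea; the case classes are hypothetical and just mirror the struct fields from the question's schema:

```scala
// Hypothetical case classes mirroring the nested XML schema from the question.
case class Col(col1: String, col2: String, col3: String, col4: String, col5: Long)
case class Result(cols: Seq[Col])

object ExplodeSketch {
  def main(args: Array[String]): Unit = {
    // One "row" holding an array of three elements, like the DataFrame above.
    val result = Result(Seq(
      Col("1", "2019-11-29T00:00:00", "06:00", "1 Center1", 55L),
      Col("2", "2020-03-28T00:00:00", "06:00", "2 Center2", 57L),
      Col("3", "2020-07-01T00:00:00", "06:00", "3 Center3", 58L)
    ))

    // explode ~ flatMap: the single row fans out into one row per array element.
    val rows: Seq[Col] = Seq(result).flatMap(_.cols)
    rows.foreach(println)
  }
}
```

This is only an analogy for the row-multiplying behavior; in Spark itself the explode column function shown above does this for you inside the query plan, without collecting data to the driver.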