XML: split a WrappedArray into multiple rows and columns
I'm new to Scala. I'm trying to explode a WrappedArray, without success. I have a DataFrame containing a single row of data that I converted from XML.

If I run df.printSchema I get:
root
|-- WrappedArray: struct (nullable = true)
| |-- Response: struct (nullable = true)
| | |-- Result: struct (nullable = true)
| | | |-- Cols: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- col1: string (nullable = true)
| | | | | |-- col2: string (nullable = true)
| | | | | |-- col3: string (nullable = true)
| | | | | |-- col4: string (nullable = true)
| | | | | |-- col5: long (nullable = true)
| | |-- _xmlns: string (nullable = true)
If I run df.head() I get:
[[[[WrappedArray([1,2019-11-29T00:00:00,06:00,1 Center1,55]
, [2,2020-03-28T00:00:00,06:00,2 Center2,57]
, [3,2020-07-01T00:00:00,06:00,3 Center3,58])],https://centers.net/]]]
I'd like to end up with a DataFrame of 5 columns, like this:
col1 col2 col3 col4 col5
1 2019-11-29T00:00:00 06:00 1 Center1 55
2 2020-03-28T00:00:00 06:00 2 Center2 57
3 2020-07-01T00:00:00 06:00 3 Center3 58
I've seen many posts on StackOverflow similar to mine, but the cases are slightly different because the WrappedArrays there had already been split into multiple rows. I've tried adapting those solutions (i.e. collection.mutable.WrappedArray) to my situation, but I'm new to Scala and it has been very difficult for me.
Could you help me?

You can do this using the Spark DataFrame DSL:
import org.apache.spark.sql.functions.{col, explode}

df.withColumn("exploded", explode(col("WrappedArray.Response.Result.Cols")))
  .select(
    col("exploded.col1").as("col1"),
    col("exploded.col2").as("col2"),
    col("exploded.col3").as("col3"),
    col("exploded.col4").as("col4"),
    col("exploded.col5").as("col5")
  )
This explodes the array in your schema, creating one row per array element, and then selects each col field of the struct elements into its own top-level column.
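If it helps to see the semantics without a Spark cluster, explode behaves like flatMap over the nested array: a single input row holding an array of n elements becomes n output rows. Here is a plain-Scala sketch of that idea; the case classes are hypothetical and just mirror the struct fields from the question's schema:

```scala
// Hypothetical case classes mirroring the nested XML schema from the question.
case class Col(col1: String, col2: String, col3: String, col4: String, col5: Long)
case class Result(cols: Seq[Col])

object ExplodeSketch {
  def main(args: Array[String]): Unit = {
    // One "row" holding an array of three elements, like the DataFrame above.
    val result = Result(Seq(
      Col("1", "2019-11-29T00:00:00", "06:00", "1 Center1", 55L),
      Col("2", "2020-03-28T00:00:00", "06:00", "2 Center2", 57L),
      Col("3", "2020-07-01T00:00:00", "06:00", "3 Center3", 58L)
    ))

    // explode ~ flatMap: the single row fans out into one row per array element.
    val rows: Seq[Col] = Seq(result).flatMap(_.cols)
    rows.foreach(println)
  }
}
```

This is only an analogy for the row-multiplying behavior; in Spark itself the explode column function shown above does this for you inside the query plan, without collecting data to the driver.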