Split an array of structs into single-value columns in Spark Scala


I have a dataframe that contains an array-of-structs column (tests), and I want to split the nested values out and add them to new columns as comma-separated strings.

Sample dataframe and expected result:

tests                            tests_id  tests_name
[id:1,name:foo],[id:2,name:bar]  1, 2     foo, bar
I tried the code below, but got an error:

df.withColumn("tests_name", concat_ws(",", explode(col("tests.name"))))
Error:

org.apache.spark.sql.AnalysisException: Generators are not supported when it's nested in expressions, but got: concat_ws(,, explode(tests.name AS `name`));
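
The error occurs because explode is a generator: Spark only allows generators as top-level expressions in a select or withColumn, never nested inside another function such as concat_ws. One workaround is to explode first and re-aggregate afterwards; a rough sketch, assuming there is some key column (a hypothetical pk) to regroup the rows on:

import org.apache.spark.sql.functions._

// Explode at the top level, then regroup and join the collected values.
// "pk" is a hypothetical key column identifying each original row.
df.select(col("pk"), explode(col("tests")).as("t"))
  .groupBy("pk")
  .agg(
    concat_ws(",", collect_list(col("t.id"))).as("tests_id"),
    concat_ws(",", collect_list(col("t.name"))).as("tests_name"))

The answer below avoids the extra shuffle this causes.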

It depends on which Spark version you are using. Assuming the dataframe schema is as follows:

root
 |-- test: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- name: string (nullable = true) 
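
All the snippets below can be tried against a small dataframe built to match this schema. A minimal sketch, assuming a local SparkSession (the rows mirror the output table further down):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Case class mirroring the element struct of the "test" array column.
case class Test(id: Long, name: String)

val df = Seq(
  Seq(Test(1L, "foo"), Test(2L, "bar")),
  Seq(Test(3L, "foo"), Test(4L, "bar"))
).toDF("test")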
Spark 3.0.0

df.withColumn("id", concat_ws(",", transform($"test", x => x.getField("id"))))
  .withColumn("name", concat_ws(",", transform($"test", x => x.getField("name"))))
.show(false)
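
The transform(Column, Column => Column) overload used here was added to the Scala API in Spark 3.0.0; on 2.4 the same higher-order function is only reachable through SQL, as shown next.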
Spark 2.4.0+

df.withColumn("id", concat_ws(",", expr("transform(test, x -> x.id)")))
.withColumn("name", concat_ws(",", expr("transform(test, x -> x.name)")))
.show(false)
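
concat_ws then joins the array produced by transform into a single comma-separated string; it also skips null elements rather than failing on them.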
Spark < 2.4

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Extract each struct field into an array column via a UDF.
val extract_id = udf((test: Seq[Row]) => test.map(_.getAs[Long]("id")))
val extract_name = udf((test: Seq[Row]) => test.map(_.getAs[String]("name")))

df.withColumn("id", concat_ws(",", extract_id($"test")))
  .withColumn("name", concat_ws(",", extract_name($"test")))
  .show(false)
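
Because UDFs are opaque to the Catalyst optimizer, the built-in transform variants above are preferable whenever your Spark version supports them.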
Output:

+--------------------+---+-------+
|test                |id |name   |
+--------------------+---+-------+
|[[1, foo], [2, bar]]|1,2|foo,bar|
|[[3, foo], [4, bar]]|3,4|foo,bar|
+--------------------+---+-------+

Can you share the schema of the dataframe? Which version of Spark are you using?