Apache Spark: flattening an array of structs in PySpark


I used the spark-xml package to convert an XML file into a DataFrame. The DataFrame has the following schema:

root
 |-- results: struct (nullable = true)
 |    |-- result: struct (nullable = true)
 |    |    |-- categories: struct (nullable = true)
 |    |    |    |-- category: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- value: string (nullable = true)
If I select the category column (which can occur multiple times under categories), the result for a single record looks like this:

[[result1], [result2]]
I am trying to flatten the result to:

[result1, result2]
When I use the flatten function, I get an error message:

df.select(flatten(col('results.result.categories.category')).alias("Hits_Category"))
 cannot resolve 'flatten(`results`.`result`.`categories`.`category`)' due to data type mismatch: The argument should be an array of arrays, but '`results`.`result`.`categories`.`category`' is of array<struct<value:string>

You are trying to apply flatten to an array of structs, whereas it expects an array of arrays:

flatten(arrayOfArrays) - Transforms an array of arrays into a single array.
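
To make that contract concrete, here is a plain-Python model of what flatten expects and returns. It is only an illustration, not Spark's implementation: the helper name flatten_one_level is hypothetical, lists stand in for Spark arrays, and dicts stand in for structs.

```python
from itertools import chain


def flatten_one_level(array_of_arrays):
    """Plain-Python stand-in for Spark's flatten (hypothetical helper).

    Removes exactly one level of nesting and rejects any element that is
    not itself a list, mirroring the type check in the error above.
    """
    for inner in array_of_arrays:
        if not isinstance(inner, list):
            raise TypeError(
                "expected an array of arrays, got element of type "
                + type(inner).__name__
            )
    return list(chain.from_iterable(array_of_arrays))


# Array of arrays: accepted, one level of nesting is removed.
nested = [["result1"], ["result2"]]
flat = flatten_one_level(nested)

# Array of structs (modelled as dicts): rejected, like the column in the question.
structs = [{"value": "result1"}, {"value": "result2"}]
try:
    flatten_one_level(structs)
    rejected = False
except TypeError:
    rejected = True

# Turning each struct into an array of its field values makes it flattenable.
converted = flatten_one_level([list(s.values()) for s in structs])
```

The last line is the same idea as the fix: convert every struct element into an array first, and only then flatten.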

You don't need a UDF. Just transform the array elements from structs into arrays, then use flatten.

Something like this:

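from pyspark.sql.functions import col, expr, flatten

# Wrap each struct's fields in an array, then flatten the resulting array of arrays.
df.select(col('results.result.categories.category').alias("result_categories"))\
  .withColumn("result_categories", expr("transform(result_categories, x -> array(x.*))"))\
  .select(flatten(col("result_categories")).alias("Hits_Category"))\
  .show()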