Apache Spark DataFrame: explode a list column

I have an output from a Spark Aggregator, which is a List[Character].

So my dataframe looks like this:

+-----------------------------------------------+
|               value                           |
+-----------------------------------------------+
|[[harry, potter, gryffindor],[ron, weasley ... |
+-----------------------------------------------+
Now I want to transform it into:

+----------------------------------+
| name  | second_name | faculty    |
+----------------------------------+
| harry | potter      | gryffindor |
| ron   | weasley     | gryffindor |
+----------------------------------+

How do I do this correctly?

This can be done with the explode DataFrame function and then selecting the individual elements of the resulting array.

Here is an example:

>>> df = spark.createDataFrame([[[['a','b','c'], ['d','e','f'], ['g','h','i']]]],["col1"])
>>> df.show(20, False)
+---------------------------------------------------------------------+
|col1                                                                 |
+---------------------------------------------------------------------+
|[WrappedArray(a, b, c), WrappedArray(d, e, f), WrappedArray(g, h, i)]|
+---------------------------------------------------------------------+

>>> from pyspark.sql.functions import explode
>>> out_df = df.withColumn("col2", explode(df.col1)).drop('col1')
>>>
>>> out_df.show()
+---------+
|     col2|
+---------+
|[a, b, c]|
|[d, e, f]|
|[g, h, i]|
+---------+

>>> out_df.select(out_df.col2[0].alias('c1'), out_df.col2[1].alias('c2'), out_df.col2[2].alias('c3')).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  g|  h|  i|
+---+---+---+

>>>
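
Applying the same pattern to the data in the question gives the desired columns. This is a minimal sketch; the column name value and the aliases below are assumptions, since the actual schema of the aggregator output is not shown:

>>> from pyspark.sql.functions import explode
>>> df = spark.createDataFrame(
...     [[[['harry', 'potter', 'gryffindor'], ['ron', 'weasley', 'gryffindor']]]],
...     ["value"])
>>> # one row per character, then split the inner array into named columns
>>> characters = df.withColumn("character", explode(df.value)).drop('value')
>>> characters.select(
...     characters.character[0].alias('name'),
...     characters.character[1].alias('second_name'),
...     characters.character[2].alias('faculty')).show()
+-----+-----------+----------+
| name|second_name|   faculty|
+-----+-----------+----------+
|harry|     potter|gryffindor|
|  ron|    weasley|gryffindor|
+-----+-----------+----------+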

Do you want a list of names, then a list of second names, then the faculties, or do you want one row for each value in the list?
I want to turn the list into a table; the updated expected output above should help. Thanks, let me try it!
Can we see the schema of someDF?
Looking at the current scenario, I think you need the explode and split functions.
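
On the schema question: for the toy dataframe used in the answer above (not the asker's someDF, whose schema isn't shown), printSchema confirms the column is an array of string arrays:

>>> df.printSchema()
root
 |-- col1: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)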