
How to unpack a list-type column in PySpark

python apache-spark pyspark


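I have a DataFrame in PySpark, df, with a column of type array of strings. I need to generate one new column containing the head of the list, and another column containing the remaining elements (the tail) concatenated together.

This is my original DataFrame: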

pyspark> df.show()
+---+------------+
| id|     lst_col|
+---+------------+
|  1|[a, b, c, d]|
+---+------------+


pyspark> df.printSchema()
root
 |-- id: integer (nullable = false)
 |-- lst_col: array (nullable = true)
 |    |-- element: string (containsNull = true)

I need to generate something like this:

pyspark> df2.show()
+---+--------+---------------+
| id|lst_head|lst_concat_tail|
+---+--------+---------------+
|  1|       a|          b,c,d|
+---+--------+---------------+

For Spark 2.4+, you can use the array functions element_at, slice, and size:

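from pyspark.sql.functions import element_at, expr

# element_at is 1-based; slice(arr, start, length) keeps everything from index 2 onward
df.select("id",
          element_at("lst_col", 1).alias("lst_head"),
          expr("slice(lst_col, 2, size(lst_col))").alias("lst_concat_tail")
         ).show()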

This gives:

+---+--------+---------------+
| id|lst_head|lst_concat_tail|
+---+--------+---------------+
|  1|       a|      [b, c, d]|
+---+--------+---------------+
