Append a column to an array in a PySpark DataFrame


arrays apache-spark pyspark


I have a DataFrame with two columns:

| VPN    | UPC             |
+--------+-----------------+
| 1      | [4,2]           |
| 2      | [1,2]           |
| null   | [4,7]           |

I need a result column in which the value of VPN (a string) is appended to the array UPC. The result should look like this:

| result |
+--------+
| [4,2,1]|
| [1,2,2]|
| [4,7,] |

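One option is to use concat together with array. First use array to convert the VPN column to an array type, then use the concat method to concatenate the two array columns: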

# assumes an active SparkSession named spark
df = spark.createDataFrame([(1, [4, 2]), (2, [1, 2]), (None, [4, 7])], ['VPN', 'UPC'])
df.show()
+----+------+
| VPN|   UPC|
+----+------+
|   1|[4, 2]|
|   2|[1, 2]|
|null|[4, 7]|
+----+------+
df.selectExpr('concat(UPC, array(VPN)) as result').show()
+---------+
|   result|
+---------+
|[4, 2, 1]|
|[1, 2, 2]|
|  [4, 7,]|
+---------+

Or, more Pythonically:

from pyspark.sql.functions import array, concat

df.select(concat('UPC', array('VPN')).alias('result')).show()
+---------+
|   result|
+---------+
|[4, 2, 1]|
|[1, 2, 2]|
|  [4, 7,]|
+---------+
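
To see why the row with a null VPN still produces an array instead of a null result, here is a minimal plain-Python sketch of what concat(UPC, array(VPN)) computes for each row. It does not use Spark; model_array and model_concat are hypothetical stand-ins written for illustration, assuming that array wraps its argument (even null) in a one-element array and that concat returns null only when one of its input arrays is itself null.

```python
def model_array(value):
    """Stand-in for Spark's array(): wrap one value (possibly None) in a list."""
    return [value]


def model_concat(*arrays):
    """Stand-in for Spark's concat() on arrays: None if any input array is None."""
    if any(a is None for a in arrays):
        return None
    out = []
    for a in arrays:
        out.extend(a)
    return out


# Same rows as the example DataFrame: (VPN, UPC)
rows = [(1, [4, 2]), (2, [1, 2]), (None, [4, 7])]

# Append VPN to UPC row by row
result = [model_concat(upc, model_array(vpn)) for vpn, upc in rows]
print(result)  # [[4, 2, 1], [1, 2, 2], [4, 7, None]]
```

Under these assumptions the null VPN becomes a null element at the end of the array, which corresponds to the last row of the output above, shown as [4, 7,].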