How to concatenate two arrays in a PySpark DataFrame
I have a PySpark DataFrame, for example:
ID | phone | name <array> | age <array>
-------------------------------------------------
12 | 827556 | ['AB','AA'] | ['CC']
-------------------------------------------------
45 | 87346 | null | ['DD']
-------------------------------------------------
56 | 98356 | ['FF'] | null
-------------------------------------------------
34 | 87345 | ['AA','BB'] | ['BB']
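I want to combine the two array columns into one, but I end up with some missing columns. It seems that the concat function
works on strings rather than arrays. The duplicates should also be removed.
Expected result: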
ID | phone | new_column <array>
----------------------------------------
12 | 827556 | ['AB','AA','CC']
----------------------------------------
45 | 87346 | ['DD']
----------------------------------------
56 | 98356 | ['FF']
----------------------------------------
34 | 87345 | ['AA','BB']
----------------------------------------
Note that I am using a Spark version < 2.4.
Thanks.
Would this help:
from pyspark.sql.functions import col, concat
testdata = [(0, ['a','b','d'], ['a2','b2','d2']), (1, ['c'], ['c2']), (2, ['d','e'],['d2','e2'])]
df = spark.createDataFrame(testdata, ['id', 'codes', 'codes2'])
df2 = df.withColumn("new_column",concat(col("codes"), col("codes2")))
After the concatenation, the result is:
+---+---------+------------+--------------------+
| id| codes | codes2 | new_column |
+---+---------+------------+--------------------+
| 0 |[a, b, d]|[a2, b2, d2]|[a, b, d, a2, b2,...|
| 1 |[c] |[c2] |[c, c2] |
| 2 |[d, e] |[d2, e2] |[d, e, d2, e2] |
+---+---------+------------+--------------------+
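Note that concat only accepts array columns from Spark 2.4 onward, and it returns null when any of its inputs is null. On Spark 2.4 or later, a minimal sketch of a null-safe, de-duplicating version could build the expression as a SQL string. The helper name merged_distinct_expr is hypothetical, and the columns are assumed to be of type array<string>.

```python
def merged_distinct_expr(left_col, right_col):
    """Build a Spark SQL expression string (Spark 2.4+ only) that
    concatenates two array<string> columns and removes duplicates.

    Hypothetical helper for illustration; column types are assumed.
    """
    # Replace a null array with an empty one so concat does not return null.
    empty = "cast(array() as array<string>)"
    return (
        "array_distinct(concat("
        f"coalesce({left_col}, {empty}), "
        f"coalesce({right_col}, {empty})))"
    )


# Usage, assuming a DataFrame `df` with array<string> columns `name` and `age`:
#   from pyspark.sql import functions as F
#   df.withColumn("new_column", F.expr(merged_distinct_expr("name", "age")))
```

Both concat on arrays and array_distinct were added in Spark 2.4, so this does not apply to the older version mentioned in the question.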
For Spark < 2.4, we need a UDF to concatenate the arrays. Hope this helps:
from pyspark.sql import functions as F
from pyspark.sql.types import *
df = spark.createDataFrame([('a',['AA','AB'],['BC']),('b',None,['CB']),('c',['AB','BA'],None),('d',['AB','BB'],['BB'])],['c1','c2','c3'])
df.show()
+---+--------+----+
| c1| c2 | c3 |
+---+--------+----+
| a|[AA, AB] |[BC]|
| b| null |[CB]|
| c|[AB, BA] |null|
| d|[AB, BB] |[BB]|
+---+--------+----+
## changing null to empty array
df = df.withColumn('c2',F.coalesce(df.c2,F.array())).withColumn('c3',F.coalesce(df.c3,F.array()))
df.show()
+---+--------+----+
| c1| c2 | c3 |
+---+--------+----+
| a|[AA, AB] |[BC]|
| b| [] |[CB]|
| c|[AB, BA] | [] |
| d|[AB, BB] |[BB]|
+---+--------+----+
## UDF to concat the columns and remove the duplicates
udf1 = F.udf(lambda x,y: list(dict.fromkeys(x+y)), ArrayType(StringType()))
df = df.withColumn('concatd',udf1(df.c2,df.c3))
df.show()
+---+--------+----+------------+
| c1| c2 | c3 | concatd |
+---+--------+----+------------+
| a|[AA, AB] |[BC]|[AA, AB, BC]|
| b| [] |[CB]| [CB] |
| c|[AB, BA] | [] | [AB, BA] |
| d|[AB, BB] |[BB]| [AB, BB] |
+---+--------+----+------------+
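The coalesce step and the UDF above can also be folded into a single null-safe function. The following is a minimal sketch under that assumption; merge_unique and add_merged_column are hypothetical names, and the element type is assumed to be string.

```python
def merge_unique(left, right):
    """Concatenate two lists, treating None as empty, and drop duplicates
    while keeping the first-seen order."""
    combined = (left or []) + (right or [])
    # dict keys keep insertion order (Python 3.7+), so this de-duplicates in order.
    return list(dict.fromkeys(combined))


def add_merged_column(df, left_col, right_col, out_col="new_column"):
    """Hypothetical helper: attach the merged array to `df` as `out_col`."""
    # Imported here so merge_unique can be used and tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    merge_udf = F.udf(merge_unique, ArrayType(StringType()))
    return df.withColumn(out_col, merge_udf(df[left_col], df[right_col]))


# Usage, assuming the DataFrame from the example above:
#   add_merged_column(df, "c2", "c3", "concatd").show()
```

Because null arrays reach a Python UDF as None, handling them inside the function means the original columns do not have to be overwritten with empty arrays first.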
A Spark solution that does not use a UDF (Spark < 2.4) is shown below:
import pyspark.sql.functions as F
testdata = [(0, ['AB','AA'], ['CC']), (1, None, ['DD']), (2, ['FF'] ,None), (3, ['AA','BB'] , ['BB'])]
df = spark.createDataFrame(testdata, ['id', 'name', 'age'])
df.show()
+---+--------+----+
| id| name| age|
+---+--------+----+
| 0|[AB, AA]|[CC]|
| 1| null|[DD]|
| 2| [FF]|null|
| 3|[AA, BB]|[BB]|
+---+--------+----+
df = df.withColumn('name', F.concat_ws(',', 'name'))
df = df.withColumn('age', F.concat_ws(',', 'age'))
df = df.withColumn("new_column",F.concat_ws(',', df.name, df.age))
df = df.withColumn("new_column",F.regexp_replace(df.new_column, "^,", ''))
df = df.withColumn("new_column",F.regexp_replace(df.new_column, ",$", ''))
df.withColumn("new_column",F.split(df.new_column, ",")).show(5, False)
+---+-----+---+------------+
|id |name |age|new_column |
+---+-----+---+------------+
|0 |AB,AA|CC |[AB, AA, CC]|
|1 | |DD |[DD] |
|2 |FF | |[FF] |
|3 |AA,BB|BB |[AA, BB, BB]|
+---+-----+---+------------+
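Two limitations of this string round trip are visible or implied in the output above: duplicates are kept (row 3 yields [AA, BB, BB]), and any element that itself contains the separator would be split apart. The following is a pure-Python model of the join-then-split behaviour, not Spark code; join_then_split is a hypothetical name used only for illustration.

```python
def join_then_split(left, right, sep=","):
    """Pure-Python model of concat_ws followed by split (illustration only)."""
    # concat_ws skips null inputs, so None contributes nothing here.
    parts = [item for arr in (left, right) if arr is not None for item in arr]
    joined = sep.join(parts)
    return joined.split(sep)


# Duplicates survive the round trip.
print(join_then_split(["AA", "BB"], ["BB"]))

# An element containing the separator is broken into two elements.
print(join_then_split(["a,b"], ["c"]))
```

If the data may contain commas, a separator that cannot occur in the values would be needed. Removing duplicates without a UDF needs an extra step on Spark < 2.4.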
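You can also use selectExpr. Note that array(concat_ws(...)) wraps the joined string in a one-element array (the output below shows [AB,AA,CC] without spaces), so wrap the concat_ws result in split(..., ',') instead if separate elements are needed: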
testdata = [(0, ['AB','AA'], ['CC']), (1, None, ['DD']), (2, ['FF'] ,None), (3, ['AA','BB'] , ['BB'])]
df = spark.createDataFrame(testdata, ['id', 'name', 'age'])
>>> df.show()
+---+--------+----+
| id| name| age|
+---+--------+----+
| 0|[AB, AA]|[CC]|
| 1| null|[DD]|
| 2| [FF]|null|
| 3|[AA, BB]|[BB]|
+---+--------+----+
>>> df.selectExpr('''array(concat_ws(',',name,age)) as joined''').show()
+----------+
| joined|
+----------+
|[AB,AA,CC]|
| [DD]|
| [FF]|
|[AA,BB,BB]|
+----------+
In addition, if you want to keep the duplicates, the lambda in the UDF can be changed to
lambda x,y:x+y
You can also try NVL or selectExpr to handle null data in the arrays.