How to join two Spark DataFrames and manipulate their shared column?

python pyspark

I have two DataFrames, as follows:

+--+-----------+
|id|some_string|
+--+-----------+
| a|        foo|
| b|        bar|
| c|        egg|
| d|        fog|
+--+-----------+

and this one:

+--+-----------+
|id|some_string|
+--+-----------+
| a|        hoi|
| b|        hei|
| c|        hai|
| e|        hui|
+--+-----------+

I want to join them like this:

+--+-----------+
|id|some_string|
+--+-----------+
| a|     foohoi|
| b|     barhei|
| c|     egghai|
| d|        fog|
| e|        hui|
+--+-----------+

So the some_string column of the first DataFrame is concatenated with the some_string column of the second DataFrame. If I use

df_join = df1.join(df2, on='id', how='outer')

it returns

+--+-----------+-----------+
|id|some_string|some_string|
+--+-----------+-----------+
| a|        foo|        hoi|
| b|        bar|        hei|
| c|        egg|        hai|
| d|        fog|       null|
| e|       null|        hui|
+--+-----------+-----------+

Is there a way to do this?

Since you want to perform an outer join, you can try the following:

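from pyspark.sql.functions import col, concat, lit, when

# Rename the shared column so both sides can be referenced after the join.
left = df1.withColumnRenamed('some_string', 'some_string1')
right = df2.withColumnRenamed('some_string', 'some_string2')

# concat returns null if any input is null, so map null to '' first.
s1 = when(col('some_string1').isNull(), lit('')).otherwise(col('some_string1'))
s2 = when(col('some_string2').isNull(), lit('')).otherwise(col('some_string2'))

df_join = left.join(right, on='id', how='outer').select('id', concat(s1, s2).alias('new_column'))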

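(Note that some_string1 and some_string2 refer to the some_string columns of df1 and df2. I recommend giving them different names rather than the shared name some_string, so that you can reference each of them.)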

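To concatenate the values correctly, you need to use a string-concatenation function. Apart from that, the way you use the outer join is almost correct.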

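You need to check whether either of the two columns is null before concatenating.

Null values do not meet the requirement here, so you need to use a when clause.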