Prioritized joining of PySpark DataFrames in Python
Suppose I have two PySpark DataFrames:
df1

| A   | B  |
| --- | -- |
| foo | B1 |
| bar | B2 |
| baz | B3 |
| lol | B9 |
df2

| X   | Y  | Z       |
| --- | -- | ------- |
| bar | B1 | Cool    |
| foo | B2 | Awesome |
| val | B3 | Superb  |
| bar | B4 | Nice    |
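For reference, a minimal sketch that reproduces these two DataFrames (assuming an existing SparkSession named `spark`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example data as shown in the tables above
df1 = spark.createDataFrame(
    [('foo', 'B1'), ('bar', 'B2'), ('baz', 'B3'), ('lol', 'B9')],
    ['A', 'B'],
)
df2 = spark.createDataFrame(
    [('bar', 'B1', 'Cool'), ('foo', 'B2', 'Awesome'),
     ('val', 'B3', 'Superb'), ('bar', 'B4', 'Nice')],
    ['X', 'Y', 'Z'],
)
```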
How can I join these DataFrames into df3, such that df1["A"] is preferentially joined against df2["X"], taking the value from df2["Z"]:

df3

| A   | B  | Z       |
| --- | -- | ------- |
| foo | B1 | Awesome |
| bar | B2 | Cool    |
| bar | B2 | Nice    |
| baz | B3 | null    |
| lol | B9 | null    |

and wherever df3["Z"] is null, the null is filled with the value produced by joining df1["B"] against df2["Y"] and taking df2["Z"], so that I end up with df4 instead of df3 (note the null values in df3):

df4

| A   | B  | Z       |
| --- | -- | ------- |
| foo | B1 | Awesome |
| bar | B2 | Cool    |
| bar | B2 | Nice    |
| baz | B3 | Superb  |
| lol | B9 | null    |
My non-simplified real-world example has many duplicates, many more columns, and so on, so I can't tell whether a simple when/otherwise statement would be enough (or I'm completely lost...). Any suggestions?

You can try doing two joins:
import pyspark.sql.functions as F

df4 = df1.join(
    # Preferred join: match df1.A against df2.X
    df2,
    df1['A'] == df2['X'],
    'left'
).select(
    'A', 'B', 'Z'
).alias('df3').join(
    # Fallback join: match B against Y, but only for rows where Z is still null
    df2.alias('df2'),
    F.expr('df3.B = df2.Y and df3.Z is null'),
    'left'
).select(
    # Take Z from the preferred join when present, otherwise from the fallback
    'A', 'B', F.coalesce('df3.z', 'df2.z').alias('z')
)
df4.show()
+---+---+-------+
| A| B| z|
+---+---+-------+
|foo| B1|Awesome|
|bar| B2| Nice|
|bar| B2| Cool|
|baz| B3| Superb|
|lol| B9| null|
+---+---+-------+
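As an aside, the final coalesce plays the role of the when/otherwise the question wonders about; a minimal sketch of that variant, under the same aliases as above:

```python
import pyspark.sql.functions as F

# Same two-join plan, with the null fill spelled as when/otherwise
# instead of coalesce (functionally equivalent here).
df4 = df1.join(
    df2,
    df1['A'] == df2['X'],
    'left'
).select(
    'A', 'B', 'Z'
).alias('df3').join(
    df2.alias('df2'),
    F.expr('df3.B = df2.Y and df3.Z is null'),
    'left'
).select(
    'A', 'B',
    F.when(F.col('df3.z').isNull(), F.col('df2.z'))
     .otherwise(F.col('df3.z'))
     .alias('z'),
)
```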
Or, if you only want to do one join:
df4 = df1.join(
    df2,
    # Join on either condition at once
    (df1['A'] == df2['X']) | (df1['B'] == df2['Y']),
    'left'
).selectExpr(
    '*',
    # flag is true for an (A, B) group if any of its rows matched
    # on the preferred condition A = X
    'max(A = X) over(partition by A, B) as flag'
).filter(
    # keep preferred matches; keep fallback or unmatched rows only when
    # the group has no preferred match (flag is false or null)
    '(flag and A = X) or not flag or flag is null'
).select(
    'A', 'B', 'Z'
)
df4.show()
+---+---+-------+
| A| B| Z|
+---+---+-------+
|bar| B2| Cool|
|bar| B2| Nice|
|foo| B1|Awesome|
|lol| B9| null|
|baz| B3| Superb|
+---+---+-------+
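To see what the window flag is doing, it can help to inspect the intermediate result before the filter (same setup as above):

```python
# The (bar, B2) group also picks up a fallback row via B = Y here;
# since the group already has preferred A = X matches, flag is true
# and the filter drops that fallback row.
df1.join(
    df2,
    (df1['A'] == df2['X']) | (df1['B'] == df2['Y']),
    'left'
).selectExpr(
    '*',
    'max(A = X) over(partition by A, B) as flag'
).show()
```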