Preferential join of PySpark dataframes in Python


Suppose I have two PySpark dataframes:

df1
| A     | B              |
| ----- | -------------- |
| foo   | B1             |
| bar   | B2             |
| baz   | B3             |
| lol   | B9             |

df2
| X      | Y  | Z       |
| ------ | -- | --------|
| bar    | B1 | Cool    |
| foo    | B2 | Awesome |
| val    | B3 | Superb  |
| bar    | B4 | Nice    |
How can I join these dataframes into df3 such that:

  • df1["A"] is joined preferentially with df2["X"], taking the value from df2["Z"], and
  • if any resulting df3["Z"] value is null, that null is filled with the value obtained by joining df1["B"] with df2["Y"] and taking the value from df2["Z"]
  • for example, I want to end up with df4, not df3 (note the null values in df3):


    My non-simplified real-world example has many duplicates, many columns, and so on, so I can't tell whether a simple when/otherwise statement would be enough (or I'm completely lost...). Any suggestions?

    You can try doing two joins:

    import pyspark.sql.functions as F

    # First join on A = X, keeping only the columns needed afterwards.
    df4 = df1.join(
        df2,
        df1['A'] == df2['X'],
        'left'
    ).select(
        'A', 'B', 'Z'
    ).alias('df3').join(
        # Second join on B = Y, attempted only for rows the first join left null.
        df2.alias('df2'),
        F.expr('df3.B = df2.Y and df3.Z is null'),
        'left'
    ).select(
        # Prefer Z from the first join; fall back to Z from the second.
        'A', 'B', F.coalesce('df3.z', 'df2.z').alias('z')
    )
    
    df4.show()
    +---+---+-------+
    |  A|  B|      z|
    +---+---+-------+
    |foo| B1|Awesome|
    |bar| B2|   Nice|
    |bar| B2|   Cool|
    |baz| B3| Superb|
    |lol| B9|   null|
    +---+---+-------+
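To make the intent of the two joins concrete, here is a plain-Python sketch of the same "prefer A = X, fall back to B = Y" logic over the sample rows above (the helper name `priority_join` is made up for illustration; this is not a PySpark API):

```python
# Plain-Python sketch of the two-join-plus-coalesce logic above.
# priority_join is a hypothetical helper, not part of PySpark.

df1 = [("foo", "B1"), ("bar", "B2"), ("baz", "B3"), ("lol", "B9")]
df2 = [("bar", "B1", "Cool"), ("foo", "B2", "Awesome"),
       ("val", "B3", "Superb"), ("bar", "B4", "Nice")]

def priority_join(left, right):
    out = []
    for a, b in left:
        # primary match: A = X (like the first join)
        primary = [z for x, _, z in right if x == a]
        # fallback match: B = Y, used only when the primary found nothing
        # (like the second join guarded by "Z is null")
        fallback = [z for _, y, z in right if y == b]
        matches = primary or fallback
        if matches:
            out.extend((a, b, z) for z in matches)
        else:
            out.append((a, b, None))  # unmatched rows survive, as in a left join
    return out

for row in priority_join(df1, df2):
    print(row)
```

Running this reproduces the five rows in the `df4.show()` output above, including the two `bar` rows and the null for `lol`.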
    
    Or, if you want to do only one join:

    df4 = df1.join(
        df2,
        (df1['A'] == df2['X']) | (df1['B'] == df2['Y']),
        'left'
    ).selectExpr(
        '*',
        # flag: true if the (A, B) group contains any exact A = X match
        'max(A = X) over(partition by A, B) as flag'
    ).filter(
        # keep only the A = X rows when the group has one; otherwise keep all rows
        '(flag and A = X) or not flag or flag is null'
    ).select(
        'A','B','Z'
    )
    
    df4.show()
    +---+---+-------+
    |  A|  B|      Z|
    +---+---+-------+
    |bar| B2|   Cool|
    |bar| B2|   Nice|
    |foo| B1|Awesome|
    |lol| B9|   null|
    |baz| B3| Superb|
    +---+---+-------+
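The window trick may be easier to follow in plain Python: after the single OR-join, the flag records whether each (A, B) group contains any A = X match, and the filter drops the B = Y fallback rows whenever it does (again a sketch of the idea, not PySpark):

```python
from collections import defaultdict

df1 = [("foo", "B1"), ("bar", "B2"), ("baz", "B3"), ("lol", "B9")]
df2 = [("bar", "B1", "Cool"), ("foo", "B2", "Awesome"),
       ("val", "B3", "Superb"), ("bar", "B4", "Nice")]

# single left join on (A = X) OR (B = Y), keeping X so we can test A = X later
joined = []
for a, b in df1:
    hits = [(x, z) for x, y, z in df2 if x == a or y == b]
    if hits:
        joined.extend((a, b, x, z) for x, z in hits)
    else:
        joined.append((a, b, None, None))

# max(A = X) over (partition by A, B): does the group have any A = X row?
flag = defaultdict(bool)
for a, b, x, _ in joined:
    flag[(a, b)] |= (x == a)

# keep only the A = X rows when the group has one, otherwise keep everything
df4 = [(a, b, z) for a, b, x, z in joined
       if (flag[(a, b)] and x == a) or not flag[(a, b)]]
```

The resulting `df4` list contains the same five rows as the `df4.show()` output above.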
    