Dataframe: need to perform a multi-column join on a dataframe using a lookup dataframe

I have two dataframes. The first one looks like this:

+---+---+---+---+---+
| c1| c2| c3| c4| c5|
+---+---+---+---+---+
|  0|  1|  2|  3|  4|
|  5|  6|  7|  8|  9|
+---+---+---+---+---+
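I want to look up every column of df1 against the matching key in df2, and return the corresponding lookup value from df2 for each column.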

Here is the code that generates the two input dataframes:

df1 = sc.parallelize([('0','1','2','3','4',), ('5','6','7','8','9',)]).toDF(['c1','c2','c3','c4','c5'])
df1.show()
df2 = sc.parallelize([('0','A',), ('1','B', ),('2','C', ),('3','D', ),('4','E',),\
                     ('5','F',), ('6','G', ),('7','H', ),('8','I', ),('9','J',)]).toDF(['key','val'])
df2.show()


I want to join the above to produce the following

+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
|  0|  1|  2|  3|  4|  A|  B|  C|  D|  E|
|  5|  6|  7|  8|  9|  F|  G|  H|  I|  J|
+---+---+---+---+---+---+---+---+---+---+
I can get this to work for a single column like this, but I don't know how to extend it to all columns:

df1.join(df2, df1.c1 == df2.key).select('c1','val').show()

+---+---+
| c1|val|
+---+---+
|  0|  A|
|  5|  F|
+---+---+

You can simply chain the joins:

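from pyspark.sql import functions as f

# Alias each copy of df2 so every join condition refers to its own key column
df = df1 \
    .join(df2.alias('l1'), on=f.col('c1') == f.col('l1.key'), how='left') \
    .withColumnRenamed('val', 'lu1') \
    .join(df2.alias('l2'), on=f.col('c2') == f.col('l2.key'), how='left') \
    .withColumnRenamed('val', 'lu2')
# ...and so on for c3 through c5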
You can even do this in a loop, but don't do it with too many columns:

from pyspark.sql import functions as f

df = df1
for i in range(1, 6):
    df = df \
        .join(df2.alias(str(i)), on=f.col('c{}'.format(i)) == f.col("{}.key".format(i)), how='left') \
        .withColumnRenamed('val', 'lu{}'.format(i))

df \
    .select('c1', 'c2', 'c3', 'c4', 'c5', 'lu1', 'lu2', 'lu3', 'lu4', 'lu5') \
    .show()
Output:

+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
|  5|  6|  7|  8|  9|  F|  G|  H|  I|  J|
|  0|  1|  2|  3|  4|  A|  B|  C|  D|  E|
+---+---+---+---+---+---+---+---+---+---+
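
To sanity-check the expected result on a small sample without a Spark environment, the per-column lookup can be modelled in plain Python. This is only a sketch of the intended semantics, not a replacement for the joins: the names `rows`, `lookup_pairs` and `lookup_rows` are made up for illustration, and it assumes every key appears at most once in the lookup table (a real left join would duplicate rows for repeated keys).

```python
# Plain-Python model of the per-column left lookup (no Spark required).
# Sample data copied from the question.
rows = [('0', '1', '2', '3', '4'), ('5', '6', '7', '8', '9')]
lookup_pairs = [('0', 'A'), ('1', 'B'), ('2', 'C'), ('3', 'D'), ('4', 'E'),
                ('5', 'F'), ('6', 'G'), ('7', 'H'), ('8', 'I'), ('9', 'J')]


def lookup_rows(rows, lookup_pairs):
    """Append one looked-up value per original column to every row."""
    # Assumption: keys are unique, so a dict is an adequate stand-in for df2.
    table = dict(lookup_pairs)
    # dict.get returns None for a missing key, mirroring the null
    # that a left join produces when there is no match.
    return [tuple(row) + tuple(table.get(value) for value in row)
            for row in rows]


result = lookup_rows(rows, lookup_pairs)
for row in result:
    print(row)
```

Each printed tuple holds the five original values followed by their five lookups, which corresponds to the c1..c5 and lu1..lu5 columns in the table above.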

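Thanks for your help. The loop code you showed gives me this error, though: NameError: name 'f' is not defined. I also get a u'Detected implicit cartesian product for INNER join between logical plans' error when I try the first method. Can you show me the output of your approach using the data I provided in the question?

I was a bit too quick and had no Spark environment here. Updated the answer.

Sorry Lawrence, neither of your answers works for me. The inner join error is still there, even after you changed the code for method 1.

I opened up an environment and removed quite a few typos XD. It runs now:
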
+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
|  5|  6|  7|  8|  9|  F|  G|  H|  I|  J|
|  0|  1|  2|  3|  4|  A|  B|  C|  D|  E|
+---+---+---+---+---+---+---+---+---+---+