Python PySpark: need to join multiple dataframes, i.e. the output of the first join should then be joined with the third dataframe, and so on


For example: I have 5 dataframes (df1, df2, df3, df4, df5). I should be able to join df1 with df2 and store the result in prevdf, then join prevdf with df3, join that result with df4, and so on. With 3 dataframes I can do the joins, but I cannot continue the chain with df4.


Any help here would be much appreciated.

I tried my own approach, using a udf. Just handle the column names properly:

    if i >= length and i > 2:
        j = j + 1
        print i
        print j
        # build the variable names of the frames to join in this step
        second = "newdf{}".format(j + 1)
        first = "newdf{}".format(j)
        third = "newdf{}".format(j + 2)
        print first
        print second
        print third
        # join the first two frames by executing a generated statement
        b = "prevDf=GenericFunctions.enhanced_customer({},{},'ENT_CUST_ID')".format(second, first)
        print b
        exec(b)
        prevDf.show(i)

        # join the running result with the next frame
        c = "Finaldf=GenericFunctions.enhanced_customer(prevDf,{},'ENT_CUST_ID')".format(third)
        exec(c)
Why does your code have so many blank lines? — No, those were commented-out sections I removed before posting; my mistake, I should have formatted it properly. — Thanks Suresh. Can you suggest a way to eliminate the duplicate columns when joining? — Just select only the columns you need from prevdf.
>>> l = [(1,2,3,4),(3,4,5,6)]
>>> df = spark.createDataFrame(l,['col1','col2','col3','col4'])
>>> df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   2|   3|   4|
|   3|   4|   5|   6|
+----+----+----+----+

>>> l = [(1,7,8,9),(3,9,5,7)]
>>> df1 = spark.createDataFrame(l,['col1','col2','col3','col4'])
>>> df1.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   7|   8|   9|
|   3|   9|   5|   7|
+----+----+----+----+

>>> l = [(1,89,45,67),(3,23,34,56)]
>>> df2 = spark.createDataFrame(l,['col1','col2','col3','col4'])
>>> df2.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|  89|  45|  67|
|   3|  23|  34|  56|
+----+----+----+----+

>>> l = [(3,65,21,32),(1,87,64,35)]
>>> df3 = spark.createDataFrame(l,['col1','col2','col3','col4'])
>>> df3.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   3|  65|  21|  32|
|   1|  87|  64|  35|
+----+----+----+----+

>>> l = [(1,99,101,345),(3,67,53,21)]
>>> df4 = spark.createDataFrame(l,['col1','col2','col3','col4'])
>>> df4.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|  99| 101| 345|
|   3|  67|  53|  21|
+----+----+----+----+

>>> def join_udf(df0,*df):
...    for id,d in enumerate(df):
...        if id == 0:
...           prevdf = df0.join(d,'col1')
...        else:
...           prevdf = prevdf.join(d,'col1')
...    return prevdf
...
>>> jdf = join_udf(df,df1,df2,df3,df4)
>>> jdf.show()
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|col1|col2|col3|col4|col2|col3|col4|col2|col3|col4|col2|col3|col4|col2|col3|col4|
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|   1|   2|   3|   4|   7|   8|   9|  89|  45|  67|  87|  64|  35|  99| 101| 345|
|   3|   4|   5|   6|   9|   5|   7|  23|  34|  56|  65|  21|  32|  67|  53|  21|
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+