Python PySpark: need to join multiple dataframes, i.e. the output of the first join should then be joined with the next dataframe, and so on

For example: I have 5 dataframes (df1, df2, df3, df4, df5). I should be able to join df1 with df2 and store the result in prevdf, then join prevdf with df3, join that result with df4, and so on. With 3 dataframes I can do the joins, but I cannot join with df4.

Any help here would be much appreciated. I tried my own approach, using a udf; just handle the column names properly:
if i >= length and i > 2:
    j = j + 1
    print(i)
    print(j)
    # Build the names of the dataframes to be joined in this pass
    first = "newdf{}".format(j)
    second = "newdf{}".format(j + 1)
    third = "newdf{}".format(j + 2)
    print(first)
    print(second)
    print(third)
    # Join the next pair and keep the result in prevDf
    b = "prevDf = GenericFunctions.enhanced_customer({}, {}, 'ENT_CUST_ID')".format(second, first)
    print(b)
    exec(b)
    prevDf.show(i)
    # Join the accumulated result with the next dataframe
    c = "Finaldf = GenericFunctions.enhanced_customer(prevDf, {}, 'ENT_CUST_ID')".format(third)
    exec(c)
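As an aside, the string-building plus exec() pattern above can be avoided entirely by keeping the numbered dataframes in a dict and indexing it. A minimal sketch, with a hypothetical stand-in for GenericFunctions.enhanced_customer (any two-argument join function would slot in the same way):

```python
# Hypothetical stand-in for GenericFunctions.enhanced_customer: it only
# records the join order so the chaining is visible.
def enhanced_customer(left, right, key):
    return "({} JOIN {} ON {})".format(left, right, key)

# Keep the numbered frames in a dict keyed by index, instead of building
# variable names with .format() and exec().
newdfs = {1: "newdf1", 2: "newdf2", 3: "newdf3"}

j = 1
prevDf = enhanced_customer(newdfs[j + 1], newdfs[j], 'ENT_CUST_ID')
Finaldf = enhanced_customer(prevDf, newdfs[j + 2], 'ENT_CUST_ID')
```

With real dataframes the dict values would be the DataFrame objects themselves, and no exec() is needed.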
Why does your code have so many blank lines? — No, that was a commented-out section I removed before posting; my mistake, I should have formatted it more properly. Thanks Suresh. Can you suggest a way to eliminate the duplicate columns when joining? — You can just select all the columns you need from prevdf.
>>> l = [(1,2,3,4),(3,4,5,6)]
>>> df = spark.createDataFrame(l,['col1','col2','col3','col4'])
>>> df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 2| 3| 4|
| 3| 4| 5| 6|
+----+----+----+----+
>>> l = [(1,7,8,9),(3,9,5,7)]
>>> df1 = spark.createDataFrame(l,['col1','col2','col3','col4'])
>>> df1.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 7| 8| 9|
| 3| 9| 5| 7|
+----+----+----+----+
>>> l = [(1,89,45,67),(3,23,34,56)]
>>> df2 = spark.createDataFrame(l,['col1','col2','col3','col4'])
>>> df2.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 89| 45| 67|
| 3| 23| 34| 56|
+----+----+----+----+
>>> l = [(3,65,21,32),(1,87,64,35)]
>>> df3 = spark.createDataFrame(l,['col1','col2','col3','col4'])
>>> df3.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 3| 65| 21| 32|
| 1| 87| 64| 35|
+----+----+----+----+
>>> l = [(1,99,101,345),(3,67,53,21)]
>>> df4 = spark.createDataFrame(l,['col1','col2','col3','col4'])
>>> df4.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 99| 101| 345|
| 3| 67| 53| 21|
+----+----+----+----+
>>> def join_udf(df0,*df):
... for id,d in enumerate(df):
... if id == 0:
... prevdf = df0.join(d,'col1')
... else:
... prevdf = prevdf.join(d,'col1')
... return prevdf
...
>>> jdf = join_udf(df,df1,df2,df3,df4)
>>> jdf.show()
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|col1|col2|col3|col4|col2|col3|col4|col2|col3|col4|col2|col3|col4|col2|col3|col4|
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 1| 2| 3| 4| 7| 8| 9| 89| 45| 67| 87| 64| 35| 99| 101| 345|
| 3| 4| 5| 6| 9| 5| 7| 23| 34| 56| 65| 21| 32| 67| 53| 21|
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+