How can I join dataframes in PySpark based on a condition, keeping only the first row that satisfies the condition, directly in the join?


I have the following PySpark dataframes:

 import pyspark.sql.functions as F
 from pyspark.sql import Window
 import pandas as pd

 d={'identifier':['A','B'],
       'datetime':['15/01/2021 13:01:06','15/01/2021 09:31:25'],
       'Price1': [15.5,14.3]}
 df1=pd.DataFrame(data=d)
 df1=spark.createDataFrame(df1)

 d={'identifier':['A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B','B','B'],
       'datetime':['15/01/2021 09:30:06','15/01/2021 10:15:06','15/01/2021 10:47:55','15/01/2021 11:32:47','15/01/2021 12:22:59','15/01/2021 13:00:54','15/01/2021 13:17:12','15/01/2021 13:55:16','15/01/2021 14:35:14','15/01/2021 15:43:32','15/01/2021 15:52:48','15/01/2021 16:10:10','15/01/2021 16:35:15','15/01/2021 17:33:42','15/01/2021 18:43:25','15/01/2021 19:21:02','15/01/2021 09:30:06','15/01/2021 11:10:06','15/01/2021 11:28:55','15/01/2021 11:32:26','15/01/2021 12:22:59','15/01/2021 13:00:54','15/01/2021 13:17:12','15/01/2021 13:55:16','15/01/2021 14:35:14','15/01/2021 15:43:32','15/01/2021 15:52:48','15/01/2021 16:10:10'],
       'Price2': [15.7,15.2,15.6,15.6,15.9,16,15.8,15.8,15.1,15.25,14.8,15.3,15.7,15.6,15.2,15.1,13.9,14.3,14.5,14.2,14.1,14.5,14.6,14.7,14.3,14.2,14.1,14]}
 df2=pd.DataFrame(data=d)
 df2=spark.createDataFrame(df2)
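Note that datetime above is a day-first string, so lexicographic ordering is only correct within a single day. The snippets below assume it has been converted to a proper timestamp first, e.g.:

    # Parse the day-first strings into real timestamps so that ordering and
    # comparisons behave correctly across days (the sample data is all one day).
    fmt = "dd/MM/yyyy HH:mm:ss"
    df1 = df1.withColumn("datetime", F.to_timestamp("datetime", fmt))
    df2 = df2.withColumn("datetime", F.to_timestamp("datetime", fmt))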
My ultimate goal is, for each row in df1, to find the first row in df2 with the same identifier whose datetime is greater than or equal to the datetime in df1 and whose Price2 is at or below Price1. In short, I am looking for the first price in df2, for the same identifier, that drops to or below the price in df1 after the time given in df1's datetime column. (For example, for identifier A with Price1 = 15.5 at 13:01:06, the first qualifying df2 row is 14:35:14 with Price2 = 15.1.)

I did the following, which gives me what I want, but I run into data skew: the real df1 has 300,000 rows and df2 has 500 million, so I need a more efficient way to do the join. I would like to know whether there is a way to express what I described directly in the join, or at least a more efficient approach, because after the join the combined dataframe becomes practically unusable:

    # Number df2's rows per identifier in time order
    w = Window.partitionBy("identifier").orderBy("datetime")
    df2 = df2.withColumn("rownumber", F.row_number().over(w))

    # Rename df1's join columns so they stay unambiguous after the join
    df1r = (df1.withColumnRenamed("identifier", "identifier1")
               .withColumnRenamed("datetime", "datetime1"))

    joineddf = df2.join(F.broadcast(df1r),
                        (df1r.identifier1 == df2.identifier) &
                        (df1r.datetime1 <= df2.datetime) &
                        (df1r.Price1 >= df2.Price2))

    # Per df1 row, keep only the earliest matching df2 row
    window = Window.partitionBy("identifier1", "datetime1").orderBy("rownumber")
    joineddf = joineddf.withColumn("minrownumber", F.min("rownumber").over(window))
    joineddf = joineddf.filter(F.col("rownumber") == F.col("minrownumber"))
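For reference, on the sample data this should keep exactly one df2 row per df1 row (values worked out by hand from the data above):

    joineddf.select("identifier", "datetime", "Price2").show()
    # Expected, per the sample data:
    #   A -> 2021-01-15 14:35:14, Price2 = 15.1 (first price <= 15.5 after 13:01:06)
    #   B -> 2021-01-15 11:10:06, Price2 = 14.3 (first price <= 14.3 after 09:31:25)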
You might want to try an "as-of join".
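Since plain PySpark DataFrames have no built-in as-of join, one way to get as-of semantics without materializing the range join at all is to union the two frames and carry df1's price forward with a window. Below is a minimal sketch of that idea; it assumes each identifier occurs at most once in df1 (as in the sample data) and that datetime is already a timestamp:

    import pyspark.sql.functions as F
    from pyspark.sql import Window

    # Tag and union the two frames: df1 rows carry Price1, df2 rows carry Price2.
    df1_tagged = df1.select("identifier", "datetime", "Price1",
                            F.lit(None).cast("double").alias("Price2"))
    df2_tagged = df2.select("identifier", "datetime",
                            F.lit(None).cast("double").alias("Price1"), "Price2")
    unioned = df1_tagged.unionByName(df2_tagged)

    # Carry the most recent df1 price forward in time within each identifier.
    # The secondary sort key puts the df1 row first on datetime ties, so a df2
    # row at exactly the same time still qualifies (the >= is inclusive).
    w_fill = (Window.partitionBy("identifier")
              .orderBy(F.col("datetime"), F.col("Price2").asc_nulls_first())
              .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    unioned = unioned.withColumn(
        "ref_price", F.last("Price1", ignorenulls=True).over(w_fill))

    # Keep df2 rows whose price is at or below the carried df1 price, then
    # take the earliest such row per identifier. df2 rows before the df1 time
    # have a null ref_price and are dropped by the comparison.
    matches = unioned.filter(F.col("Price2").isNotNull() &
                             (F.col("Price2") <= F.col("ref_price")))
    w_first = Window.partitionBy("identifier").orderBy("datetime")
    result = (matches.withColumn("rn", F.row_number().over(w_first))
              .filter(F.col("rn") == 1)
              .drop("rn", "Price1"))

This replaces the join (and its row explosion) with a single sort per identifier over the unioned data, so it should scale much better on a 500-million-row df2.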