Python: Filtering on date and ID from two DataFrames (PySpark)


I have two DataFrames. DF1:

DF2:

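I want to filter DF2 by the IDs in DF1, keeping only the rows whose Date falls between the minimum and maximum dates given by the Dx_Min_Date and Dx_Max_Date columns. The expected result is: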

+----------+-----------+-----------+
|        ID|  Procedure|       Date|
+----------+-----------+-----------+
|  30794324|         32| 2014-06-21|
|  30794324|         14| 2014-04-25|
+----------+-----------+-----------+
Is there a way to filter one DataFrame based on the columns of another?

Use a non-equi join:

df2.alias('tmp').join(
    df1, 
    (df2.ID == df1.ID) & 
    (df2.Date >= df1.Dx_Min_Date) & 
    (df2.Date <= df1.Dx_Max_Date)
).select('tmp.*').show()
+--------+---------+----------+
|      ID|Procedure|      Date|
+--------+---------+----------+
|30794324|       32|2014-06-21|
|30794324|       14|2014-04-25|
+--------+---------+----------+