Python: Filtering on dates and IDs from two DataFrames in PySpark
I have two DataFrames, DF1 and DF2. I want to filter DF2 down to the rows whose ID appears in DF1 and whose Date falls between the minimum and maximum dates given by DF1's Dx_Min_Date and Dx_Max_Date columns. The expected result:
+----------+-----------+-----------+
| ID| Procedure| Date|
| 30794324| 32| 2014-06-21|
| 30794324| 14| 2014-04-25|
+----------+-----------+-----------+
Is there a way to filter one DataFrame based on the columns of another?

Use a non-equi join:
df2.alias('tmp').join(
    df1,
    (df2.ID == df1.ID) &
    (df2.Date >= df1.Dx_Min_Date) &
    (df2.Date <= df1.Dx_Max_Date)
).select('tmp.*').show()
+--------+---------+----------+
| ID|Procedure| Date|
+--------+---------+----------+
|30794324| 32|2014-06-21|
|30794324| 14|2014-04-25|
+--------+---------+----------+
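
To make the join condition concrete, here is a minimal plain-Python sketch of the same predicate: a row of DF2 is kept when its ID matches a row of DF1 and its Date lies inside that row's window, bounds included. The contents of DF1 and DF2 are not shown in the question, so the window dates and the two rejected rows below are hypothetical, as is the helper name `filter_by_id_and_window`.

```python
from datetime import date

# Hypothetical stand-in for DF1: one date window per ID (dates are assumed).
df1_rows = [
    {"ID": 30794324, "Dx_Min_Date": date(2014, 4, 1), "Dx_Max_Date": date(2014, 6, 30)},
]

# Hypothetical stand-in for DF2. The first two rows match the expected result
# in the question; the last two are invented to show what gets filtered out.
df2_rows = [
    {"ID": 30794324, "Procedure": 32, "Date": date(2014, 6, 21)},
    {"ID": 30794324, "Procedure": 14, "Date": date(2014, 4, 25)},
    {"ID": 30794324, "Procedure": 99, "Date": date(2015, 1, 1)},  # same ID, outside the window
    {"ID": 11111111, "Procedure": 32, "Date": date(2014, 5, 1)},  # ID not present in DF1
]


def filter_by_id_and_window(rows, windows):
    """Keep each row that matches at least one window on ID and inclusive date range."""
    kept = []
    for row in rows:
        if any(
            row["ID"] == w["ID"]
            and w["Dx_Min_Date"] <= row["Date"] <= w["Dx_Max_Date"]
            for w in windows
        ):
            kept.append(row)
    return kept


result = filter_by_id_and_window(df2_rows, df1_rows)
for row in result:
    print(row["ID"], row["Procedure"], row["Date"])
```

One difference worth noting: this sketch keeps each DF2 row at most once, while the inner join above emits a row once per matching DF1 row. If DF1 can hold several overlapping windows for the same ID, passing a semi-join type such as 'left_semi' as the third argument to join is one way to avoid duplicated rows.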