Apache Spark: building a method call to dynamically filter a PySpark DataFrame

apache-spark pyspark

I have an ETL pipeline in which I upsert certain partitions of a large fact table (300-400 million rows). For simplicity, here is my DataFrame:

display(delta_df)
id  name   age  salary  last_modified_date  Year  Month
1   John   30   2000    2019-06-01          2019  6      # this should stay.
2   Peter  35   1500    2018-08-02          2018  9      # duplicate record will be removed after union.

Year and Month are my Hive partition columns.

This is my full fact table:

display(fact_df)
   id   name  age  salary last_modified_date  Year  Month
   1   John   30    1000         2019-05-01  2019      6 # this should stay.
   2   Peter  35    1500         2018-08-02  2018      9 # duplicate record.
   3   Gabe   21     800         2015-02-03  2015      2 # this row should be filtered out. 
   4   Oscar  29    2000         2020-05-04  2020      6 # this row should be filtered out. 
   5   Anna   20    1200         2010-11-05  2018      9 # this should stay.

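The problem: before performing the union and row_number to remove duplicates and apply any business logic, I only want to read the partitions that exist in the first DataFrame.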

I know I can do this manually with an isin method call. However, since this is part of an ETL pipeline, I need to make it dynamic.

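# collect() returns Row objects, so pull the plain values out before calling isin
years = [row['Year'] for row in delta_df.select('Year').distinct().collect()]
months = [row['Month'] for row in delta_df.select('Month').distinct().collect()]
fact_df.filter(col('Year').isin(years) & col('Month').isin(months))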

I tried building a dictionary to unpack and pass in, but I don't know how to chain the resulting conditions:

from pyspark.sql.functions import col

[col(k).isin(v) for k,v in {'Year' : [2019,2020], 'Month' : [4,5]}.items()]
out:
[Column<b'(Year IN (2019, 2020))'>, Column<b'(Month IN (4, 5))'>]

Question: how can I safely filter a DataFrame based on dynamic variables that are only available at execution time?

This dataset has Year and Month, but another dataset might have Year, Month, DayofYear and PostalDistrict.

You can use reduce:

from functools import reduce

reduce(lambda a, b: a & b, [col(k).isin(v) for k,v in {'Year' : [2019,2020], 'Month' : [4,5]}.items()])

# or if you want to do it with style...
from operator import and_
reduce(and_, [col(k).isin(v) for k,v in {'Year' : [2019,2020], 'Month' : [4,5]}.items()])

PS: looking at this from another angle, would a semi join work?

fact_df.join(delta_df, ['Year', 'Month'], 'semi')

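Thanks, reduce looks promising, and the semi join looks very clean. Can you explain how semi works?

A semi join keeps the rows of the left table that satisfy the join condition. There is also an anti join, which keeps the rows that do not satisfy it.

Suppose the on clause does not include this table's primary key. Won't that produce a Cartesian product?

No. It differs slightly from an inner join: it does not really return values from the right side, it only checks whether a match exists. For details, see the bottom of this link: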