Filtering a Spark DataFrame with multiple conditions on multiple columns in PySpark
I want to implement the following SQL condition in PySpark:
SELECT *
FROM table
WHERE NOT ( ID = 1
AND Event = 1
)
AND NOT ( ID = 2
AND Event = 2
)
AND NOT ( ID = 1
AND Event = 0
)
AND NOT ( ID = 2
AND Event = 0
)
What is the cleanest way to do this?

For the DataFrame API, you can use the filter or where function. The equivalent code looks like this:
df.filter(~((df.ID == 1) & (df.Event == 1)) &
          ~((df.ID == 2) & (df.Event == 2)) &
          ~((df.ID == 1) & (df.Event == 0)) &
          ~((df.ID == 2) & (df.Event == 0)))
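To see what this chained predicate does row by row, here is a plain-Python sketch (no Spark required; the sample rows are hypothetical). Note the parentheses around each comparison: in PySpark, `&` and `~` are bitwise operators that bind more tightly than `==`, so each comparison must be parenthesized.

```python
# Sample rows standing in for DataFrame rows (hypothetical data).
rows = [
    {"ID": 1, "Event": 1},  # excluded by the first NOT clause
    {"ID": 2, "Event": 2},  # excluded
    {"ID": 1, "Event": 0},  # excluded
    {"ID": 2, "Event": 0},  # excluded
    {"ID": 1, "Event": 2},  # kept
    {"ID": 3, "Event": 1},  # kept
]

def keep(r):
    # Mirrors the chained ~(...) & ~(...) filter above.
    return (not (r["ID"] == 1 and r["Event"] == 1)
            and not (r["ID"] == 2 and r["Event"] == 2)
            and not (r["ID"] == 1 and r["Event"] == 0)
            and not (r["ID"] == 2 and r["Event"] == 0))

kept = [r for r in rows if keep(r)]
# kept contains only the rows matching none of the four excluded pairs
```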
If you're lazy, you can also copy and paste the SQL filter expression directly into the PySpark filter:
df.filter("""
NOT ( ID = 1
AND Event = 1
)
AND NOT ( ID = 2
AND Event = 2
)
AND NOT ( ID = 1
AND Event = 0
)
AND NOT ( ID = 2
AND Event = 0
)
""")
If you wanted to keep it simple, wouldn't you have to define a UDF to run in your query?

A UDF is not necessary here, and it would only degrade performance: built-in Column expressions are optimized by Spark, while a Python UDF incurs per-row serialization overhead.
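One way to see why no UDF is needed: the four NOT clauses collapse to excluding a small set of (ID, Event) pairs, which is expressible with ordinary built-in operations. A plain-Python sketch of that equivalence (the `EXCLUDED` set and `keep` helper are illustrative names, not PySpark API):

```python
# The four NOT (ID = x AND Event = y) clauses are equivalent to
# rejecting any row whose (ID, Event) pair is in this set.
EXCLUDED = {(1, 1), (2, 2), (1, 0), (2, 0)}

def keep(row):
    # A row survives the filter iff its pair is not excluded.
    return (row["ID"], row["Event"]) not in EXCLUDED
```

In PySpark the same logic stays inside built-in Column expressions (as in the filter calls above), which the optimizer can analyze, whereas a Python UDF is opaque to it.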