Python: filtering a Spark DataFrame on multiple conditions across multiple columns in PySpark


I want to implement the following SQL condition in PySpark:

SELECT *
            FROM   table
            WHERE  NOT ( ID = 1
                         AND Event = 1 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 2 
                       ) 
               AND NOT ( ID = 1 
                         AND Event = 0 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 0 
                       ) 
What is the cleanest way to do this?

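With the DataFrame API, you can use either the filter or the where function; the two are aliases of each other.

The equivalent code looks like this: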

df.filter(~((df.ID == 1) & (df.Event == 1)) & 
          ~((df.ID == 2) & (df.Event == 2)) & 
          ~((df.ID == 1) & (df.Event == 0)) &
          ~((df.ID == 2) & (df.Event == 0)))

If you are feeling lazy, you can copy and paste the SQL filter expression straight into the PySpark filter:

df.filter("""
               NOT ( ID = 1
                         AND Event = 1 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 2 
                       ) 
               AND NOT ( ID = 1 
                         AND Event = 0 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 0 
                       ) 
""")

If you want something simple, I think you would have to define a UDF to run in your query.

A UDF is not necessary here; it would only degrade performance.