Python: filter a DataFrame with multiple IDs (PKs) by passing a list of values for each ID column

python sql dataframe pyspark

I am trying to filter a DataFrame that has multiple ID columns by passing the values for each ID in a list.

For example, Df:

location_user
transactiontime (string)
user_id (bigint)
location_id (bigint)
Address1 (string)
Address2 (string)
user_name (string)
loc_name (string)

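In the DataFrame above, user_id and location_id are both ID columns.

Goal: filter the DataFrame on user_id = [42939, 42940] and location_id = [1468, 1469].

I created separate lists as shown below and applied them in df.filter: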

partition_key =['user_id', 'location_id']
filter_cond = ['[42939,42940]', '[1468,1469]']

---> This works for a single partition key:

filter_df=actual_df.filter(~col(partition_key).isin(filter_cond))

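I then tried passing the combination of partition keys as shown below, but it does not work and fails with the error that follows: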

filter_df=actual_df.filter(~col(partition_key).isInCollection(filter_cond))

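ERROR: Failed to overwrite the directory. Please check that the correct parameters are passed. Exception: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace: py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist

Any suggestions would be appreciated.

You can achieve this by zipping the column names with their value lists and building the condition as follows: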

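from pyspark.sql.functions import expr

partition_key = ['id', 'id2']
filter_cond = [[1, 2], [100, 200]]
# Build one "col in (...)" clause per key and join the clauses with AND
cond = ' AND '.join([f'{colname} in {tuple(values)}' for colname, values in zip(partition_key, filter_cond)])
print(cond)

df.filter(expr(cond)).show()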

#id in (1, 2) AND id2 in (100, 200)
#+---+---+
#| id|id2|
#+---+---+
#|  1|100|
#|  1|200|
#|  2|100|
#|  2|200|
#+---+---+

Update for single-element lists

cond = ' AND '.join([f'{colname} in ({",".join(map(str, values))})' for colname, values in zip(partition_key, filter_cond)])
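The single-element failure comes from Python itself: a one-item tuple is formatted with a trailing comma, so `tuple([17954])` renders as `(17954,)`, which is not valid SQL. Below is a minimal sketch of the same string-building idea wrapped in a function. The name `build_in_condition` is hypothetical and not part of PySpark, and the sketch assumes the values are numeric (string values would need quoting).

```python
def build_in_condition(columns, value_lists):
    """Build a SQL filter string such as 'id IN (1, 2) AND id2 IN (100)'.

    Hypothetical helper, not a PySpark API. Assumes numeric values.
    """
    if len(columns) != len(value_lists):
        raise ValueError("columns and value_lists must have the same length")
    clauses = []
    for column, values in zip(columns, value_lists):
        if not values:
            raise ValueError(f"no values given for column {column!r}")
        # Join the values by hand: str(tuple([17954])) would give '(17954,)',
        # and that trailing comma breaks the SQL expression.
        joined = ", ".join(str(v) for v in values)
        clauses.append(f"{column} IN ({joined})")
    return " AND ".join(clauses)


partition_key = ["user_id", "location_id"]
print(build_in_condition(partition_key, [[42939, 42940], [1468, 1469]]))
print(build_in_condition(partition_key, [[17954], [3350]]))
```

The returned string can be passed to `expr` and then to `df.filter`, in the same way as `cond` in the answer.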

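Hi Shubham, thanks for your reply. I tried this, and this time I used the partition keys ['user_id', 'location_id'] with the filter values [[17954], [3350]], where each list has only one value. It failed because of the extra comma in the condition "user_id in (17954,) AND location_id in (3350,)". For lists with multiple values, however, it works fine. Thanks a lot.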