Apache Spark: PySpark DataFrame filter or include based on a list

I am trying to filter a DataFrame in PySpark using a list. I want to either filter out rows based on the list, or include only the rows whose value appears in the list. My code below does not work:

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10,18,20]

# filter out records whose score is in list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records whose score is in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
This gives the following error:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

What it is saying is that "df.score in l" cannot be evaluated, because df.score gives you a Column, and "in" is not defined on the Column type. Use "isin" instead.

The code should look like this:

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10,18,20]

# filter out records whose score is in list l
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records whose score is in list l
records = df.filter(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)

Note that the two are interchangeable: where() is an alias for filter().
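
As a quick sketch of that equivalence (reusing the df and l defined above), note that filter() also accepts a SQL expression string:

# the same membership test, spelled three ways
records = df.filter(df.score.isin(l))
records = df.where(df.score.isin(l))            # where() is an alias for filter()
records = df.filter("score IN (10, 18, 20)")    # inline SQL predicate, same result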

I found that for large DataFrames, the join implementation is significantly faster than where:

from pyspark.sql import SparkSession

def filter_spark_dataframe_by_list(df, column_name, filter_list):
    """ Returns subset of df where df[column_name] is in filter_list """
    spark = SparkSession.builder.getOrCreate()
    # a list of plain values plus a single DataType yields a one-column DataFrame named "value"
    filter_df = spark.createDataFrame(filter_list, df.schema[column_name].dataType)
    return df.join(filter_df, df[column_name] == filter_df["value"])
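
For example, a usage sketch with the df and l defined above (the inner join keeps the helper "value" column, which can be dropped afterwards):

records = filter_spark_dataframe_by_list(df, "score", l).drop("value")
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)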

Based on @user3133475's answer, it is also possible to call the isin() method from F.col(), like this:

import pyspark.sql.functions as F

l = [10,18,20]
df.filter(F.col("score").isin(l))
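
The exclusion case from earlier works the same way with F.col():

# exclude the listed scores via the negated predicate
records = df.filter(~F.col("score").isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)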

How can I do this with a broadcast variable as the list instead of a regular Python list? When I try that, I get a "'Broadcast' object has no attribute '_get_object_id'" error. @flyingmeatball I think you can broadcast the list and access it with variable_name.value. If you want to use a broadcast variable, this is the way:

l_bc = sc.broadcast(l)

followed by

df.where(df.score.isin(l_bc.value))
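
Put together, a minimal sketch (assuming the sc, df, and l from above; isin(l_bc.value) reads the broadcast list back on the driver):

# broadcast the list, then unwrap it with .value when building the predicate
l_bc = sc.broadcast(l)
records = df.where(df.score.isin(l_bc.value))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)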