Python PySpark: multiple filters on a string column
Suppose the table below is a PySpark DataFrame, and I want to filter the column ind on multiple values. How can I do this in PySpark?
ind group people value
John 1 5 100
Ram 1 2 2
John 1 10 80
Tom 2 20 40
Tom 1 7 10
Anil 2 23 30
I tried the following, but it did not work:
filter = ['John', 'Ram']
# fails: "filter" inside the quoted expression is treated as a SQL
# identifier, not substituted with the Python list above
filtered_df = df.filter("ind == filter ")
filtered_df.show()
How can I achieve this in Spark?

You can use:
filter = ['John', 'Ram']
# the list itself is unused here; the values are written directly into the SQL string
filtered_df = df.filter("ind in ('John', 'Ram') ")
filtered_df.show()
Or,
if you want to keep the filter values in a Python list and build the expression from it. Also note that inside a Spark SQL expression string we use a single equals sign = rather than the double == to test equality, as in SQL:
filter = ['John', 'Ram']
# wrap each value in single quotes so the generated SQL reads: ind in ('John', 'Ram')
processed_for_pyspark = ', '.join(["'" + s + "'" for s in filter])
filtered_df = df.filter("ind in ({})".format(processed_for_pyspark))
filtered_df.show()