
Python: Remove RDD values based on a condition

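I have an RDD like this:

[ (Person 1, [Cat, Dog, Cow]), (Person 2, [Cat]), (Person 3,[Cow, Chicken])]
I have a list of frequent animals:

freq_animals=[Cat, Dog]
For each person in my RDD, I want to remove the values that are not in the list of frequent animals, so the output would be:

[ (Person 1, [Cat, Dog]), (Person 2, [Cat]), (Person 3,[])]
Do you know how I can transform my RDD this way?
Thanks, everyone!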

You can use mapValues with a list comprehension:

rdd = sc.parallelize([("Person 1", ["Cat", "Dog", "Cow"]), ("Person 2", ["Cat"]), ("Person 3", ["Cow", "Chicken"])])

freq_animals = ["Cat", "Dog"]

rdd2 = rdd.mapValues(lambda v: [i for i in v if i in freq_animals])

print(rdd2.collect())
# [('Person 1', ['Cat', 'Dog']), ('Person 2', ['Cat']), ('Person 3', [])]
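
If the list of frequent animals grows large, testing membership against a set may be cheaper than scanning a list for every element. The following is only a sketch of that variation, not part of the original answer: the helper names `make_value_filter` and `keep_frequent` are made up for illustration, and the sample data is held in a plain Python list so the helper can be checked without a Spark installation.

```python
def make_value_filter(allowed):
    """Return a function that keeps only the items found in `allowed`."""
    allowed_set = frozenset(allowed)  # set lookup instead of scanning a list

    def keep_allowed(values):
        # Preserve the original order of the values that survive the filter.
        return [v for v in values if v in allowed_set]

    return keep_allowed


freq_animals = ["Cat", "Dog"]
keep_frequent = make_value_filter(freq_animals)  # hypothetical helper name

# Plain-Python stand-in for the RDD contents from the question.
data = [
    ("Person 1", ["Cat", "Dog", "Cow"]),
    ("Person 2", ["Cat"]),
    ("Person 3", ["Cow", "Chicken"]),
]
result = [(person, keep_frequent(animals)) for person, animals in data]
print(result)

# With the `rdd` built from sc.parallelize(...) as above, the same helper
# could be passed directly to mapValues:
# rdd2 = rdd.mapValues(keep_frequent)
```

The key of each pair is left untouched, which matches what mapValues does: only the value list is rewritten, so a person with no frequent animals keeps an empty list rather than being dropped.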