Python: Get items that appear in all RDDs - PySpark

Tags: python, apache-spark, pyspark

I'm new to Spark, and I'm trying to build a final RDD that contains only the items that appear in all of the other RDDs.

My code:

    a = ['rs1','rs2','rs3','rs4','rs5']
    b = ['rs3','rs7','rs10','rs4','rs6']
    c = ['rs10','rs13','rs20','rs16','rs1']
    d = ['rs2', 'rs4', 'rs5', 'rs13', 'rs3']

    a_rdd = spark.parallelize(a)
    b_rdd = spark.parallelize(b)
    c_rdd = spark.parallelize(c)
    d_rdd = spark.parallelize(d)

    rdd = spark.union([a_rdd, b_rdd, c_rdd, d_rdd]).distinct()
Result: ['rs4','rs16','rs5','rs6','rs7','rs20','rs1','rs13','rs10','rs2','rs3']

My expected result is ['rs3','rs4']

Thank you.

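When you say you want an RDD containing the items that appear in all of the RDDs, do you mean the intersection? If so, you should not use union. Note also that the intersection of your RDDs is empty: no element appears in all 4 of them.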
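The claim that nothing appears in all four lists can be checked locally with plain Python sets, without Spark. This is only a quick sanity check on the question's data, not part of the original answer. It also suggests one possible reading of the expected result: it matches the intersection of a, b and d when c is left out.

```python
# The four lists from the question, as plain Python sets (no Spark needed).
a = {'rs1', 'rs2', 'rs3', 'rs4', 'rs5'}
b = {'rs3', 'rs7', 'rs10', 'rs4', 'rs6'}
c = {'rs10', 'rs13', 'rs20', 'rs16', 'rs1'}
d = {'rs2', 'rs4', 'rs5', 'rs13', 'rs3'}

# Items present in all four lists: there are none.
in_all_four = a & b & c & d
print(in_all_four)  # set()

# Leaving out c gives the result the question expected.
in_a_b_d = a & b & d
print(sorted(in_a_b_d))  # ['rs3', 'rs4']
```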

But if you do need the intersection of the RDDs:

    from functools import reduce  # reduce is not a builtin in Python 3

    def intersection(*args):
        # Fold RDD.intersection pairwise over all the RDDs passed in
        return reduce(lambda x, y: x.intersection(y), args)

    a = ['rs1','rs2','rs3','rs4','rs5']
    b = ['rs3','rs7','rs1','rs2','rs6']
    c = ['rs10','rs13','rs2','rs16','rs1']
    d = ['rs2', 'rs4', 'rs1', 'rs13', 'rs3']

    # sc is an existing SparkContext
    a_rdd = sc.parallelize(a)
    b_rdd = sc.parallelize(b)
    c_rdd = sc.parallelize(c)
    d_rdd = sc.parallelize(d)

    intersection(a_rdd, b_rdd, c_rdd, d_rdd).collect()

The output is ['rs1','rs2'] (the order of the elements may vary).

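I suggest you read more of the documentation. Try looking at inner joins.

My bad, I hadn't found the API documentation page. I'll spend more time on it, thanks.

I have one suggestion about the reduce. You can write it like this: reduce(RDD.intersection, args)

Ah yes, that's a more elegant way :)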
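To see why the two forms of reduce from the comments are interchangeable, here is a minimal sketch that uses plain Python sets as a stand-in for RDDs. That substitution is an assumption made so the snippet runs without Spark. Both `set` and `RDD` expose an `intersection` method, so passing the method itself does the same thing as wrapping it in a lambda.

```python
from functools import reduce

# Plain sets stand in for RDDs here (an assumption, so this runs without Spark).
# With PySpark the same call would read: reduce(RDD.intersection, rdds),
# after `from pyspark import RDD`.
sets = [
    {'rs1', 'rs2', 'rs3', 'rs4', 'rs5'},
    {'rs3', 'rs7', 'rs1', 'rs2', 'rs6'},
    {'rs10', 'rs13', 'rs2', 'rs16', 'rs1'},
    {'rs2', 'rs4', 'rs1', 'rs13', 'rs3'},
]

# Form used in the answer: a lambda that calls the bound method.
with_lambda = reduce(lambda x, y: x.intersection(y), sets)

# Form suggested in the comments: pass the method itself.
with_method = reduce(set.intersection, sets)

print(sorted(with_lambda))         # ['rs1', 'rs2']
print(with_lambda == with_method)  # True
```

Either form raises a TypeError when given an empty sequence, since reduce has no initial value to fall back on.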