Python: get the items that appear in all RDDs - PySpark
I'm new to Spark, and I'm trying to build a final RDD containing only the items that appear in all of the other RDDs. My code:
a = ['rs1','rs2','rs3','rs4','rs5']
b = ['rs3','rs7','rs10','rs4','rs6']
c = ['rs10','rs13','rs20','rs16','rs1']
d = ['rs2', 'rs4', 'rs5', 'rs13', 'rs3']
# parallelize/union are methods of the SparkContext (sc), not the SparkSession
a_rdd = sc.parallelize(a)
b_rdd = sc.parallelize(b)
c_rdd = sc.parallelize(c)
d_rdd = sc.parallelize(d)
rdd = sc.union([a_rdd, b_rdd, c_rdd, d_rdd]).distinct()
Result: ['rs4','rs16','rs5','rs6','rs7','rs20','rs1','rs13','rs10','rs2','rs3']
My expected result is ['rs3','rs4']. Thank you.
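A quick way to see what is going on, sketched here with plain Python sets on the same lists as above: `union().distinct()` behaves like a set union, which is why every unique item shows up, while the intersection of all four of these particular lists is actually empty.

```python
a = ['rs1', 'rs2', 'rs3', 'rs4', 'rs5']
b = ['rs3', 'rs7', 'rs10', 'rs4', 'rs6']
c = ['rs10', 'rs13', 'rs20', 'rs16', 'rs1']
d = ['rs2', 'rs4', 'rs5', 'rs13', 'rs3']

# union().distinct() corresponds to a set union: every unique item survives
print(sorted(set(a) | set(b) | set(c) | set(d)))

# no item occurs in all four lists, so the full intersection is empty
print(set(a) & set(b) & set(c) & set(d))  # set()
```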
When you say you want an RDD containing the items that appear in all the RDDs, do you mean the intersection? If so, you should not be using union, and note that the intersection of your RDDs is empty (no element appears in all 4 RDDs). But if you do need to intersect RDDs:
from functools import reduce  # reduce is in functools in Python 3

def intersection(*args):
    return reduce(lambda x, y: x.intersection(y), args)
a = ['rs1','rs2','rs3','rs4','rs5']
b = ['rs3','rs7','rs1','rs2','rs6']
c = ['rs10','rs13','rs2','rs16','rs1']
d = ['rs2', 'rs4', 'rs1', 'rs13', 'rs3']
a_rdd = sc.parallelize(a)
b_rdd = sc.parallelize(b)
c_rdd = sc.parallelize(c)
d_rdd = sc.parallelize(d)
rdd = sc.union([a_rdd, b_rdd, c_rdd, d_rdd]).distinct()
intersection(a_rdd, b_rdd, c_rdd, d_rdd).collect()
The output is ['rs1', 'rs2'].

Comments:
- I'd suggest reading more of the documentation. Try looking into inner joins.
- My bad, I hadn't found the API documentation page; I'll spend more time on it. Thanks.
- One suggestion about the reduce: you can write it as reduce(RDD.intersection, args).
- Ah yes, that's a more elegant way to do it :)
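The `reduce(RDD.intersection, args)` form from the comment works because in Python 3 an unbound method called as `Class.method(x, y)` is just `x.method(y)`. A minimal sketch of the same pattern with plain Python sets (standing in for RDDs, using the answer's data):

```python
from functools import reduce

def intersect_all(*collections):
    # fold the intersections left to right, so only items present
    # in every collection survive; set.intersection plays the same
    # role RDD.intersection plays in the comment's suggestion
    return reduce(set.intersection, (set(c) for c in collections))

a = ['rs1', 'rs2', 'rs3', 'rs4', 'rs5']
b = ['rs3', 'rs7', 'rs1', 'rs2', 'rs6']
c = ['rs10', 'rs13', 'rs2', 'rs16', 'rs1']
d = ['rs2', 'rs4', 'rs1', 'rs13', 'rs3']

print(sorted(intersect_all(a, b, c, d)))  # ['rs1', 'rs2']
```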