Python: How do I create a pair RDD from elements that share a key in the source RDD? (python, apache-spark, pyspark, spark-graphx)

I have a key-value RDD in PySpark and want to return an RDD of pairs of values that share the same key in the source RDD.

#input rdd of id and user
rdd1 = sc.parallelize([(1, "user1"), (1, "user2"), (2, "user1"), (2, "user3"), (3,"user2"), (3,"user4"), (3,"user1")])

#desired output
[("user1","user2"),("user1","user3"),("user1","user4"),("user2","user4")]

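So far I have not been able to find the right combination of functions to achieve this. The goal is to build an edge list of users based on the keys they share.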

As far as I understand your description, something like this should work:

output = (rdd1
   .groupByKey()
   .mapValues(set)
   .flatMap(lambda kvs: [(x, y) for x in kvs[1] for y in kvs[1] if x < y])
   .distinct())

Unfortunately, this is a fairly expensive operation.
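To make the per-key step easier to check, here is a minimal sketch that reproduces the pairing logic locally, without Spark. The helper name pairs_for_key and the dictionary-based grouping are assumptions added for illustration and are not part of the original answer. It uses itertools.combinations on the sorted, de-duplicated values, which yields each unordered pair once instead of filtering a full double loop with x < y.

```python
from itertools import combinations


def pairs_for_key(values):
    """Return each unordered pair of distinct values once, smaller value first.

    Hypothetical helper: equivalent in result to the nested comprehension
    with `if x < y` applied to a set of values.
    """
    return list(combinations(sorted(set(values)), 2))


# Local stand-in for the contents of rdd1 from the question (no Spark needed).
records = [(1, "user1"), (1, "user2"), (2, "user1"), (2, "user3"),
           (3, "user2"), (3, "user4"), (3, "user1")]

# Plain-Python equivalent of groupByKey().
grouped = {}
for key, user in records:
    grouped.setdefault(key, []).append(user)

# Plain-Python equivalent of flatMap(...).distinct(), sorted for stable output.
edges = sorted({pair for users in grouped.values() for pair in pairs_for_key(users)})
print(edges)
# [('user1', 'user2'), ('user1', 'user3'), ('user1', 'user4'), ('user2', 'user4')]
```

In the Spark pipeline the same helper could presumably be used as .flatMap(lambda kv: pairs_for_key(kv[1])) after groupByKey(). The number of pairs per key still grows quadratically with the number of users sharing that key, which is where most of the cost comes from.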

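I think you could try aggregateByKey() and implement some of your own logic to get slightly better performance than groupByKey(). The values would first be merged within each partition before the final merge.

@ChrisChambers Nothing in the example input suggests that there are duplicate values per key, so mapValues(set) is just a precaution. Otherwise it is certainly worth a try.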
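The aggregateByKey() suggestion from the comments might look roughly like the sketch below. The function names add_value, merge_sets, and users_per_key are hypothetical, and the sketch assumes the input is a PySpark RDD of (key, user) pairs. It builds a set per key inside each partition and then merges the partial sets, so it only reduces shuffled data when a key actually has duplicate values. Whether it beats groupByKey() for a given dataset is not something this sketch demonstrates.

```python
from functools import reduce


def add_value(acc, value):
    """Sequence function: fold one value into the per-partition set."""
    acc.add(value)
    return acc


def merge_sets(left, right):
    """Combine function: merge the partial sets of two partitions."""
    left.update(right)
    return left


def users_per_key(rdd):
    """Assumes `rdd` is a PySpark RDD of (key, user) pairs.

    Not called here; it replaces groupByKey().mapValues(set) in the pipeline.
    """
    return rdd.aggregateByKey(set(), add_value, merge_sets)


# Local simulation of two partitions that hold values for the same key.
part_a = reduce(add_value, ["user2", "user4", "user2"], set())
part_b = reduce(add_value, ["user1", "user4"], set())
merged = merge_sets(part_a, part_b)
print(sorted(merged))  # ['user1', 'user2', 'user4']
```

The result of users_per_key(rdd1) can then be passed to the same flatMap(...).distinct() steps as in the answer above.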