Apache Spark: sorting values in PySpark

apache-spark sorting pyspark mapreduce

I have the following data:

[(u'ab', u'cd'),
 (u'ef', u'gh'),
 (u'cd', u'ab'),
 (u'ab', u'gh'),
 (u'ab', u'cd')]

I want to run a MapReduce job over this data and count how often each pair occurs.

This is what I get:

[((u'ab', u'cd'), 2),
 ((u'cd', u'ab'), 1),
 ((u'ab', u'gh'), 1),
 ((u'ef', u'gh'), 1)]

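As you can see, this is not the desired result: (u'ab', u'cd') should have a count of 3 rather than 2, because (u'cd', u'ab') is the same pair.

My question is: how can I make the program count (u'cd', u'ab') and (u'ab', u'cd') as the same pair? I was thinking of sorting the values in each row, but I could not find a solution.

You can key each record by its sorted elements and then count by key: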

result = rdd.keyBy(lambda x: tuple(sorted(x))).countByKey()

print(result)
# defaultdict(<class 'int'>, {('ab', 'cd'): 3, ('ef', 'gh'): 1, ('ab', 'gh'): 1})
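
To see why sorting works without starting Spark, here is a minimal local sketch of the same idea in plain Python. It is only a stand-in for keyBy(...).countByKey(), not the PySpark API; the helper name canonical is hypothetical and chosen for illustration.

```python
from collections import Counter

# Sample data from the question (u'' prefixes dropped; Python 3 strings are Unicode).
pairs = [("ab", "cd"), ("ef", "gh"), ("cd", "ab"), ("ab", "gh"), ("ab", "cd")]

def canonical(pair):
    """Hypothetical helper: return the pair in sorted order so (a, b) and (b, a) match."""
    return tuple(sorted(pair))

# Local counterpart of keyBy(...).countByKey(): count each canonical key.
counts = Counter(canonical(p) for p in pairs)
print(dict(counts))
# {('ab', 'cd'): 3, ('ef', 'gh'): 1, ('ab', 'gh'): 1}
```

Sorting gives every unordered pair a single canonical form, so ('cd', 'ab') and ('ab', 'cd') end up under the same key.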

You can sort the values and then use reduceByKey to count the pairs:

rdd1 = rdd.map(lambda x: (tuple(sorted(x)), 1))\
    .reduceByKey(lambda a, b: a + b)

rdd1.collect()
# [(('ab', 'gh'), 1), (('ef', 'gh'), 1), (('ab', 'cd'), 3)]
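
The two stages above (map to (key, 1), then merge values per key) can be sketched locally as follows. This is an illustration under assumptions: reduce_by_key is a hypothetical plain-Python helper that mimics what reduceByKey computes on a single machine, and it is not part of PySpark.

```python
def reduce_by_key(records, func):
    """Hypothetical local stand-in for reduceByKey: merge values sharing a key with func."""
    merged = {}
    for key, value in records:
        # First value for a key is stored as-is; later values are folded in with func.
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

pairs = [("ab", "cd"), ("ef", "gh"), ("cd", "ab"), ("ab", "gh"), ("ab", "cd")]

# Map stage: emit (sorted pair, 1) for every record.
mapped = [(tuple(sorted(p)), 1) for p in pairs]

# Reduce stage: add up the 1s for each sorted pair.
result = reduce_by_key(mapped, lambda a, b: a + b)
print(result)
# [(('ab', 'cd'), 3), (('ef', 'gh'), 1), (('ab', 'gh'), 1)]
```

In Spark itself the order of the collected result is not guaranteed. Also note the difference between the two answers: countByKey returns a dictionary to the driver, whereas reduceByKey yields another RDD that stays distributed until you call collect().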
