Apache Spark: how to sum key-value pairs in a list in PySpark


I have an RDD whose values are lists of key-value pairs:

rdd = [('12583', [('536370', 3.75), ('536370', 3.75), ('536370', 3.75)]), 
       ('17850', [('536365', 2.55), ('536365', 3.39), ('536365', 2.75)]), 
       ('13047', [('536367', 1.69), ('536367', 2.1), ('536368', 4.95), ('536368', 4.95), ('536369', 5.95)])]
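
As written this is a plain Python list; to run the RDD operations below it first has to be distributed. A minimal sketch, assuming an existing SparkContext named sc (e.g. from the pyspark shell):

rdd = sc.parallelize([  # `sc` is an assumed, pre-existing SparkContext
    ('12583', [('536370', 3.75), ('536370', 3.75), ('536370', 3.75)]),
    ('17850', [('536365', 2.55), ('536365', 3.39), ('536365', 2.75)]),
    ('13047', [('536367', 1.69), ('536367', 2.1), ('536368', 4.95),
               ('536368', 4.95), ('536369', 5.95)]),
])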
I need to sum the values per key inside each record's list. I tried the following, but it fails because the grouped values are a plain iterable, which has no mapValues method:

newRDD = rdd.groupByKey().map(lambda x: (x[0], list(x[1].mapValues(sum))))  # AttributeError: x[1] has no mapValues
My expected result is:

[('12583', ('536370', 11.25)), 
('17850', ('536365', 8.69)), 
('13047', ('536367', 3.79), ('536368', 9.9), ('536369', 5.95))]

You can define a list-aggregation function using collections.defaultdict:

def agg_list(lst):
    """Sum the values per key in a list of (key, value) pairs."""
    from collections import defaultdict
    agg = defaultdict(float)   # missing keys start at 0.0
    for k, v in lst:
        agg[k] += v
    return list(agg.items())
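
A quick local sanity check of the helper (plain Python, no Spark needed; these values sum exactly in floating point):

agg_list([('536370', 3.75), ('536370', 3.75), ('536370', 3.75)])
# [('536370', 11.25)]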
Then map it over the rdd:

rdd.map(lambda x: [x[0]] + agg_list(x[1])).collect()
# [['12583', ('536370', 11.25)], 
#  ['17850', ('536365', 8.69)], 
#  ['13047', ('536367', 3.79), ('536368', 9.9), ('536369', 5.95)]]
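
An alternative sketch that stays entirely in Spark transformations (again assuming rdd is an actual RDD as above): flatten each record into ((customer, invoice), value) pairs, sum per composite key with reduceByKey, then regroup per customer. Element order within each regrouped list is not guaranteed.

from operator import add

result = (rdd
          .flatMap(lambda rec: [((rec[0], k), v) for k, v in rec[1]])  # ((customer, invoice), value)
          .reduceByKey(add)                                            # sum values per (customer, invoice)
          .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))               # (customer, (invoice, total))
          .groupByKey()                                                # regroup per customer
          .mapValues(list))
result.collect()
# e.g. [('12583', [('536370', 11.25)]),
#       ('17850', [('536365', 8.69)]),
#       ('13047', [('536367', 3.79), ('536368', 9.9), ('536369', 5.95)])]

This variant yields each customer's totals as a list of (invoice, total) tuples rather than the flattened form above, which may be easier to process downstream.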
