Apache Spark: how to sum key-value pairs inside a list in PySpark

I have an RDD that contains key-value pairs inside a list:
rdd = [('12583', [('536370', 3.75), ('536370', 3.75), ('536370', 3.75)]),
       ('17850', [('536365', 2.55), ('536365', 3.39), ('536365', 2.75)]),
       ('13047', [('536367', 1.69), ('536367', 2.1), ('536368', 4.95), ('536368', 4.95), ('536369', 5.95)])]
I need to sum the values per key within each record's list. I tried the following, but it fails because mapValues cannot be called on a list:
newRDD = rdd.groupByKey().map(lambda x : (x[0],list(x[1].mapValues(sum))))
My expected result is:
[('12583', ('536370', 11.25)),
 ('17850', ('536365', 8.39)),
 ('13047', ('536367', 3.79), ('536368', 9.9), ('536368', 10.9))]
You can define an aggregation function for the list using collections.defaultdict:
def agg_list(lst):
    from collections import defaultdict
    agg = defaultdict(lambda: 0)
    for k, v in lst:
        agg[k] += v
    return list(agg.items())
Then map it over the rdd:
rdd.map(lambda x: [x[0]] + agg_list(x[1])).collect()
# [['12583', ('536370', 11.25)],
# ['17850', ('536365', 8.69)],
# ['13047', ('536367', 3.79), ('536369', 5.95), ('536368', 9.9)]]
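Because agg_list is plain Python, you can verify the aggregation locally before running it on a cluster. The sketch below uses an ordinary list standing in for the RDD's contents (no SparkContext needed); on a real RDD you would replace the list comprehension with rdd.map(...).collect() as shown above:

```python
from collections import defaultdict

def agg_list(lst):
    """Sum the values per key within one record's list of (key, value) pairs."""
    agg = defaultdict(float)
    for k, v in lst:
        agg[k] += v
    return list(agg.items())

# Sample data from the question; a plain list stands in for the RDD here.
rdd = [('12583', [('536370', 3.75), ('536370', 3.75), ('536370', 3.75)]),
       ('17850', [('536365', 2.55), ('536365', 3.39), ('536365', 2.75)]),
       ('13047', [('536367', 1.69), ('536367', 2.1), ('536368', 4.95),
                  ('536368', 4.95), ('536369', 5.95)])]

# Same shape of computation as rdd.map(lambda x: [x[0]] + agg_list(x[1])).collect()
result = [[x[0]] + agg_list(x[1]) for x in rdd]
print(result)
```

Note the small floating-point caveat: sums such as 2.55 + 3.39 + 2.75 may print as 8.690000000000001 rather than 8.69, so compare with a tolerance in tests rather than exact equality.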