Python: combining multiple dicts into another dict in a PySpark RDD

I have a DataFrame as follows:
from pyspark.sql import SparkSession
sqlContext = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data = [(1,2,0.1,0.3),(1,2,0.1,0.3),(1,3,0.1,0.3),(1,3,0.1,0.3),
(11, 12, 0.1, 0.3),(11,12,0.1,0.3),(11,13,0.1,0.3),(11,13,0.1,0.3)]
trajectory_df = sqlContext.createDataFrame(data, schema=['grid_id','rider_id','lng','lat'])
trajectory_df.show()
+-------+--------+---+---+
|grid_id|rider_id|lng|lat|
+-------+--------+---+---+
| 1| 2|0.1|0.3|
| 1| 2|0.1|0.3|
| 1| 3|0.1|0.3|
| 1| 3|0.1|0.3|
| 11| 12|0.1|0.3|
| 11| 12|0.1|0.3|
| 11| 13|0.1|0.3|
| 11| 13|0.1|0.3|
+-------+--------+---+---+
I want to merge the rows that come from the same grid into a dict, where rider_id is the dict key and the longitude/latitude pairs are the values.
My expected result looks like this:
[(1, {3:[[0.1, 0.3], [0.1, 0.3]],2:[[0.1, 0.3], [0.1, 0.3]]}),
(11,{13:[[0.1, 0.3], [0.1, 0.3]],12:[[0.1, 0.3], [0.1, 0.3]]})]
I can use groupByKey() to group on grid_id:
def trans_point(row):
return ((row.grid_id, row.rider_id), [row.lng, row.lat])
trajectory_df = trajectory_df.rdd.map(trans_point).groupByKey().mapValues(list)
print(trajectory_df.take(10))
[((1, 3), [[0.1, 0.3], [0.1, 0.3]]), ((11, 13), [[0.1, 0.3], [0.1, 0.3]]), ((1, 2), [[0.1, 0.3], [0.1, 0.3]]), ((11, 12), [[0.1, 0.3], [0.1, 0.3]])]
But when I try to merge the dicts, I can't get the result I want:
trajectory_df = trajectory_df.map(lambda x:(x[0][0],{x[0][1]:x[1]})).reduceByKey(lambda x,y:x.update(y))
print(trajectory_df.take(10))
[(1, None), (11, None)]
For certain reasons I want this done at the RDD level. How can I achieve that? Thanks in advance.

`dict.update` works in place and returns None, so your reduce lambda produces None. From the docs: "Update the dictionary with the key/value pairs from other, overwriting existing keys. Return None." You need to write your own reduce function that combines the dicts and returns the merged result.

@Paul: Thank you very much. Could you post that as an answer? I'll accept it.
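The failure mode above can be reproduced in plain Python, without Spark: `dict.update` mutates its receiver and always returns None, which is exactly the value `reduceByKey` then carries forward as the "merged" result.

```python
# Minimal repro of the bug: dict.update mutates in place and returns None,
# so reduceByKey(lambda x, y: x.update(y)) propagates None as the merged value.
x = {2: [[0.1, 0.3], [0.1, 0.3]]}
y = {3: [[0.1, 0.3], [0.1, 0.3]]}

result = x.update(y)   # mutates x; the return value is None
print(result)          # None -- this is what reduceByKey receives
print(x)               # x itself now holds both keys, but that doesn't help
```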
def merge_two_dicts(x, y):
"""From https://stackoverflow.com/a/26853961/5858851"""
z = x.copy() # start with x's keys and values
z.update(y) # modifies z with y's keys and values & returns None
return z
trajectory_df = trajectory_df.map(lambda x:(x[0][0],{x[0][1]:x[1]}))\
.reduceByKey(merge_two_dicts)
print(trajectory_df.collect())
#[(1, {2: [[0.1, 0.3], [0.1, 0.3]], 3: [[0.1, 0.3], [0.1, 0.3]]}),
# (11, {12: [[0.1, 0.3], [0.1, 0.3]], 13: [[0.1, 0.3], [0.1, 0.3]]})]
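As a side note, on Python 3.5+ the helper can be replaced by dict unpacking, `{**x, **y}`, which builds a new dict instead of mutating one. A minimal sketch in plain Python, using `functools.reduce` to simulate what `reduceByKey` does with the values of a single key (the sample dicts are illustrative, not from the original data):

```python
from functools import reduce

# The per-key values that reduceByKey would fold together for grid_id 1
dicts = [{2: [[0.1, 0.3], [0.1, 0.3]]}, {3: [[0.1, 0.3], [0.1, 0.3]]}]

# {**x, **y} returns a fresh merged dict, never None (Python 3.5+)
merged = reduce(lambda x, y: {**x, **y}, dicts)
print(merged)  # {2: [[0.1, 0.3], [0.1, 0.3]], 3: [[0.1, 0.3], [0.1, 0.3]]}
```

In the RDD pipeline this would read `.reduceByKey(lambda x, y: {**x, **y})`, with no named helper needed.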