Python: how do I average one field while grouping by another when using RDDs with pyspark?


I'm tying myself in knots between groupBy, aggregate, reduceByKey, map, and so on. My goal is to compute the average of field 16 (the last field) for each unique value of field 2.

So the output might look something like:

NW  -8
DL  -6
OO  -1
Given an RDD with elements like these:

[u'2002-04-28,NW,19386,DTW,MI,42.21,-83.35,MSP,MN,44.88,-93.22,1220,1252,32,1316,1350,34', u'2012-05-04,OO,20304,LSE,WI,43.87,-91.25,MSP,MN,44.88,-93.22,1130,1126,-4,1220,1219,-1', u'2002-08-18,NW,19386,BDL,CT,41.93,-72.68,MSP,MN,44.88,-93.22,805,804,-1,959,952,-7', u'2004-07-29,NW,19386,BDL,CT,41.93,-72.68,MSP,MN,44.88,-93.22,800,757,-3,951,933,-18', u'2008-07-21,NW,19386,IND,IN,39.71,-86.29,MSP,MN,44.88,-93.22,1143,1140,-3,1228,1222,-6', u'2007-10-29,NW,19386,RST,MN,43.9,-92.5,MSP,MN,44.88,-93.22,1546,1533,-13,1639,1609,-30', u'2012-12-24,DL,19790,BOS,MA,42.36,-71,MSP,MN,44.88,-93.22,1427,1431,4,1648,1635,-13', u'2010-04-22,DL,19790,DTW,MI,42.21,-83.35,MSP,MN,44.88,-93.22,930,927,-3,1028,1008,-20', u'2010-06-01,DL,19790,DTW,MI,42.21,-83.35,MSP,MN,44.88,93.22,835,846,11,930,946,16', u'2003-09-04,NW,19386,BUF,NY,42.94,-78.73,MSP,MN,44.88,-93.22,900,852,-8,1017,955,-22']
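Each element is a single comma-separated string, so pulling out the two fields of interest is just a split and two indexes. A minimal sketch, using one row in the format described:

```python
# One row in the format described: comma-separated fields,
# with the airline code as field 2 and the arrival delay last.
row = u'2002-04-28,NW,19386,DTW,MI,42.21,-83.35,MSP,MN,44.88,-93.22,1220,1252,32,1316,1350,34'

fields = row.split(",")
carrier = fields[1]       # field 2: airline code
delay = int(fields[-1])   # last field: arrival delay in minutes
```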

Here is one solution:

data = [u'2002-04-28,NW,19386,DTW,MI,42.21,-83.35,MSP,MN,44.88,-93.22,1220,1252,32,1316,1350,34', u'2012-05-04,OO,20304,LSE,WI,43.87,-91.25,MSP,MN,44.88,-93.22,1130,1126,-4,1220,1219,-1', u'2002-08-18,NW,19386,BDL,CT,41.93,-72.68,MSP,MN,44.88,-93.22,805,804,-1,959,952,-7', u'2004-07-29,NW,19386,BDL,CT,41.93,-72.68,MSP,MN,44.88,-93.22,800,757,-3,951,933,-18', u'2008-07-21,NW,19386,IND,IN,39.71,-86.29,MSP,MN,44.88,-93.22,1143,1140,-3,1228,1222,-6', u'2007-10-29,NW,19386,RST,MN,43.9,-92.5,MSP,MN,44.88,-93.22,1546,1533,-13,1639,1609,-30', u'2012-12-24,DL,19790,BOS,MA,42.36,-71,MSP,MN,44.88,-93.22,1427,1431,4,1648,1635,-13', u'2010-04-22,DL,19790,DTW,MI,42.21,-83.35,MSP,MN,44.88,-93.22,930,927,-3,1028,1008,-20', u'2010-06-01,DL,19790,DTW,MI,42.21,-83.35,MSP,MN,44.88,93.22,835,846,11,930,946,16', u'2003-09-04,NW,19386,BUF,NY,42.94,-78.73,MSP,MN,44.88,-93.22,900,852,-8,1017,955,-22']
current_rdd = sc.parallelize(data)
rdd = (current_rdd.map(lambda x: x.split(","))
                  .map(lambda x: (x[1], x[-1]))
                  .groupByKey()                                          # group delays by airline code
                  .map(lambda x: (x[0], [int(v) for v in x[1]]))         # convert the ResultIterable to a list of ints
                  .map(lambda x: (x[0], float(sum(x[1])) / len(x[1]))))  # compute the average for each key
# output
rdd.take(10)
# [(u'DL', -5.666666666666667), (u'NW', -8.166666666666666), (u'OO', -1.0)]
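As a sanity check without a Spark session, the same pipeline can be traced in plain Python on the sample rows; the dict below stands in for what groupByKey does across partitions (a sketch for illustration, not a Spark substitute):

```python
# Same sample rows as above, traced without Spark: group delays by
# airline code in a plain dict, then average each list of delays.
data = [
    u'2002-04-28,NW,19386,DTW,MI,42.21,-83.35,MSP,MN,44.88,-93.22,1220,1252,32,1316,1350,34',
    u'2012-05-04,OO,20304,LSE,WI,43.87,-91.25,MSP,MN,44.88,-93.22,1130,1126,-4,1220,1219,-1',
    u'2002-08-18,NW,19386,BDL,CT,41.93,-72.68,MSP,MN,44.88,-93.22,805,804,-1,959,952,-7',
    u'2004-07-29,NW,19386,BDL,CT,41.93,-72.68,MSP,MN,44.88,-93.22,800,757,-3,951,933,-18',
    u'2008-07-21,NW,19386,IND,IN,39.71,-86.29,MSP,MN,44.88,-93.22,1143,1140,-3,1228,1222,-6',
    u'2007-10-29,NW,19386,RST,MN,43.9,-92.5,MSP,MN,44.88,-93.22,1546,1533,-13,1639,1609,-30',
    u'2012-12-24,DL,19790,BOS,MA,42.36,-71,MSP,MN,44.88,-93.22,1427,1431,4,1648,1635,-13',
    u'2010-04-22,DL,19790,DTW,MI,42.21,-83.35,MSP,MN,44.88,-93.22,930,927,-3,1028,1008,-20',
    u'2010-06-01,DL,19790,DTW,MI,42.21,-83.35,MSP,MN,44.88,93.22,835,846,11,930,946,16',
    u'2003-09-04,NW,19386,BUF,NY,42.94,-78.73,MSP,MN,44.88,-93.22,900,852,-8,1017,955,-22',
]

delays = {}
for row in data:
    fields = row.split(",")
    delays.setdefault(fields[1], []).append(int(fields[-1]))  # airline code -> list of delays

averages = {code: sum(vals) / float(len(vals)) for code, vals in delays.items()}
# e.g. NW has delays 34, -7, -18, -6, -30, -22, giving -49/6 ~ -8.17
```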

OK, this is a shot in the dark, since I don't have an environment to try it in (which is unfortunate).

I'm assuming your data is in an RDD whose rows have already been split:

mappedData  = data.map(lambda d: (d[1], int(d[-1]))).cache()  # (NW, 34), (OO, -1), (NW, -7)
groupedData = mappedData.groupByKey().mapValues(len)          # (NW, (34, -7)) -> (NW, 2)
sumData     = mappedData.groupByKey().mapValues(sum)          # (NW, (34, -7)) -> (NW, 27)
sumData.join(groupedData).map(lambda kv: (kv[0], float(kv[1][0]) / kv[1][1]))  # (NW, (27, 2)) -> (NW, 13.5)
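Both approaches walk the grouped values more than once; the usual single-pass alternative on a cluster is to reduce (sum, count) pairs per key, i.e. map each row to (carrier, (delay, 1)) and combine with reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])). A plain-Python sketch of that accumulation (no Spark session assumed, illustrative sample pairs):

```python
# Fold each (carrier, delay) pair into a running (sum, count) accumulator
# per key -- the merge step that reduceByKey would apply across partitions.
pairs = [(u'NW', 34), (u'OO', -1), (u'NW', -7)]

acc = {}
for carrier, delay in pairs:
    total, count = acc.get(carrier, (0, 0))
    acc[carrier] = (total + delay, count + 1)   # merge one value into the pair

averages = {k: float(t) / c for k, (t, c) in acc.items()}
# NW -> (34 - 7) / 2 = 13.5, OO -> -1.0
```

This avoids materializing the full list of values per key, which is why reduceByKey (or aggregateByKey/combineByKey) is generally preferred over groupByKey for aggregations.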


This is airline data, so the second field in each row is the airline code, and the last field is the arrival delay in minutes. What I'm trying to do is get the average delay per airline code. Thanks again, great stuff, I owe you! The solution isn't optimized so far, but I don't mind for Python code since it's easy to understand! Exactly: simple before optimized. And I'm partial to Python.