Python 如何使用pyspark和scala查找时间范围内每分钟发生的事件_Python_Scala_Pyspark

Python 如何使用pyspark和scala查找时间范围内每分钟发生的事件

python scala pyspark

Python 如何使用pyspark和scala查找时间范围内每分钟发生的事件,python,scala,pyspark,Python,Scala,Pyspark,在这场比赛中有130分钟，我如何使用tweet id来计算每分钟的tweet数量预期结果： tweet id tweet created minute Game start minute Game end minute 1001 145678 145600 145730 1002 145678 145600 145730 1005

在这场比赛中有130分钟，我如何使用tweet id来计算每分钟的tweet数量

预期结果：

tweet id  tweet created minute  Game start minute  Game end minute         
1001      145678                145600             145730   
1002      145678                145600             145730   
1005      145680                145600             145730   
12278     145687                145600             145730     
765558    145688                145600             145730     
724323    145689                145600             145730     
875857    145688                145600             145730     
79375     145685                145600             145730     
84666     145686                145600             145730     
335556    145687                145600             145730     
29990     145688                145600             145730     
56        145689                145600             145730 
968867    145690                145600             145730     
8452      145691                145600             145730   
1334      145679                145600             145730

假设tweet id是唯一的，并使用Pyspark和原始rdd：

minutes  count of tweets       
1        100 
2        34 
3        56 
4        234 
5        2310 
6        345 
7        56 
8        55 
9        12 
10       245

在scala中重新编写的任何帮助都将非常有用=>

rdd = sc.parallelize([(1001 ,145678, 145600, 145730),
(1002 ,145678, 145600, 145730),
(1005 ,145680, 145600, 145730), 
(12278 ,145687, 145600, 145730), 
(765558 ,145688, 145600, 145730), 
(724323 ,145689, 145600, 145730), 
(875857 ,145688, 145600, 145730), 
(79375 ,145685, 145600, 145730), 
(84666 ,145686, 145600, 145730), 
(335556 ,145687, 145600, 145730), 
(29990 ,145688, 145600, 145730), 
(56 ,145689, 145600, 145730), 
(968867 ,145690, 145600, 145730), 
(8452 ,145691, 145600, 145730), 
(1334 ,145679, 145600, 145730) ])

result_dict = rdd.filter(lambda x: x[2] <= x[1] <= x[3]).map(lambda x: (x[1] - x[2], 0))\
.countByKey()

print "minutes count of tweets"
for i in sorted(result_dict.iteritems()):
    print "{0}\t{1}".format(i[0], i[1])

minutes count of tweets
78  2
79  1
80  1
85  1
86  1
87  2
88  3
89  2
90  1
91  1