How do I count the number of times a word sequence appears in a file using MapReduce in Python?
Consider a file containing words separated by spaces. Write a MapReduce program in Python that counts the number of times each 3-word sequence appears in the file.
For example, consider the following file:
one two three seven one two three
three seven one
seven one two
The number of times each 3-word sequence appears in this file is:
"three seven one" 2
"one two three" 2
"seven one two" 2
"two three seven" 1
Code template:
from mrjob.job import MRJob

class MR3Nums(MRJob):
    def mapper(self, _, line):
        pass

    def reducer(self, key, values):
        pass

if __name__ == "__main__":
    MR3Nums.run()
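The mapper is applied to each line and should emit every 3-word sequence in that line together with a count of 1. The reducer is then called with key and values, where key is a 3-word sequence and values is an iterable of counts for that sequence (here, a series of 1s). The reducer can simply yield a tuple of the 3-word sequence and its total number of occurrences, the latter obtained with sum.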
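from mrjob.job import MRJob

class MR3Nums(MRJob):
    def mapper(self, _, line):
        sequence_length = 3
        words = line.strip().split()
        # Slide a window of sequence_length words across the line.
        for i in range(len(words) - sequence_length + 1):
            yield " ".join(words[i:i + sequence_length]), 1

    def reducer(self, key, values):
        # values iterates over the 1s emitted for this sequence.
        yield key, sum(values)

if __name__ == "__main__":
    MR3Nums.run()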
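To check the mapper and reducer logic without installing mrjob or setting up Hadoop, the same steps can be simulated in plain Python. This is only a rough sketch: the names `map_line` and `count_sequences` are hypothetical helpers made up for illustration and are not part of mrjob, and the in-memory grouping merely stands in for the shuffle phase that the framework would normally perform.

```python
from collections import defaultdict

SEQUENCE_LENGTH = 3  # same window size as in the answer


def map_line(line, n=SEQUENCE_LENGTH):
    """Yield (sequence, 1) for every run of n consecutive words in one line."""
    words = line.split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n]), 1


def count_sequences(lines, n=SEQUENCE_LENGTH):
    """Simulate map, shuffle, and reduce in memory (hypothetical helper)."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line, n):
            grouped[key].append(value)  # shuffle: group the 1s by sequence
    # reduce: sum the counts collected for each sequence
    return {key: sum(values) for key, values in grouped.items()}


sample = [
    "one two three seven one two three",
    "three seven one",
    "seven one two",
]
counts = count_sequences(sample)
for sequence, total in sorted(counts.items()):
    print(f'"{sequence}" {total}')
```

Because each line is mapped independently, sequences that span a line break (for example the last two words of one line plus the first word of the next) are not counted, which matches the behaviour of the mrjob version.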
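Thank you for your reply. The code is concise and insightful.
You're welcome. If this solved your problem, please mark it as the accepted answer.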