Top N records with MapReduce in Python
I am new to MapReduce and have a very simple question. I solved the word count problem, and now I want to change it to find the Top N records in a text. I can sort all the words in the text, but I cannot take just the last N values. First I read the text and send each word to the reducer with a 1, and the reducer computes the count for each distinct word. Then I try to sort the words by their number of occurrences. But I cannot get the top N records.
from mrjob.job import MRJob
from mrjob.step import MRStep
from stemming.porter2 import stem


class MRWordCount(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer),
            MRStep(mapper=self.secondmapper,
                   reducer=self.secondreducer)
        ]

    def mapper(self, _, lines):
        words = lines.strip().split()
        for w in words:
            yield stem(w.lower()), 1

    def reducer(self, key, values):
        yield key, (sum(values))

    def secondmapper(self, key, value):
        yield '%04d' % int(value), key

    def secondreducer(self, key, values):
        for v in values:
            yield v, key


if __name__ == '__main__':
    MRWordCount.run()
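To see why the second step above sorts the words but cannot cut the list off at N, it can help to simulate the shuffle in plain Python. This is only a sketch, not mrjob code: `shuffle` is a hypothetical helper that groups pairs by key, which is assumed to be roughly what the framework does between a mapper and a reducer, and the sample counts are made up. Each distinct count becomes its own reducer call, so no single call ever sees the whole list.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


# Hypothetical output of the first step: (word, count) pairs.
word_counts = [("the", 12), ("map", 3), ("reduc", 7), ("job", 3)]

# Same key as secondmapper in the question: the zero-padded count.
mapped = [("%04d" % count, word) for word, count in word_counts]
groups = shuffle(mapped)

# One simulated reducer call per distinct count.
for key, words in groups:
    print(key, words)
```

Because every group holds only the words that share one count, a reducer written this way has no place to decide which N words are the largest overall.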
I solved the problem with the code below:
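from mrjob.job import MRJob
from mrjob.step import MRStep
from stemming.porter2 import stem


class MRWordCount(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer),
            MRStep(reducer=self.secondreducer)
        ]

    def mapper(self, _, lines):
        words = lines.strip().split()
        for w in words:
            # Python 2 only: decode the raw bytes, dropping invalid sequences.
            w = unicode(w, "utf-8", errors="ignore")
            yield stem(w.lower()), 1

    def reducer(self, key, values):
        # Emit every (count, word) pair under the single key None so that
        # one call of the second reducer sees all of them. The count is
        # zero-padded to four digits so that it sorts correctly as a string.
        yield None, ('%04d' % int(sum(values)), key)

    def secondreducer(self, key, values):
        # Sort explicitly instead of relying on the order in which the
        # values arrive, then keep the last five (the largest counts).
        pairs = sorted(values)
        for count, word in pairs[-5:]:
            yield count, word


if __name__ == '__main__':
    MRWordCount.run()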
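As a variation on the final step, the top N can also be picked with `heapq.nlargest` from the standard library. That avoids sorting the whole list and does not depend on the four-digit zero-padded string, which stops sorting correctly once a count goes above 9999. The sketch below is plain Python with made-up data; `top_n` is a hypothetical helper, and using it inside an mrjob reducer would assume the first step emits numeric counts instead of padded strings.

```python
import heapq


def top_n(pairs, n=5):
    """Return the n (count, word) pairs with the largest counts, largest first."""
    return heapq.nlargest(n, pairs)


# Hypothetical (count, word) pairs, as the final reducer might receive them.
pairs = [(12, "the"), (3, "map"), (7, "reduc"), (3, "job"), (10000, "data"), (1, "x")]

top = top_n(pairs, n=3)
print(top)

# The padded-string comparison puts 10000 below 9999, the numeric one does not.
print("%04d" % 10000 < "%04d" % 9999)
```

If fewer than N pairs exist, `heapq.nlargest` simply returns all of them, so no index arithmetic is needed for short inputs.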