Sorting in Hadoop: getting a globally sorted result from MapReduce


I am using the Hadoop streaming JAR to do a word count, and I want to know how to get a globally sorted result. According to the answer to another question here, I found that using a single reducer should give a globally sorted output, but in my run with

numReduceTasks=1

(one reducer) the output is still not sorted.

For example, my input to the mapper is:

File 1: A long time ago in a galaxy far far away

File 2: Another episode of star wars

The result is:

    A 1
    a 1
    star 1
    ago 1
    far 2
    away 1
    time 1
    wars 1
    long 1
    another 1
    in 1
    episode 1
    galaxy 1

But this is not a global sort.

So what does the sorting in shuffle-and-sort actually mean, and what is a global sort?
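As an aside on what one would expect (my assumption about streaming's behavior, not something stated in the question): within a single reducer, the shuffle sorts keys as raw bytes, so every uppercase word should arrive before every lowercase one. A tiny sketch of that ordering:

```python
# Byte-order (ASCII) sorting, as the shuffle applies it to keys:
# all uppercase letters (A-Z) sort before all lowercase letters (a-z).
words = ['a', 'time', 'A', 'far', 'away']
print(sorted(words))  # -> ['A', 'a', 'away', 'far', 'time']
```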

Mapper code:

    #!/usr/bin/env python
    import sys

    # Emit "<word>\t1" for every word on standard input.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print '%s\t%s' % (word, 1)
Reducer code:

    #!/usr/bin/env python
    import sys

    word2count = {}

    # Sum the "<word>\t<count>" pairs emitted by the mappers.
    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            # Skip malformed lines whose count is not an integer.
            continue
        word2count[word] = word2count.get(word, 0) + count

    # Dict iteration order is arbitrary, so this output is not sorted.
    for word in word2count.keys():
        print '%s\t%s' % (word, word2count[word])
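A likely reason the single-reducer output still looks unordered: the reducer iterates `word2count.keys()`, and plain-dict iteration order is arbitrary in Python 2 (and in CPython before 3.7), which throws away whatever sorted order the shuffle delivered. A minimal sketch, using made-up sample counts, of printing the keys in sorted order instead:

```python
# Hypothetical aggregated counts, standing in for word2count above.
word2count = {'far': 2, 'A': 1, 'galaxy': 1, 'a': 1}

# sorted() restores the byte order that plain dict iteration discards.
for word in sorted(word2count):
    print('%s\t%s' % (word, word2count[word]))
```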
I run it with this command:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -input /user/cloudera/input \
        -output /user/cloudera/output_new_0 \
        -mapper /home/cloudera/wordcount_mapper.py \
        -reducer /home/cloudera/wordcount_reducer.py \
        -numReduceTasks 1
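For local debugging (my own sketch, not part of the question), the streaming pipeline can be simulated with `sort` standing in for the shuffle; `awk` plays the mapper and reducer here so the snippet needs no extra files:

```shell
# Local stand-in for the streaming job: map | sort (the shuffle) | reduce.
# The final sort is needed because awk's for-in order is arbitrary,
# just like the Python dict in the reducer above.
printf 'A long time ago\nlong long ago\n' \
  | awk '{for (i = 1; i <= NF; i++) print $i "\t1"}' \
  | sort \
  | awk -F'\t' '{c[$1] += $2} END {for (w in c) print w "\t" c[w]}' \
  | sort
```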

@… I edited my question, but I don't understand what you mean about the driver!