Sorting in Hadoop: getting a globally sorted result from MapReduce


I am using the Hadoop streaming JAR to do a word count, and I want to know how to get a globally sorted result. According to the answer to another question here, I found that using a single reducer should give a globally sorted output, but in my run with

numReduceTasks=1

(one reducer) the output is still not sorted.

For example, my input to the mapper is:

File 1: A long time ago in a galaxy far far away

File 2: Another episode of star wars

The result is:

    A 1
    a 1
    star 1
    ago 1
    far 2
    away 1
    time 1
    wars 1
    long 1
    another 1
    in 1
    episode 1
    galaxy 1

But this is not a global sort.

So what does the sorting in shuffle-and-sort actually mean, and what is a global sort?
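As an aside on what one would expect (my assumption about streaming's behavior, not something stated in the question): within a single reducer, the shuffle sorts keys as raw bytes, so every uppercase word should arrive before every lowercase one. A tiny sketch of that ordering:

```python
# Byte-order (ASCII) sorting, as the shuffle applies it to keys:
# all uppercase letters (A-Z) sort before all lowercase letters (a-z).
words = ['a', 'time', 'A', 'far', 'away']
print(sorted(words))  # -> ['A', 'a', 'away', 'far', 'time']
```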

Mapper code:

    #!/usr/bin/env python
    import sys

    # Emit "<word>\t1" for every word on standard input.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print '%s\t%s' % (word, 1)
Reducer code:

    #!/usr/bin/env python
    import sys

    word2count = {}

    # Sum the "<word>\t<count>" pairs emitted by the mappers.
    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            # Skip malformed lines whose count is not an integer.
            continue
        word2count[word] = word2count.get(word, 0) + count

    # Dict iteration order is arbitrary, so this output is not sorted.
    for word in word2count.keys():
        print '%s\t%s' % (word, word2count[word])
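A likely reason the single-reducer output still looks unordered: the reducer iterates `word2count.keys()`, and plain-dict iteration order is arbitrary in Python 2 (and in CPython before 3.7), which throws away whatever sorted order the shuffle delivered. A minimal sketch, using made-up sample counts, of printing the keys in sorted order instead:

```python
# Hypothetical aggregated counts, standing in for word2count above.
word2count = {'far': 2, 'A': 1, 'galaxy': 1, 'a': 1}

# sorted() restores the byte order that plain dict iteration discards.
for word in sorted(word2count):
    print('%s\t%s' % (word, word2count[word]))
```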
I run it with this command:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -input /user/cloudera/input \
        -output /user/cloudera/output_new_0 \
        -mapper /home/cloudera/wordcount_mapper.py \
        -reducer /home/cloudera/wordcount_reducer.py \
        -numReduceTasks 1
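For local debugging (my own sketch, not part of the question), the streaming pipeline can be simulated with `sort` standing in for the shuffle; `awk` plays the mapper and reducer here so the snippet needs no extra files:

```shell
# Local stand-in for the streaming job: map | sort (the shuffle) | reduce.
# The final sort is needed because awk's for-in order is arbitrary,
# just like the Python dict in the reducer above.
printf 'A long time ago\nlong long ago\n' \
  | awk '{for (i = 1; i <= NF; i++) print $i "\t1"}' \
  | sort \
  | awk -F'\t' '{c[$1] += $2} END {for (w in c) print w "\t" c[w]}' \
  | sort
```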

@… I edited my question, but I don't understand what you mean about the driver!