Python Hadoop returns fewer results than expected


I have two Python scripts, a mapper and a reducer (at this point the reducer basically just prints what it receives). Locally I get 4 result strings; on Hadoop I get 3. What is going on?

I am using Hadoop on Amazon Elastic MapReduce.

mapper.py

#!/usr/bin/env python

import sys
import re
import os
# Constants declaration

WINDOW = 10
OVERLAP = 4
START_POSITION = 0
END_POSITION = 0
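
# With WINDOW = 10 and OVERLAP = 4, each emitted window advances by
# WINDOW - OVERLAP = 6 characters and shares its last 4 characters
# with the next window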

# regular expressions

pattern = re.compile("[a-z]*", re.IGNORECASE)

a_to_f_pattern = re.compile("[a-f]", re.IGNORECASE)
g_to_l_pattern = re.compile("[g-l]", re.IGNORECASE)
m_to_r_pattern = re.compile("[m-r]", re.IGNORECASE)
s_to_z_pattern = re.compile("[s-z]", re.IGNORECASE)
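
# The original post does not include convert(); the stub below is a guessed
# stand-in based on the four character-class patterns above: it maps a word
# to a single character chosen by the range of the word's first letter
def convert(word):
    if not word:
        return ""
    if a_to_f_pattern.match(word):
        return "a"
    if g_to_l_pattern.match(word):
        return "g"
    if m_to_r_pattern.match(word):
        return "m"
    if s_to_z_pattern.match(word):
        return "s"
    return ""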

# variables initialization

converted_word = ""
next_word = ""
new_character = ""
filename = ""
prev_filename = ""
i = 0



# Read pairs as lines of input from STDIN
for line in sys.stdin:

    line = line.strip()

    # Hadoop Streaming exposes the current input file's path in this
    # environment variable
    filename = os.environ['mapreduce_map_input_file']
    filename = filename.replace("s3://source123/input/","")


    # check if it's a new file, and reset the start position
    if filename != prev_filename:

        START_POSITION = 0
        next_word = ""
        converted_word = ""
        prev_filename = filename

    # loop through every word that matches the pattern
    for word in pattern.findall(line):

        # grow the current window while it is shorter than WINDOW (this
        # condition was dropped from the post; the bound is implied by the
        # reset logic in the else branch)
        if len(converted_word) < WINDOW:

            new_character = convert(word)
            converted_word = converted_word + new_character

            if len(converted_word) > (WINDOW - OVERLAP):
                next_word = next_word + new_character

            # print "word= ", word
            # print "converted_word= ", converted_word
        else:

            END_POSITION = START_POSITION + (len(converted_word) - 1)

            print converted_word + "," + str(filename) + "," + str(START_POSITION) + "," + str(END_POSITION)

            START_POSITION = START_POSITION + (WINDOW - OVERLAP)
            new_character = convert(word)
            converted_word = next_word + new_character
            # reset the overlap buffer (not shown in the post; without it
            # next_word keeps growing across windows)
            next_word = ""
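
# Note: a final window that is still shorter than WINDOW when stdin runs out
# is never printed; if the missing fourth result is the last window, a
# trailing print after this loop would be needed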
The mapper task transforms its input into lines and feeds those lines to the stdin of your process.

In this case you have more than one input file, and you are assuming that all lines from the different files arrive in order (i.e., file by file). The splits may, however, be processed in parallel, so a mapper that receives lines from two input files can have its counters reset partway through, depending on how the input is distributed.

So how can I adjust the script? A first idea might be to turn your prev_filename into a dict keyed by filename and test whether the dict already has that key...
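
A minimal sketch of that idea, assuming the same mapreduce_map_input_file environment variable and S3 prefix as in the mapper above (the file_state dict and its field names are illustrative, not from the original post):

#!/usr/bin/env python
import os
import sys

# one state record per input file, so lines from different files cannot
# clobber each other's counters, however the splits are interleaved
file_state = {}

for line in sys.stdin:
    line = line.strip()
    filename = os.environ['mapreduce_map_input_file']
    filename = filename.replace("s3://source123/input/", "")

    # first line seen from this file: start with fresh counters
    if filename not in file_state:
        file_state[filename] = {"start": 0, "converted": "", "next": ""}

    state = file_state[filename]
    # ... run the windowing logic from mapper.py here, reading and writing
    # state["start"], state["converted"] and state["next"] instead of the
    # module-level variables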
Logs:

2016-04-27 19:58:41,293 INFO com.amazon.ws.emr.hadoop.fs.EmrFileSystem (main): Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
2016-04-27 19:58:41,512 INFO amazon.emr.metrics.MetricsSaver (main): MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: true maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1461784308237 
2016-04-27 19:58:41,512 INFO amazon.emr.metrics.MetricsSaver (main): Created MetricsSaver j-KCDMFZJGYO89:i-995f5a41:RunJar:16480 period:60 /mnt/var/em/raw/i-995f5a41_20160427_RunJar_16480_raw.bin
2016-04-27 19:58:43,477 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-38-52.us-west-2.compute.internal/172.31.38.52:8032
2016-04-27 19:58:43,673 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-38-52.us-west-2.compute.internal/172.31.38.52:8032
2016-04-27 19:58:44,156 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://source123/mapper.py' for reading
2016-04-27 19:58:44,267 INFO amazon.emr.metrics.MetricsSaver (main): Thread 1 created MetricsLockFreeSaver 1
2016-04-27 19:58:44,439 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://source123/source_reducer.py' for reading
2016-04-27 19:58:44,628 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2016-04-27 19:58:44,630 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev 426d94a07125cf9447bb0c2b336cf10b4c254375]
2016-04-27 19:58:45,046 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): listStatus s3://source123/input with recursive false
2016-04-27 19:58:45,265 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input paths to process : 1
2016-04-27 19:58:45,336 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): number of splits:9
2016-04-27 19:58:45,565 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Submitting tokens for job: job_1461784297295_0004
2016-04-27 19:58:45,710 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (main): Submitted application application_1461784297295_0004
2016-04-27 19:58:45,743 INFO org.apache.hadoop.mapreduce.Job (main): The url to track the job: http://ip-172-31-38-52.us-west-2.compute.internal:20888/proxy/application_1461784297295_0004/
2016-04-27 19:58:45,744 INFO org.apache.hadoop.mapreduce.Job (main): Running job: job_1461784297295_0004
2016-04-27 19:58:53,876 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1461784297295_0004 running in uber mode : false
2016-04-27 19:58:53,877 INFO org.apache.hadoop.mapreduce.Job (main):  map 0% reduce 0%
2016-04-27 19:59:11,063 INFO org.apache.hadoop.mapreduce.Job (main):  map 11% reduce 0%
2016-04-27 19:59:14,081 INFO org.apache.hadoop.mapreduce.Job (main):  map 22% reduce 0%
2016-04-27 19:59:16,094 INFO org.apache.hadoop.mapreduce.Job (main):  map 33% reduce 0%
2016-04-27 19:59:18,106 INFO org.apache.hadoop.mapreduce.Job (main):  map 56% reduce 0%
2016-04-27 19:59:19,114 INFO org.apache.hadoop.mapreduce.Job (main):  map 67% reduce 0%
2016-04-27 19:59:26,159 INFO org.apache.hadoop.mapreduce.Job (main):  map 78% reduce 0%
2016-04-27 19:59:29,178 INFO org.apache.hadoop.mapreduce.Job (main):  map 89% reduce 0%
2016-04-27 19:59:30,184 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 0%
2016-04-27 19:59:32,196 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 33%
2016-04-27 19:59:34,207 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 67%
2016-04-27 19:59:38,228 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 100%
2016-04-27 19:59:40,246 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1461784297295_0004 completed successfully
2016-04-27 19:59:40,409 INFO org.apache.hadoop.mapreduce.Job (main): Counters: 55
    File System Counters
        FILE: Number of bytes read=190
        FILE: Number of bytes written=1541379
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=873
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
        S3: Number of bytes read=864
        S3: Number of bytes written=130
        S3: Number of read operations=0
        S3: Number of large read operations=0
        S3: Number of write operations=0
    Job Counters 
        Killed map tasks=1
        Launched map tasks=9
        Launched reduce tasks=3
        Data-local map tasks=9
        Total time spent by all maps in occupied slots (ms)=6351210
        Total time spent by all reduces in occupied slots (ms)=2449170
        Total time spent by all map tasks (ms)=141138
        Total time spent by all reduce tasks (ms)=27213
        Total vcore-milliseconds taken by all map tasks=141138
        Total vcore-milliseconds taken by all reduce tasks=27213
        Total megabyte-milliseconds taken by all map tasks=203238720
        Total megabyte-milliseconds taken by all reduce tasks=78373440
    Map-Reduce Framework
        Map input records=5
        Map output records=3
        Map output bytes=124
        Map output materialized bytes=562
        Input split bytes=873
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=562
        Reduce input records=3
        Reduce output records=6
        Spilled Records=6
        Shuffled Maps =27
        Failed Shuffles=0
        Merged Map outputs=27
        GC time elapsed (ms)=2785
        CPU time spent (ms)=11670
        Physical memory (bytes) snapshot=5282500608
        Virtual memory (bytes) snapshot=28472725504
        Total committed heap usage (bytes)=5977407488
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=864
    File Output Format Counters 
        Bytes Written=130
2016-04-27 19:59:40,409 INFO org.apache.hadoop.streaming.StreamJob (main): Output directory: s3://source123/output/