Understanding Hadoop behavior with GZ files


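I have a small JSON file in two separate folders in an S3 bucket: one copy as plain JSON, the other gzipped. I ran the same command with the same mapper on each of them separately.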

Normal JSON

$ hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -Dmapred.reduce.tasks=0 -file ./mapper.py -mapper ./mapper.py -input s3://mybucket/normaltest -output smalltest-output
14/08/28 08:33:53 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
packageJobJar: [./mapper.py, /mnt/var/lib/hadoop/tmp/hadoop-unjar6225144044327095484/] [] /tmp/streamjob6947060448653690043.jar tmpDir=null
14/08/28 08:33:56 INFO mapred.JobClient: Default number of map tasks: null
14/08/28 08:33:56 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 160
14/08/28 08:33:56 INFO mapred.JobClient: Default number of reduce tasks: 0
14/08/28 08:33:56 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
14/08/28 08:33:56 INFO mapred.JobClient: Setting group to hadoop
14/08/28 08:33:56 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
14/08/28 08:33:56 WARN lzo.LzoCodec: Could not find build properties file with revision hash
14/08/28 08:33:56 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
14/08/28 08:33:56 WARN snappy.LoadSnappy: Snappy native library is available
14/08/28 08:33:56 INFO snappy.LoadSnappy: Snappy native library loaded
14/08/28 08:33:58 INFO mapred.FileInputFormat: Total input paths to process : 1
14/08/28 08:33:58 INFO streaming.StreamJob: getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
14/08/28 08:33:58 INFO streaming.StreamJob: Running job: job_201408260907_0053
14/08/28 08:33:58 INFO streaming.StreamJob: To kill this job, run:
14/08/28 08:33:58 INFO streaming.StreamJob: /home/hadoop/bin/hadoop job  -Dmapred.job.tracker=10.165.13.124:9001 -kill job_201408260907_0053
14/08/28 08:33:58 INFO streaming.StreamJob: Tracking URL: http://ip-10-165-13-124.ec2.internal:9100/jobdetails.jsp?jobid=job_201408260907_0053
14/08/28 08:33:59 INFO streaming.StreamJob:  map 0%  reduce 0%
14/08/28 08:34:23 INFO streaming.StreamJob:  map 1%  reduce 0%
14/08/28 08:34:26 INFO streaming.StreamJob:  map 2%  reduce 0%
14/08/28 08:34:29 INFO streaming.StreamJob:  map 9%  reduce 0%
14/08/28 08:34:32 INFO streaming.StreamJob:  map 45%  reduce 0%
14/08/28 08:34:35 INFO streaming.StreamJob:  map 56%  reduce 0%
14/08/28 08:34:36 INFO streaming.StreamJob:  map 57%  reduce 0%
14/08/28 08:34:38 INFO streaming.StreamJob:  map 84%  reduce 0%
14/08/28 08:34:39 INFO streaming.StreamJob:  map 85%  reduce 0%
14/08/28 08:34:41 INFO streaming.StreamJob:  map 99%  reduce 0%
14/08/28 08:34:44 INFO streaming.StreamJob:  map 100%  reduce 0%
14/08/28 08:34:50 INFO streaming.StreamJob:  map 100%  reduce 100%
14/08/28 08:34:50 INFO streaming.StreamJob: Job complete: job_201408260907_0053
14/08/28 08:34:50 INFO streaming.StreamJob: Output: smalltest-output

In smalltest-output, I get several small files, each containing part of the processed JSON.

GZIPed JSON

$ hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -Dmapred.reduce.tasks=0 -file ./mapper.py -mapper ./mapper.py -input s3://weblablatency/gztest -output smalltest-output
14/08/28 08:39:45 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
packageJobJar: [./mapper.py, /mnt/var/lib/hadoop/tmp/hadoop-unjar2539293594337011579/] [] /tmp/streamjob301144784484156113.jar tmpDir=null
14/08/28 08:39:48 INFO mapred.JobClient: Default number of map tasks: null
14/08/28 08:39:48 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 160
14/08/28 08:39:48 INFO mapred.JobClient: Default number of reduce tasks: 0
14/08/28 08:39:48 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
14/08/28 08:39:48 INFO mapred.JobClient: Setting group to hadoop
14/08/28 08:39:48 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
14/08/28 08:39:48 WARN lzo.LzoCodec: Could not find build properties file with revision hash
14/08/28 08:39:48 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
14/08/28 08:39:48 WARN snappy.LoadSnappy: Snappy native library is available
14/08/28 08:39:48 INFO snappy.LoadSnappy: Snappy native library loaded
14/08/28 08:39:50 INFO mapred.FileInputFormat: Total input paths to process : 1
14/08/28 08:39:51 INFO streaming.StreamJob: getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
14/08/28 08:39:51 INFO streaming.StreamJob: Running job: job_201408260907_0055
14/08/28 08:39:51 INFO streaming.StreamJob: To kill this job, run:
14/08/28 08:39:51 INFO streaming.StreamJob: /home/hadoop/bin/hadoop job  -Dmapred.job.tracker=10.165.13.124:9001 -kill job_201408260907_0055
14/08/28 08:39:51 INFO streaming.StreamJob: Tracking URL: http://ip-10-165-13-124.ec2.internal:9100/jobdetails.jsp?jobid=job_201408260907_0055
14/08/28 08:39:52 INFO streaming.StreamJob:  map 0%  reduce 0%
14/08/28 08:40:20 INFO streaming.StreamJob:  map 100%  reduce 0%
14/08/28 08:40:26 INFO streaming.StreamJob:  map 100%  reduce 100%
14/08/28 08:40:26 INFO streaming.StreamJob: Job complete: job_201408260907_0055

In smalltest-output, I get a correctly parsed result, but as a single file.

Why is there this difference? What is happening? In the gz case, is my work not being distributed properly?


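In my actual use case, I need to process 2000 gz files, totaling about 4GB uncompressed, every 4 hours. So I cannot afford any performance problem caused by compression.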

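Gzip is not splittable, so Hadoop hands each gz file to a single mapper as a whole. You will find plenty of articles and questions discussing this, so I won't go into detail.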
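To make "not splittable" concrete, here is a minimal sketch in plain Python (standard library only, not Hadoop code). It builds a hypothetical gzip payload and shows that decoding works from byte 0 but fails when started at an arbitrary offset, which is the situation a mapper would be in if it were handed only the second half of a gz file. The payload and the helper name `can_decode_from` are illustrative assumptions.

```python
import gzip
import zlib

# Hypothetical input: many small JSON lines, gzipped as one stream.
payload = b"".join(b'{"id": %d}\n' % i for i in range(10000))
compressed = gzip.compress(payload)


def can_decode_from(data, offset):
    """Return True if a gzip decoder can start at byte `offset` of `data`."""
    decoder = zlib.decompressobj(31)  # 31 = expect a gzip header
    try:
        decoder.decompress(data[offset:])
        return True
    except zlib.error:
        # No gzip header at this position, so the decoder cannot start here.
        return False


start_ok = can_decode_from(compressed, 0)
middle_ok = can_decode_from(compressed, len(compressed) // 2)
print("decode from start: ", start_ok)
print("decode from middle:", middle_ok)
```

A plain-text file has no such restriction: a reader can seek to any offset and resume at the next newline, which is why the uncompressed input could be divided among several mappers.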

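Your options are:

  • Don't use Gzip (either leave the data uncompressed or use another compression format that is splittable)
  • Use a hack that makes Gzip splittable. Each mapper still has to read the file from the beginning, so this is a trade-off. Read the tool's documentation for more information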

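It depends on what you are doing, but for most workloads, processing 4GB of data is nothing at all. I would make sure my use case really needs something like Hadoop. It is scalable, but it is complex, painful to work with, and usually slow for small datasets.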

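So, given that I give Hadoop 6000 files in the input directory, I can expect it to divide the load properly by whole files, if not by file blocks, right?

Yes, and as Clement pointed out, this has been discussed countless times. I would add that compression generally does not slow a job down; in fact, it can speed it up. Modern CPUs and libraries can decompress data faster than it can be read from disk, so the bottleneck of a job is usually disk I/O, not CPU.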
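The whole-file division asked about above can be sketched as a rough estimate. The function below is a simplified model of how a file-based input format might count map tasks, not the actual Hadoop implementation: it assumes a splittable file is cut into pieces of roughly `total_size / requested_maps` bytes (capped by a hypothetical 64 MB block size), while each non-splittable file becomes exactly one task. The file sizes, `requested_maps=160` (taken from the log line "Setting default number of map tasks based on cluster size to : 160"), and the function name are illustrative assumptions; real behavior depends on the Hadoop version and configuration.

```python
import math


def estimate_map_tasks(file_sizes, splittable, requested_maps=160,
                       block_size=64 * 1024 * 1024, min_size=1):
    """Rough estimate of map tasks for a list of input file sizes (bytes)."""
    if not splittable:
        # A non-splittable file (e.g. gzip) is read by one mapper as a whole.
        return len(file_sizes)
    total = sum(file_sizes)
    # Target bytes per split if the input were divided evenly.
    goal_size = max(total // max(requested_maps, 1), 1)
    split_size = max(min_size, min(goal_size, block_size))
    return sum(math.ceil(size / split_size) for size in file_sizes)


# One small plain file: cut into many pieces, hence many small output files.
plain_single = estimate_map_tasks([1_600_000], splittable=True)
# The same file gzipped: one mapper, one output file.
gzip_single = estimate_map_tasks([1_600_000], splittable=False)
# Many gzip files: still parallel, one mapper per file.
gzip_many = estimate_map_tasks([2_000_000] * 2000, splittable=False)

print(plain_single, gzip_single, gzip_many)  # 160 1 2000
```

Under this model, a directory of thousands of similarly sized gz files still spreads across the cluster, because parallelism comes from the number of files rather than from splitting any single file.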