Hadoop Python job on snappy files produces 0-size output

When I run wordcount.py (a Python mrjob) with Hadoop streaming on a text file, it gives me output, but when I run the same thing against a .snappy file, I get zero-size output.
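
For reference, the OP's word_count2.py is not shown; the following is a minimal sketch based on the standard mrjob word-count pattern (names are illustrative):

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # emit (word, 1) for each word in the input line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # sum the per-word counts emitted by the mappers
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()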

Options tried:

[testgen word_count]# cat mrjob.conf 
runners:
  hadoop: # this will work for both hadoop and emr
    jobconf:
      mapreduce.task.timeout: 3600000
      #mapreduce.max.split.size: 20971520
      #mapreduce.input.fileinputformat.split.maxsize: 102400
      #mapreduce.map.memory.mb: 8192
      mapred.map.child.java.opts: -Xmx4294967296
      mapred.child.java.opts: -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/

      java.library.path: /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/

      # "true" must be a string argument, not a boolean! (#323)
      #mapreduce.output.compress: "true"
      #mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec

[testgen word_count]# 
Command:

[testgen word_count]# python word_count2.py -r hadoop hdfs:///input.snappy --conf mrjob.conf 
creating tmp directory /tmp/word_count2.root.20151111.113113.369549
writing wrapper script to /tmp/word_count2.root.20151111.113113.369549/setup-wrapper.sh
Using Hadoop version 2.5.0
Copying local files into hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/files/

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

Detected hadoop configuration property names that do not match hadoop version 2.5.0:
They have been translated as follows
 mapred.map.child.java.opts: mapreduce.map.java.opts
HADOOP: packageJobJar: [/tmp/hadoop-root/hadoop-unjar3623089386341942955/] [] /tmp/streamjob3671127555730955887.jar tmpDir=null
HADOOP: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
HADOOP: Running job: job_201511021537_70340
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH//bin/hadoop job  -Dmapred.job.tracker=logicaljt -kill job_201511021537_70340
HADOOP: Tracking URL: http://xxxxx_70340
HADOOP:  map 0%  reduce 0%
HADOOP:  map 100%  reduce 0%
HADOOP:  map 100%  reduce 11%
HADOOP:  map 100%  reduce 97%
HADOOP:  map 100%  reduce 100%
HADOOP: Job complete: job_201511021537_70340
HADOOP: Output: hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output
Counters from step 1:
  (no counters found)
Streaming final output from hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output

removing tmp directory /tmp/word_count2.root.20151111.113113.369549
deleting hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549 from HDFS
[testgen word_count]# 
No errors were thrown, the job reported success, and I verified the job configuration in the job stats.


Is there any other way to troubleshoot this?

I think you are not using the right options.

In your mrjob.conf file:

  • mapreduce.output.compress: "true" means that the output should be compressed
  • mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec means that the compression should use the Snappy codec
  • Apparently, what you want is for your mappers to read the compressed input correctly. Unfortunately, it does not work that way. If you really want to feed compressed data to your job, you could look into SequenceFile. Another, simpler solution is to feed your job plain text files.

    Shouldn't you also configure the input format, for example
    mapreduce.input.compression.codec: org.apache.hadoop.io.compress.SnappyCodec?


    [Edit: you should also remove the # symbol at the beginning of the
    lines that define those options. Otherwise they will be ignored.]
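
    As a sketch only: the same properties could also be set from the job
    class itself through mrjob's JOBCONF class attribute instead of
    mrjob.conf (MRWordCount is an illustrative name):

    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        # Same effect as uncommenting the options in mrjob.conf above.
        # Note that these control how the job writes its *output*; they
        # do not make the mappers decode Snappy-compressed *input*.
        JOBCONF = {
            'mapreduce.output.compress': 'true',  # a string, not a boolean
            'mapreduce.output.compression.codec':
                'org.apache.hadoop.io.compress.SnappyCodec',
        }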

    Thanks for your input Yann, but in the end it was adding the following line to the job script that fixed the problem:

    HADOOP_INPUT_FORMAT='<org.hadoop.snappy.codec>'
    
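    For context, HADOOP_INPUT_FORMAT is a class attribute on mrjob's MRJob
    that gets passed to the Hadoop streaming jar as -inputformat. A sketch
    of that fix (the value in angle brackets is the placeholder from the
    line above, not a verified input format class):

    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        # mrjob passes this value to Hadoop streaming as -inputformat;
        # '<org.hadoop.snappy.codec>' is the OP's placeholder, not a
        # real class name.
        HADOOP_INPUT_FORMAT = '<org.hadoop.snappy.codec>'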