Can't read a Hadoop sequence file via stdin with streaming Python MapReduce on AWS


I am trying to run a simple word-count MapReduce job on Amazon's Elastic MapReduce, but the output is garbled. The input file is part of a Hadoop sequence file and is supposed to contain text extracted from crawled web pages (stripped of HTML).

My AWS Elastic MapReduce step looks like this:

Mapper: s3://com.gpanterov.scripts/mapper.py
Reducer: s3://com.gpanterov.scripts/reducer.py
Input S3 location: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
Output S3 location: s3://com.gpanterov.output/job3/
The job runs successfully, but the output is garbage: nothing but strange symbols, no words at all. My guess is that this happens because a Hadoop sequence file cannot be read through stdin. But then how do you run an MR job on such a file? Do you have to convert the sequence file to a text file first?
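One quick way to confirm that the input really is a binary sequence file rather than plain text is to inspect its header: Hadoop sequence files start with the three magic bytes "SEQ" followed by a version byte, which is exactly why a text-oriented streaming mapper sees binary junk. A minimal sketch (the helper name and sample headers are mine, not from the job):

```python
# Sketch: detect a Hadoop SequenceFile by its magic header.
# SequenceFiles begin with the bytes "SEQ" followed by a version byte.

def is_sequence_file(header_bytes):
    """Return True if the first bytes look like a Hadoop SequenceFile header."""
    return header_bytes[:3] == b"SEQ"

# Fabricated example headers for illustration:
print(is_sequence_file(b"SEQ\x06org.apache.hadoop.io.Text"))  # True
print(is_sequence_file(b"plain text data"))                   # False
```

Reading the first few bytes of the input locally and running them through a check like this tells you whether the file needs a sequence-file-aware input format.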

The first few lines of part-00000 look like this:

'\x00\x00\x87\xa0 was found 1 times\t\n'
'\x00\x00\x8e\x01:\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x05\xc1=K\x02\x01\x00\x80a\xf0\xbc\xf3N\xbd\x0f\xaf\x145\xcdJ!#T\x94\x88ZD\x89\x027i\x08\x8a\x86\x16\x97lp0\x02\x87 was found 1 times\t\n'
Here is my mapper:

#!/usr/bin/env python

import sys

for line in sys.stdin:
    words = line.split()
    for word in words:
        print word + "\t" + str(1)
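Before shipping the mapper to EMR, its logic can be sanity-checked locally against a few lines of fake input. A sketch (the helper name `run_mapper` is mine; the emitted "word\t1" format must match what the reducer's `split("\t", 1)` expects):

```python
def run_mapper(lines):
    """Apply the word-count mapper logic to an iterable of input lines."""
    records = []
    for line in lines:
        for word in line.split():
            records.append(word + "\t1")
    return records

# Simulate two lines arriving on stdin:
for record in run_mapper(["hello world", "hello again"]):
    print(record)
# → hello\t1, world\t1, hello\t1, again\t1 (one record per line)
```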
And my reducer:

#!/usr/bin/env python

import sys

def output(previous_key, total):
    if previous_key is not None:
        print previous_key + " was found " + str(total) + " times"

previous_key = None
total = 0

for line in sys.stdin:
    key, value = line.split("\t", 1)
    if key != previous_key:
        output(previous_key, total)
        previous_key = key
        total = 0
    total += int(value)

output(previous_key, total)
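The two scripts together can be exercised end to end by reproducing Hadoop streaming's map → sort → reduce pipeline in plain Python. This is a sketch with made-up input; in the real job, Hadoop performs the sort between the two phases:

```python
def map_phase(lines):
    """Word-count mapper: emit "word\t1" for each token."""
    for line in lines:
        for word in line.split():
            yield word + "\t1"

def reduce_phase(sorted_records):
    """Word-count reducer: sum counts over consecutive identical keys."""
    previous_key, total = None, 0
    for record in sorted_records:
        key, value = record.split("\t", 1)
        if key != previous_key:
            if previous_key is not None:
                yield previous_key + " was found " + str(total) + " times"
            previous_key, total = key, 0
        total += int(value)
    if previous_key is not None:
        yield previous_key + " was found " + str(total) + " times"

# Hadoop streaming sorts mapper output by key before the reducer runs:
records = sorted(map_phase(["the quick the"]))
print(list(reduce_phase(records)))
# → ['quick was found 1 times', 'the was found 2 times']
```

If this local pipeline produces sensible output but the cluster job does not, the problem lies in how Hadoop is feeding records to the mapper, not in the scripts themselves.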
There is nothing wrong with the input file. On a local machine I ran hadoop fs -text textData-00112 | less and it returned plain text from the web pages.
Any input on how to run a Python streaming MapReduce job on these types of input files (Common Crawl Hadoop sequence files) is much appreciated.

You need to provide SequenceFileAsTextInputFormat as the inputformat to the hadoop streaming jar.

I have never used Amazon AWS MapReduce, but on a normal Hadoop installation it would be done like this:

HADOOP=$HADOOP_HOME/bin/hadoop
$HADOOP jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input <input_directory> \
  -output <output_directory> \
  -mapper "mapper.py" \
  -reducer "reducer.py" \
  -inputformat SequenceFileAsTextInputFormat
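With SequenceFileAsTextInputFormat, each record reaches the mapper on stdin as a key<TAB>value line (the sequence-file key and value converted to text), so a word-count mapper should usually discard the record key before tokenizing. A hedged sketch of that adjustment (the function name and sample record are mine):

```python
def map_record(line):
    """Split a "key\tvalue" record produced by SequenceFileAsTextInputFormat
    and count words in the value only, ignoring the sequence-file key."""
    # Records without a tab are treated as value-only, just in case.
    _, _, value = line.partition("\t")
    text = value if value else line
    return [word + "\t1" for word in text.split()]

# Hypothetical record: key "doc-42", value is the extracted page text.
print(map_record("doc-42\tsome crawled page text"))
# → ['some\t1', 'crawled\t1', 'page\t1', 'text\t1']
```

Without this adjustment, the sequence-file keys themselves get counted as words, which would explain binary-looking keys such as '\x00\x00\x87\xa0' showing up in the output above.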

Sunny Nanda's suggestion solved the problem. After adding
-inputformat SequenceFileAsTextInputFormat
to the "Extra args" box in the AWS Elastic MapReduce API, the job's output was as expected.