Python mapper code runs with unix pipes but not with Hadoop streaming. Error: NA. Streaming Command Failed!


I am attempting the inverted word list problem in Hadoop Streaming (for each word, the output is the list of file names containing that word). The input is the name of a directory holding text files. I have written the mapper and reducer in Python, and they work fine when tried with unix pipes. However, when executed with the Hadoop streaming command, the code runs but the job eventually fails. I suspect the problem is in the mapper code, but I can't seem to pin down exactly what it is.
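(As a made-up illustration of the goal, not the actual course data: if hamlet.txt contained "to be" and macbeth.txt contained "be bold", the desired inverted list would be:

be      hamlet, macbeth
bold    macbeth
to      hamlet)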

I am a beginner working on the Cloudera training VM under VMware Fusion (so please bear with me if I am not doing this right). I placed the Mapper and Reducer .py executables in my home directory both on the local system and on HDFS, and the 'shakespeare' directory is on HDFS. The following unix pipe command works fine:

echo shakespeare | ./InvertedMapper.py | sort | ./InvertedReducer.py
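Note that in this pipeline the mapper receives the literal string "shakespeare" on stdin, changes into that directory, and opens each file itself; that is why the echo works. For the ./ invocations to run at all, both scripts are assumed to carry the execute bit, e.g.:

chmod +x InvertedMapper.py InvertedReducer.py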

#MAPPER CODE

#!/usr/bin/env python

import sys
import os

class Mapper(object):

        def __init__(self, stream, sep='\t'):
                self.stream=stream
                self.sep=sep

        def __iter__(self):
                os.chdir(self.stream.read().strip())
                files = [os.path.abspath(f) for f in os.listdir(".")]
                for file in files:
                        yield file

        def emit(self, key, value):
                sys.stdout.write("{0}{1}{2}\n".format(key,self.sep,value))

        def map(self):
                for file in self:
                        with open(file) as infile:
                                name = file.split("/")[-1].split(".")[0]
                                words = infile.read().strip().split()
                                for word in words:
                                        self.emit(word,name)

if __name__ == "__main__":
        cwd = os.getcwd()
        mapper = Mapper(sys.stdin)
        mapper.map()
        os.chdir(cwd)


#REDUCER CODE

#!/usr/bin/env python

import sys
from itertools import groupby
from operator import itemgetter

class Reducer(object):
        def __init__(self, stream, sep="\t"):
                self.stream = stream
                self.sep = sep

        def __iter__(self):
                for line in self.stream:
                        try:
                                parts = line.strip().split(self.sep)
                                yield parts[0], parts[1]
                        except:
                                continue

        def emit(self, key, value):
                sys.stdout.write("{0}{1}{2}\n".format(key, self.sep, value))

        def reduce(self):
                for key, group in groupby(self, itemgetter(0)):
                        values = []
                        for item in group:
                                values.append(item[1])
                        values = set(values)
                        values = list(values)
                        self.emit(key, values)
if __name__ == "__main__":
    reducer = Reducer(sys.stdin)
    reducer.reduce()
However, the Hadoop streaming command does not:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar -input shakespeare -output InvertedList -mapper InvertedMapper.py -reducer InvertedReducer.py -file InvertedMapper.py -file InvertedReducer.py

Below is the output from running the Hadoop command:

packageJobJar: [InvertedMapper1.py, /tmp/hadoop-training/hadoop-unjar281431668511629942/] [] /tmp/streamjob679048425003800890.jar tmpDir=null
19/02/17 00:22:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
19/02/17 00:22:19 INFO mapred.FileInputFormat: Total input paths to process : 5
19/02/17 00:22:20 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-hdfs/cache/training/mapred/local]
19/02/17 00:22:20 INFO streaming.StreamJob: Running job: job_201902041621_0051
19/02/17 00:22:20 INFO streaming.StreamJob: To kill this job, run:
19/02/17 00:22:20 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201902041621_0051
19/02/17 00:22:20 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201902041621_0051
19/02/17 00:22:21 INFO streaming.StreamJob:  map 0%  reduce 0%
19/02/17 00:22:34 INFO streaming.StreamJob:  map 40%  reduce 0%
19/02/17 00:22:39 INFO streaming.StreamJob:  map 0%  reduce 0%
19/02/17 00:22:50 INFO streaming.StreamJob:  map 40%  reduce 0%
19/02/17 00:22:53 INFO streaming.StreamJob:  map 0%  reduce 0%
19/02/17 00:23:03 INFO streaming.StreamJob:  map 40%  reduce 0%
19/02/17 00:23:06 INFO streaming.StreamJob:  map 20%  reduce 0%
19/02/17 00:23:07 INFO streaming.StreamJob:  map 0%  reduce 0%
19/02/17 00:23:16 INFO streaming.StreamJob:  map 20%  reduce 0%
19/02/17 00:23:17 INFO streaming.StreamJob:  map 40%  reduce 0%
19/02/17 00:23:19 INFO streaming.StreamJob:  map 20%  reduce 0%
19/02/17 00:23:21 INFO streaming.StreamJob:  map 100%  reduce 100%
19/02/17 00:23:21 INFO streaming.StreamJob: To kill this job, run:
19/02/17 00:23:21 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201902041621_0051
19/02/17 00:23:21 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201902041621_0051
19/02/17 00:23:21 ERROR streaming.StreamJob: Job not successful. Error: NA
19/02/17 00:23:21 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

I don't know whether this is what makes the code fail, but the FAQ states that unix pipes should not be used with Hadoop streaming.
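For what it's worth, a streaming mapper is normally handed the contents of the input splits on stdin, one record per line, not a directory name, which would explain why the chdir-based mapper works under the echo pipeline but dies under streaming. The file currently being processed is exposed to streaming tasks through an environment variable (map_input_file on older releases, mapreduce_map_input_file on newer ones). A minimal sketch along those lines; which of the two variables is actually set depends on the Hadoop version, so treat the fallback chain as an assumption:

#!/usr/bin/env python
# Sketch of a streaming-friendly mapper: Hadoop streaming pipes the input
# file contents to stdin; the source file name comes from an environment
# variable that streaming sets for each task (which one exists here is an
# assumption about the Hadoop version).
import os
import sys

path = os.environ.get("mapreduce_map_input_file") or \
       os.environ.get("map_input_file", "unknown")
name = path.split("/")[-1].split(".")[0]

for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write("{0}\t{1}\n".format(word, name))

Locally this can be simulated by feeding file contents rather than a directory name, e.g. (file name hypothetical):

map_input_file=shakespeare/hamlet.txt python InvertedMapper.py < shakespeare/hamlet.txt | sort | ./InvertedReducer.py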




The command could look like this:

mapred streaming -files ./map_file.py,./reduce_file.py -mapper "python3 map_file.py" -reducer "python3 reduce_file.py" -input /input -output /output

You also made a mistake with self.emit(word, name); it should be self.emit(word, words). The problem is definitely in the mapper, and since we haven't seen the data, I would add this to the top of the code: # -*- coding: utf-8 -*-. Hope it helps.
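On the CDH VM from the question, which ships the MR1 streaming jar, the equivalent invocation would presumably look like the following (the -files list and output directory name are assumptions); quoting the interpreter in -mapper/-reducer means the job does not depend on the scripts' execute bits or shebang lines:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar \
    -files InvertedMapper.py,InvertedReducer.py \
    -mapper "python InvertedMapper.py" \
    -reducer "python InvertedReducer.py" \
    -input shakespeare -output inverted_list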