Python: How do I run an MRJob on a local Hadoop cluster with Hadoop Streaming?
I am currently taking a Big Data class, and one of my projects is to run my mapper/reducer on a locally set up Hadoop cluster. I have been using Python with the mrjob library for the class. Here is the Python code I have written so far for the mapper/reducer:
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
import os

WORD_RE = re.compile(r"[\w']+")

class MRPrepositionsFinder(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words),
            MRStep(reducer=self.reducer_find_prep_word)
        ]

    def mapper_get_words(self, _, line):
        # load the indicator words, lowercased and stripped of whitespace
        word_list = set(w.lower().strip() for w in open("/hdfs/user/user/indicators.txt"))
        # name of the input file this mapper is reading from
        file_name = os.environ['map_input_file']
        # iterate through each word in the line
        for word in WORD_RE.findall(line):
            # if the word is an indicator, yield the file name as the key
            if word.lower() in word_list:
                choice = file_name.split('/')[5]
                yield (choice, 1)

    def reducer_find_prep_word(self, choice, counts):
        # each item is (choice, count), so sum the counts for each choice
        yield (choice, sum(counts))

if __name__ == '__main__':
    MRPrepositionsFinder.run()
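As an aside, the mapper/reducer logic can be sanity-checked without a cluster. Below is a minimal pure-Python simulation of the same flow; the indicator set and file name here are made up for illustration and are not from my real data:

```python
import re

WORD_RE = re.compile(r"[\w']+")

def mapper(line, word_list, file_name):
    # emit (file_name, 1) for every indicator word found in the line
    for word in WORD_RE.findall(line):
        if word.lower() in word_list:
            yield (file_name, 1)

def reducer(choice, counts):
    # sum the counts emitted for one key
    yield (choice, sum(counts))

indicators = {"against", "between", "during"}
pairs = list(mapper("During the war, between battles", indicators, "HRCmail"))
totals = list(reducer("HRCmail", (count for _, count in pairs)))
print(totals)  # [('HRCmail', 2)]
```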
When I try to run the code on the Hadoop cluster, I use the following command:
python hrc_discover.py /hdfs/user/user/HRCmail/* -r hadoop --hadoop-bin /usr/bin/hadoop > /hdfs/user/user/output
Unfortunately, every time I run that command I get the following error:
No configs found; falling back on auto-configuration
STDERR: Error: JAVA_HOME is not set and could not be found.
Traceback (most recent call last):
  File "hrc_discover.py", line 37, in <module>
    MRPrepositionsFinder.run()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/job.py", line 432, in run
    mr_job.execute()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/job.py", line 453, in execute
    super(MRJob, self).execute()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/launch.py", line 161, in execute
    self.run_job()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/launch.py", line 231, in run_job
    runner.run()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/runner.py", line 437, in run
    self._run()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/hadoop.py", line 346, in _run
    self._find_binaries_and_jars()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/hadoop.py", line 361, in _find_binaries_and_jars
    self.get_hadoop_version()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/hadoop.py", line 198, in get_hadoop_version
    return self.fs.get_hadoop_version()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/fs/hadoop.py", line 117, in get_hadoop_version
    stdout = self.invoke_hadoop(['version'], return_stdout=True)
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/fs/hadoop.py", line 172, in invoke_hadoop
    raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/usr/bin/hadoop', 'version']' returned non-zero exit status 1
I looked around the internet and found that I need to export my JAVA_HOME variable, but I don't want to set anything that might break my setup.
Any help on this would be greatly appreciated, thanks!

The problem turned out to be in the etc/hadoop/hadoop-env.sh script file, where the JAVA_HOME environment variable was configured as:
export JAVA_HOME=$(JAVA_HOME)
So I went ahead and changed it to the following:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
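For context on why the original line fails: in POSIX shell, `$( ... )` is command substitution, so `$(JAVA_HOME)` tries to execute a command named JAVA_HOME rather than expanding the variable, which is written `${JAVA_HOME}`. A quick sketch (the path is just the one from my setup):

```shell
# ${VAR} expands a variable; $(VAR) runs a command named VAR and substitutes its output
JAVA_HOME=/usr/lib/jvm/java-8-openjdk
echo "${JAVA_HOME}"   # prints the path
echo "$(JAVA_HOME)"   # "JAVA_HOME: command not found" on stderr; substitutes nothing
```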
I then tried running the following command again, hoping it would work:
python hrc_discover.py /hdfs/user/user/HRCmail/* -r hadoop --hadoop-bin /usr/bin/hadoop > /hdfs/user/user/output
Thankfully, mrjob picked up the JAVA_HOME environment variable and produced the following output:
No configs found; falling back on auto-configuration
Using Hadoop version 2.7.3
Looking for Hadoop streaming jar in /home/hadoop/contrib...
Looking for Hadoop streaming jar in /usr/lib/hadoop-mapreduce...
Hadoop streaming jar not found. Use --hadoop-streaming-jar
Creating temp directory /tmp/hrc_discover.user.20170306.022649.449218
Copying local files to hdfs:///user/user/tmp/mrjob/hrc_discover.user.20170306.022649.449218/files/...
..
To fix the issue with the Hadoop streaming jar, I added the following switch to the command:
--hadoop-streaming-jar /usr/lib/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar
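The streaming jar's location varies between Hadoop distributions; the path above is where it lived on my install. If it isn't there on your machine, a quick glob over a few likely prefixes can track it down (the prefixes below are just guesses, so adjust them for your distro):

```python
import glob

# probe a couple of common install prefixes for the streaming jar
patterns = [
    "/usr/lib/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar",
    "/usr/lib/hadoop-mapreduce/hadoop-streaming*.jar",
]
matches = [path for pattern in patterns for path in glob.glob(pattern)]
print(matches if matches else "no streaming jar found under the probed prefixes")
```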
The full command looked like this:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
python hrc_discover.py /hdfs/user/user/HRCmail/* -r hadoop --hadoop-streaming-jar /usr/lib/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar --hadoop-bin /usr/bin/hadoop > /hdfs/user/user/output
which resulted in the following output:
No configs found; falling back on auto-configuration
Using Hadoop version 2.7.3
Creating temp directory /tmp/hrc_discover.user.20170306.022649.449218
Copying local files to hdfs:///user/user/tmp/mrjob/hrc_discover.user.20170306.022649.449218/files/...
The issue appears to be resolved, and Hadoop should now process my job. It also helped that I had set my machine up to run in pseudo-distributed mode and had put the local data files into HDFS.