Bootstrapping dependencies on Amazon EMR with Python mrjob
I'm trying to run a MapReduce job with Python mrjob on Amazon EMR, but I'm having trouble getting the dependencies installed. My job code:
from mrjob.job import MRJob
import re
from normalize import *
from compute_features import *
# some code
The normalize and compute_features files have many dependencies, including numpy, scipy, sklearn, fiona, ... My mrjob.conf file:
runners:
  emr:
    aws_access_key_id: xxxx
    aws_secret_access_key: xxxx
    aws_region: eu-west-1
    ec2_key_pair: EMR
    ec2_key_pair_file: /Users/antoinerigoureau/Documents/emr.pem
    ssh_tunnel: true
    ec2_instance_type: m3.xlarge
    ec2_master_instance_type: m3.xlarge
    num_ec2_instances: 1
    cmdenv:
      TZ: Europe/Paris
    bootstrap_python: false
    bootstrap:
      - curl -s https://s3-eu-west-1.amazonaws.com/data-essence/utils/bootstrap.sh | sudo bash -s
      - source /usr/local/ripple/venv/bin/activate
      - sudo pip install -r req.txt#
    upload_archives:
      - /Users/antoinerigoureau/Documents/Essence/data/geoData/urba_france.zip#data
    upload_files:
      - /Users/antoinerigoureau/Documents/Essence/Source/venv_parallel/normalize.py
      - /Users/antoinerigoureau/Documents/Essence/Source/venv_parallel/compute_features.py
    python_bin: /usr/local/ripple/venv/bin/python3
    enable_emr_debugging: True
    setup:
      - source /usr/local/ripple/venv/bin/activate
  local:
    upload_archives:
      - /Users/antoinerigoureau/Documents/Essence/data/geoData/urba_france.zip#data
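For what it's worth, my reading of this config (an assumption on my part, not something the logs confirm): `python_bin` points at the venv interpreter, but the `sudo pip install` line in `bootstrap` probably does not use the venv's pip, because `sudo` typically resets `PATH` (via `secure_path`), undoing the `source .../activate` on the previous line. A variant that pins pip by absolute path would look like:

```yaml
# Hypothetical variant: call the venv's own pip directly instead of relying
# on an activated environment surviving into the sudo'd command.
bootstrap:
  - curl -s https://s3-eu-west-1.amazonaws.com/data-essence/utils/bootstrap.sh | sudo bash -s
  - sudo /usr/local/ripple/venv/bin/pip install -r req.txt#
```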
My bootstrap.sh file is:
#!/bin/bash
set -e
set -x
yum update -y
# install yum packages
yum install -y gcc \
    geos-devel \
    gcc-c++ \
    atlas-sse3-devel \
    lapack-devel \
    libpng-devel \
    freetype-devel \
    zlib-devel \
    ncurses-devel \
    readline-devel \
    patch \
    make \
    libtool \
    curl \
    openssl-devel \
    screen

pushd $HOME
# install python
rm -rf Python-3.5.1.tgz
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz && \
tar -xzvf Python-3.5.1.tgz
pushd Python-3.5.1
./configure
make -j4
make install
popd
export PATH=/usr/local/bin:$PATH
echo "export PATH=/usr/local/bin:\$PATH" > /etc/profile.d/usr_local_path.sh
chmod +x /etc/profile.d/usr_local_path.sh

pip3.5 install --upgrade pip virtualenv
mkdir -p /usr/local/ripple/venv
virtualenv /usr/local/ripple/venv
source /usr/local/ripple/venv/bin/activate

# install gdal
rm -rf gdal191.zip
wget http://download.osgeo.org/gdal/gdal191.zip && \
unzip gdal191.zip
#
# Below is the hack I had to add to get around the following -fPIC error
# /usr/bin/ld: /root/gdal-1.9.1/frmts/o/.libs/aaigridataset.o: relocation R_X86_64_32S against `vtable for AAIGRasterBand' can not be used when making a shared object; recompile with -fPIC
#
pushd gdal-1.9.1
./configure
CC="gcc -fPIC" CXX="g++ -fPIC" make -j4
make install
popd
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
echo "export LD_LIBRARY_PATH=/usr/local/lib:\$LD_LIBRARY_PATH" > /etc/profile.d/gdal_library_path.sh
chmod +x /etc/profile.d/gdal_library_path.sh
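As an aside, the gap between an "activated" virtualenv and a command run under a reset environment can be reproduced locally. This is only an illustration of the mechanism I suspect, using throwaway paths that are not part of the original setup:

```shell
#!/bin/sh
# Build a throwaway "venv-style" bin directory containing a fake pip shim.
mkdir -p /tmp/fakevenv/bin
printf '#!/bin/sh\necho venv-pip\n' > /tmp/fakevenv/bin/pip
chmod +x /tmp/fakevenv/bin/pip

# With the shim directory prepended to PATH (what "activate" does),
# pip resolves to the shim:
PATH="/tmp/fakevenv/bin:$PATH" pip    # prints "venv-pip"

# With a reset PATH (roughly what sudo's secure_path amounts to),
# the shim is no longer found:
env -i PATH=/usr/bin:/bin sh -c 'command -v pip || echo "pip not on PATH"'
```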
However, my job fails with the following output:
Created new cluster j-T8UUFEZILJYQ
Waiting for step 1 of 1 (s-3SOCF1ZPWJ575) to complete...
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
Opening ssh tunnel to resource manager...
Connect to resource manager at: http://localhost:40199/cluster
RUNNING for 16.2s
Unable to connect to resource manager
RUNNING for 48.8s
FAILED
Cluster j-T8UUFEZILJYQ is TERMINATING: Shut down as step failed
Attempting to fetch counters from logs...
Looking for step log in /mnt/var/log/hadoop/steps/s-3SOCF1ZPWJ575 on ec2-54-194-248-128.eu-west-1.compute.amazonaws.com...
Parsing step log: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/steps/s-3SOCF1ZPWJ575/syslog
Counters: 9
Job Counters
Data-local map tasks=1
Failed map tasks=4
Launched map tasks=4
Other local map tasks=3
Total megabyte-seconds taken by all map tasks=33988320
Total time spent by all map tasks (ms)=23603
Total time spent by all maps in occupied slots (ms)=1062135
Total time spent by all reduces in occupied slots (ms)=0
Total vcore-seconds taken by all map tasks=23603
Scanning logs for probable cause of failure...
Looking for task logs in /mnt/var/log/hadoop/userlogs/application_1463748945334_0001 on ec2-54-194-248-128.eu-west-1.compute.amazonaws.com and task/core nodes...
Parsing task syslog: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/syslog
Parsing task stderr: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/stderr
Probable cause of failure:
R/W/S=1749/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 HOST=null
USER=hadoop
HADOOP_USER=null
last tool output: |null|
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:65)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
(from lines 48-72 of ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/syslog)
caused by:
+ /usr/local/ripple/venv/bin/python3 test_mrjob.py --step-num=0 --mapper
Traceback (most recent call last):
File "test_mrjob.py", line 2, in <module>
import numpy as np
ImportError: No module named 'numpy'
(from lines 31-35 of ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/stderr)
while reading input from s3://data-essence/databerries-01/extract_essence_000000000001.gz
Step 1 of 1 failed
Killing our SSH tunnel (pid 1288)
Terminating cluster: j-T8UUFEZILJYQ
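One way to narrow down the `ImportError` would be to log, from inside the job itself, which interpreter Hadoop actually launched and whether it can see numpy. A minimal sketch of such a check (the helper name is mine, not from the original code):

```python
import sys
import importlib.util

def report_env(module_name):
    # Report which interpreter is running and whether a module is importable;
    # printing to stderr makes this show up in the task's stderr log on EMR.
    found = importlib.util.find_spec(module_name) is not None
    print("interpreter:", sys.executable, file=sys.stderr)
    print(module_name, "importable:", found, file=sys.stderr)
    return found

report_env("numpy")
```

If the reported interpreter is not `/usr/local/ripple/venv/bin/python3`, or numpy is not importable there, the bootstrap installs and `python_bin` are out of sync.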