Bootstrapping dependencies on Amazon EMR with Python mrjob


I am trying to run a MapReduce job with Python mrjob on Amazon EMR, but I am running into problems installing the dependencies.

My job code:

from mrjob.job import MRJob
import re
from normalize import *
from compute_features import *
# some code

The normalize and compute_features files have a lot of dependencies, including numpy, scipy, shapely, fiona, …

My mrjob.conf file:

runners:
    emr:
        aws_access_key_id: xxxx
        aws_secret_access_key: xxxx
        aws_region: eu-west-1
        ec2_key_pair: EMR
        ec2_key_pair_file: /Users/antoinerigoureau/Documents/emr.pem
        ssh_tunnel: true
        ec2_instance_type: m3.xlarge
        ec2_master_instance_type: m3.xlarge
        num_ec2_instances: 1
        cmdenv:
            TZ: Europe/Paris
        bootstrap_python: false
        bootstrap:
        - curl -s https://s3-eu-west-1.amazonaws.com/data-essence/utils/bootstrap.sh | sudo bash -s
        - source /usr/local/ripple/venv/bin/activate
        - sudo pip install -r req.txt#
        upload_archives:
        - /Users/antoinerigoureau/Documents/Essence/data/geoData/urba_france.zip#data
        upload_files:
        - /Users/antoinerigoureau/Documents/Essence/Source/venv_parallel/normalize.py
        - /Users/antoinerigoureau/Documents/Essence/Source/venv_parallel/compute_features.py
        python_bin: /usr/local/ripple/venv/bin/python3
        enable_emr_debugging: True
        setup:
        - source /usr/local/ripple/venv/bin/activate
    local:
        upload_archives:
        - /Users/antoinerigoureau/Documents/Essence/data/geoData/urba_france.zip#data
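One detail worth noting about the `bootstrap` list above: `sudo pip install` resolves pip through root's PATH, so even after `source .../activate` the packages typically land in the system site-packages rather than in the virtualenv that `python_bin` points at. A hedged sketch of an alternative (reusing the venv path from the config above, and assuming the venv's own pip exists at that location):

```yaml
bootstrap:
- curl -s https://s3-eu-west-1.amazonaws.com/data-essence/utils/bootstrap.sh | sudo bash -s
# call the virtualenv's pip directly so the packages end up in the same
# environment that python_bin uses (the trailing '#' is mrjob's
# file-upload syntax, as in the original config)
- sudo /usr/local/ripple/venv/bin/pip install -r req.txt#
```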
My bootstrap.sh file is:

#!/bin/bash
set -e
set -x
yum update -y
# install yum packages
yum install -y gcc \
geos-devel \
gcc-c++ \
atlas-sse3-devel \
lapack-devel \
libpng-devel \
freetype-devel \
zlib-devel \
ncurses-devel \
readline-devel \
patch \
make \
libtool \
curl \
openssl-devel \
screen
pushd $HOME
# install python
rm -rf Python-3.5.1.tgz
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz && \
tar -xzvf Python-3.5.1.tgz
pushd Python-3.5.1
./configure
make -j4
make install
popd
export PATH=/usr/local/bin:$PATH
echo "export PATH=/usr/local/bin:\$PATH" > /etc/profile.d/usr_local_path.sh
chmod +x /etc/profile.d/usr_local_path.sh
pip3.5 install --upgrade pip virtualenv
mkdir -p /usr/local/ripple/venv
virtualenv /usr/local/ripple/venv
source /usr/local/ripple/venv/bin/activate
# install gdal
rm -rf gdal191.zip
wget http://download.osgeo.org/gdal/gdal191.zip && \
unzip gdal191.zip
#
# Here is the trick I had to add to get around the following -fPIC error:
# /usr/bin/ld: /root/gdal-1.9.1/frmts/o/.libs/aaigriddataset.o: relocation R_X86_64_32S against 'vtable for AAIGRasterBand' can not be used when making a shared object; recompile with -fPIC
#
pushd gdal-1.9.1
./configure
CC="gcc -fPIC" CXX="g++ -fPIC" make -j4
make install
popd
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
echo "export LD_LIBRARY_PATH=/usr/local/lib:\$LD_LIBRARY_PATH" > /etc/profile.d/gdal_library_path.sh
chmod +x /etc/profile.d/gdal_library_path.sh
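After a bootstrap like this, a quick way to check whether the virtualenv interpreter can actually see the installed packages is to run a small stdlib-only script with that interpreter (e.g. `/usr/local/ripple/venv/bin/python3` over SSH on the master node). This is an illustrative sketch, not part of the original post:

```python
import importlib

def check_modules(names):
    """Try to import each module name and report success/failure.

    Run this with the same interpreter configured as python_bin
    (e.g. /usr/local/ripple/venv/bin/python3) to see what that
    interpreter can actually import.
    """
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results

if __name__ == "__main__":
    # json is stdlib and should always import; numpy is the one in question
    for name, ok in sorted(check_modules(["json", "numpy"]).items()):
        print("%-10s %s" % (name, "OK" if ok else "MISSING"))
```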
However, my job fails with the following output:

Created new cluster j-T8UUFEZILJYQ
Waiting for step 1 of 1 (s-3SOCF1ZPWJ575) to complete...
  PENDING (cluster is STARTING)
  PENDING (cluster is STARTING)
  PENDING (cluster is STARTING)
  PENDING (cluster is STARTING)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  Opening ssh tunnel to resource manager...
  Connect to resource manager at: http://localhost:40199/cluster
  RUNNING for 16.2s
Unable to connect to resource manager
  RUNNING for 48.8s
  FAILED
Cluster j-T8UUFEZILJYQ is TERMINATING: Shut down as step failed
Attempting to fetch counters from logs...
Looking for step log in /mnt/var/log/hadoop/steps/s-3SOCF1ZPWJ575 on ec2-54-194-248-128.eu-west-1.compute.amazonaws.com...
  Parsing step log: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/steps/s-3SOCF1ZPWJ575/syslog
Counters: 9
    Job Counters 
        Data-local map tasks=1
        Failed map tasks=4
        Launched map tasks=4
        Other local map tasks=3
        Total megabyte-seconds taken by all map tasks=33988320
        Total time spent by all map tasks (ms)=23603
        Total time spent by all maps in occupied slots (ms)=1062135
        Total time spent by all reduces in occupied slots (ms)=0
        Total vcore-seconds taken by all map tasks=23603
Scanning logs for probable cause of failure...
Looking for task logs in /mnt/var/log/hadoop/userlogs/application_1463748945334_0001 on ec2-54-194-248-128.eu-west-1.compute.amazonaws.com and task/core nodes...
  Parsing task syslog: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/syslog
  Parsing task stderr: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/stderr
Probable cause of failure:

R/W/S=1749/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 HOST=null
USER=hadoop
HADOOP_USER=null
last tool output: |null|

java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:65)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)

(from lines 48-72 of ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/syslog)

caused by:

+ /usr/local/ripple/venv/bin/python3 test_mrjob.py --step-num=0 --mapper
Traceback (most recent call last):
  File "test_mrjob.py", line 2, in <module>
    import numpy as np
ImportError: No module named 'numpy'

(from lines 31-35 of ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/stderr)

while reading input from s3://data-essence/databerries-01/extract_essence_000000000001.gz


Step 1 of 1 failed
Killing our SSH tunnel (pid 1288)
Terminating cluster: j-T8UUFEZILJYQ
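The `ImportError: No module named 'numpy'` together with the explicit `/usr/local/ripple/venv/bin/python3` invocation suggests that the interpreter running the task and the interpreter pip installed into are not the same one. When debugging this, a stdlib-only helper (hypothetical, for illustration) can be called from a mapper to log exactly which Python binary and module search path the task is using:

```python
import sys

def interpreter_report():
    """Summarize which Python binary is running and where it searches
    for modules -- compare this against where pip installed numpy."""
    report = ["executable: " + sys.executable,
              "version: " + sys.version.split()[0]]
    for p in sys.path:
        if p:  # skip the empty entry that stands for the current directory
            report.append("path: " + p)
    return "\n".join(report)

if __name__ == "__main__":
    print(interpreter_report())
```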