Python Amazon EMR: Pyspark has strange dependency issues


I'm having trouble getting a pyspark job to run on an EMR cluster, so I logged into the master node and ran spark-submit there directly.

I have a Python file that I submit to pyspark, and at the top of that file I have:

import subprocess
from pyspark import SparkContext, SparkConf
import boto3
from boto3.s3.transfer import S3Transfer
import os, re
import tarfile
import time
...
When I try to run it in cluster mode, I get the following (from the YARN logs, trimmed for brevity):

Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-39-79.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
ImportError: No module named boto3.s3.transfer

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

If I open a pyspark shell on the master node and do the following:

import boto3
client = boto3.client("s3")
it works fine.

Is there some virtualenv involved here? I'm totally confused.

EDIT: Forgot to mention that I'm using the latest EMR release, with Spark 1.6.0.


Also, this works fine on my own machine in local mode.
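
What matters here is where the code runs: the shell snippet above executes only on the driver (the master node), while any function used inside an RDD transformation is pickled, shipped to the executors, and unpickled there, and that unpickling is what triggers the boto3 import on the worker nodes. A minimal sketch of the pattern (hypothetical code with a made-up bucket name, not the actual job):

import boto3
from pyspark import SparkContext

def object_size(key):
    # Runs on an executor, so boto3 has to be importable on that node too
    client = boto3.client("s3")
    return key, client.head_object(Bucket="my-bucket", Key=key)["ContentLength"]

sc = SparkContext(appName="boto3-on-executors")
# object_size is pickled and sent to the executors; unpickling it on a worker
# that has no boto3 installed raises the same kind of ImportError shown above.
sizes = sc.parallelize(["a.tar.gz", "b.tar.gz"]).map(object_size).collect()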

Well, derp, I figured out the problem.

It turns out I had to run pip install boto3; EMR nodes don't have it installed by default.
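
A quick way to confirm the fix actually reached the worker nodes (a sketch assuming an active SparkContext named sc) is to force the import on the executors and collect the boto3 version each one reports:

# Each task imports boto3 on its executor and reports the version it sees;
# a single distinct value means every node now has the package installed.
versions = (sc.parallelize(range(8), 8)
            .map(lambda _: __import__("boto3").__version__)
            .distinct()
            .collect())
print(versions)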


This is a case of the error actually being very descriptive.

Notwithstanding your problem, consider using your own install script rather than relying on EMR. That's not really the point, though; the whole idea behind the project is that EMR provides this kind of ease, and it ought to be supported out of the box. Sure, I could do that, but I'd have to use an install script, which is just more of a pain.
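
If you do go the install-script route, the usual shape is an EMR bootstrap action that runs pip install boto3 on every node when the cluster comes up. A rough sketch using boto3 itself to launch such a cluster (the bucket path, instance types, and release label below are placeholders, not values from the post):

import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="pyspark-with-boto3",
    ReleaseLabel="emr-4.3.0",      # placeholder: a 4.x release that ships Spark 1.6.0
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # install-boto3.sh would contain little more than: sudo pip install boto3
    BootstrapActions=[{
        "Name": "install boto3 on every node",
        "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap/install-boto3.sh"},
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])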
For reference, the versions involved were boto3==1.2.3 and botocore==1.3.23.