Python / Amazon EMR: strange dependency problem with PySpark

I'm having trouble getting a PySpark job to run on an EMR cluster, so I logged into the master node and ran spark-submit directly there. I have a Python file that I submit to PySpark, and in this file I have:
import subprocess
from pyspark import SparkContext, SparkConf
import boto3
from boto3.s3.transfer import S3Transfer
import os, re
import tarfile
import time
...
When I try to run in cluster mode, I get the error shown at the bottom of this post (taken from the YARN logs, trimmed for brevity).
However, if I open a pyspark shell on the master node and do:
import boto3
client = boto3.client("s3")
it works fine.

Is there some virtualenv involved here? I'm totally confused.
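A quick way to confirm that the driver and the executors see different Python environments is to run the same probe in both places. This is a sketch, not EMR-specific; the Spark call is shown in a comment and assumes the usual `sc` from the pyspark shell:

```python
import importlib.util
import socket
import sys

def probe(module_name):
    """Report this interpreter's host, executable, and whether a module
    can be imported. Run it on the driver, then inside a Spark map to
    compare the two environments."""
    return (
        socket.gethostname(),
        sys.executable,
        importlib.util.find_spec(module_name) is not None,
    )

# On the driver (e.g. in the pyspark shell), boto3 may import fine:
print(probe("boto3"))

# To check the executors, ship the same function through a job:
#   sc.parallelize(range(8)).map(lambda _: probe("boto3")).collect()
# If the executors lack boto3, the third field comes back False for them.
```

If the executors report `False` while the driver reports `True`, the package is only installed on the master.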
EDIT

Forgot to mention that I'm using the latest EMR release, with Spark 1.6.0. Also, this works fine in local mode on my own machine.

Well, derp, I found the problem. It turns out I had to

pip install boto3

because by default the EMR nodes don't have it installed.
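As an alternative to installing the package on every node, a pure-Python dependency can also be shipped with the job itself via `spark-submit --py-files` or `SparkContext.addPyFile`. A hedged sketch: `zip_package` is a helper I'm making up here, and note that boto3 also pulls in botocore, jmespath, etc., which would need the same treatment:

```python
import importlib
import os
import shutil

def zip_package(module_name, out_dir):
    """Zip an installed pure-Python package so Spark can ship it to the
    executors (via spark-submit --py-files or SparkContext.addPyFile)."""
    pkg = importlib.import_module(module_name)
    pkg_dir = os.path.dirname(pkg.__file__)
    # Keep the top-level package directory inside the archive so that
    # `import <module_name>` works once the zip lands on sys.path.
    return shutil.make_archive(
        os.path.join(out_dir, module_name), "zip",
        root_dir=os.path.dirname(pkg_dir), base_dir=module_name)

# Hypothetical usage on the EMR master, where boto3 is pip-installed:
#   sc.addPyFile(zip_package("boto3", "/tmp"))
#   # ...repeat for botocore and its dependencies; executors can then
#   # import them without anything being installed node-by-node.
```

This avoids touching the cluster nodes, at the cost of having to enumerate the dependency tree yourself.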
This is a case where the error is actually quite descriptive.
Regardless of your problem, consider using the spark-ec2 scripts instead of EMR.

That really isn't the point; the whole idea behind EMR is to provide exactly this ease of use, and it should be supported out of the box. Sure, I could do that, but I'd have to use an install script, which is just more of a pain.

The traceback from the YARN logs:

Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-39-79.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
ImportError: No module named boto3.s3.transfer
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
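For what it's worth, the traceback above fails inside pickle.loads because PySpark serializes the task function on the driver and deserializes it inside each executor's Python worker, and the modules that function references must be importable there. A minimal sketch of the mechanism, using a stdlib module in place of boto3:

```python
import pickle

def parse(line):
    # The import runs on whichever interpreter executes the function;
    # with boto3, this is where workers that lack the package raise
    # "ImportError: No module named boto3.s3.transfer".
    import json
    return json.loads(line)

# Roughly what Spark does: serialize the task function on the driver...
payload = pickle.dumps(parse)
# ...and deserialize it in the worker process before calling it.
restored = pickle.loads(payload)
print(restored('{"a": 1}'))  # -> {'a': 1}
```

The master having boto3 installed only helps the driver side; every node that runs a Python worker needs it too.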
The boto3/botocore versions in play:

boto3==1.2.3
botocore==1.3.23