Python AWS EMR Spark: "No module named pyspark"
python, amazon-web-services, apache-spark, amazon-emr
I created a Spark cluster, SSHed into the master node, and launched the shell:
MASTER=yarn-client ./spark/bin/pyspark
When I run the following:
x = sc.textFile("s3://location/files.*")
xt = x.map(lambda x: handlejson(x))
table= sqlctx.inferSchema(xt)
I get the following error:
Error from python worker:
/usr/bin/python: No module named pyspark
PYTHONPATH was:
/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/11/spark-assembly-1.1.0-hadoop2.4.0.jar
java.io.EOFException
java.io.DataInputStream.readInt(DataInputStream.java:392)
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:151)
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:78)
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:54)
org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:97)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
I also checked PYTHONPATH on the master, and it looks fine:
>>> os.environ['PYTHONPATH']
'/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip:/home/hadoop/spark/python/:/home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar'
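Note that the PYTHONPATH printed in the worker error is not the same as the one on the master. A minimal sketch for comparing the two, using only the two path strings quoted in this question (the helper names `pythonpath_entries` and `missing_on_worker` are made up for illustration and are not part of Spark):

```python
import posixpath


def pythonpath_entries(value):
    """Split a Unix-style PYTHONPATH string into its non-empty entries."""
    return [entry for entry in value.split(":") if entry]


def missing_on_worker(driver_value, worker_value):
    """Return driver entries whose final path component is absent on the worker.

    Entries are matched by base name because YARN copies files into its own
    cache directory, so the full paths are expected to differ.
    """
    worker_names = {
        posixpath.basename(entry.rstrip("/"))
        for entry in pythonpath_entries(worker_value)
    }
    return [
        entry
        for entry in pythonpath_entries(driver_value)
        if posixpath.basename(entry.rstrip("/")) not in worker_names
    ]


# Values copied from the shell session and the worker error in this question.
driver = (
    "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip:"
    "/home/hadoop/spark/python/:"
    "/home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar"
)
worker = (
    "/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/11/"
    "spark-assembly-1.1.0-hadoop2.4.0.jar"
)

for entry in missing_on_worker(driver, worker):
    print("only on master:", entry)
```

With the values above, the py4j zip and the spark/python directory show up as present only on the master, so the worker has to load pyspark from the assembly jar alone.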
Looking for pyspark inside the jar shows that it is there:
jar -tf /home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar | grep pyspark
pyspark/
pyspark/shuffle.py
pyspark/resultiterable.py
pyspark/files.py
pyspark/accumulators.py
pyspark/sql.py
pyspark/java_gateway.py
pyspark/join.py
pyspark/serializers.py
pyspark/shell.py
pyspark/rddsampler.py
pyspark/rdd.py
....
Has anyone run into this before? Thanks.
Answer: This corresponds to a known Spark issue. It was fixed in later EMR-based builds of Spark; see the release notes and instructions for those builds.
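A listing from `jar -tf` shows that the files are in the archive, but it does not by itself show that Python can import from it. One thing that may be worth ruling out is whether Python's own zip importer can read the jar. The following is a rough diagnostic sketch, not a fix, using only the standard library; `archive_importable` is a hypothetical helper name, and the paths you pass to it are your own:

```python
import zipfile
import zipimport


def archive_importable(archive_path, package):
    """Check whether `package` is listed in, and importable from, an archive.

    Returns a dict with:
      listed     - zipfile sees at least one entry under "<package>/"
      importable - zipimport can locate the package in the archive
      error      - message of the first error hit, or None
    """
    report = {"listed": False, "importable": False, "error": None}
    try:
        with zipfile.ZipFile(archive_path) as archive:
            prefix = package + "/"
            report["listed"] = any(
                name.startswith(prefix) for name in archive.namelist()
            )
        importer = zipimport.zipimporter(archive_path)
        if hasattr(importer, "find_spec"):
            # Newer Python versions expose find_spec on zipimporter.
            report["importable"] = importer.find_spec(package) is not None
        else:
            # Older Python versions only provide find_module.
            report["importable"] = importer.find_module(package) is not None
    except (zipfile.BadZipFile, zipimport.ZipImportError, OSError) as exc:
        report["error"] = str(exc)
    return report
```

On the master this could be called as `archive_importable("/home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar", "pyspark")`. If `listed` is true but `importable` is false or `error` is set, the interpreter running the check cannot use the jar as an import source, which would be consistent with the worker error above. Note that the workers in this question run `/usr/bin/python`, so the check is most meaningful when run with that same interpreter.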