org.apache.spark.SparkException: No port number in pyspark.daemon's stdout

I am running a spark-submit job on a Hadoop YARN cluster:

spark-submit /opt/spark/examples/src/main/python/pi.py 1000

but I am getting the error below. It looks like the Python worker never starts:

  2018-12-20 07:25:14 INFO  SparkContext:54 - Created broadcast 0 from    broadcast at DAGScheduler.scala:1161
  2018-12-20 07:25:14 INFO  DAGScheduler:54 - Submitting 1000 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /opt/spark/examples/src/main/python/pi.py:44) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
   2018-12-20 07:25:14 INFO  YarnScheduler:54 - Adding task set 0.0 with 1000 tasks
   2018-12-20 07:25:14 INFO  TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, hadoop-slave2, executor 1, partition 0, PROCESS_LOCAL, 7863 bytes)
   2018-12-20 07:25:14 INFO  TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, hadoop-slave1, executor 2, partition 1, PROCESS_LOCAL, 7863 bytes)
  2018-12-20 07:25:15 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on hadoop-slave2:37217 (size: 4.2 KB, free: 93.3 MB)
  2018-12-20 07:25:15 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on hadoop-slave1:35311 (size: 4.2 KB, free: 93.3 MB)
2018-12-20 07:25:15 INFO  TaskSetManager:54 - Starting task 2.0 in stage 0.0 (TID 2, hadoop-slave2, executor 1, partition 2, PROCESS_LOCAL, 7863 bytes)
2018-12-20 07:25:15 INFO  TaskSetManager:54 - Starting task 3.0 in stage 0.0 (TID 3, hadoop-slave1, executor 2, partition 3, PROCESS_LOCAL, 7863 bytes)
2018-12-20 07:25:16 WARN  TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, hadoop-slave2, executor 1): org.apache.spark.SparkException: 
Error from python worker:
Traceback (most recent call last):
File "/usr/lib64/python2.6/runpy.py", line 104, in _run_module_as_main
  loader, code, fname = _get_module_details(mod_name)
File "/usr/lib64/python2.6/runpy.py", line 79, in _get_module_details
  loader = get_loader(mod_name)
File "/usr/lib64/python2.6/pkgutil.py", line 456, in get_loader
  return find_loader(fullname)
File "/usr/lib64/python2.6/pkgutil.py", line 466, in find_loader
  for importer in iter_importers(fullname):
File "/usr/lib64/python2.6/pkgutil.py", line 422, in iter_importers
  __import__(pkg)
 File "/tmp/hadoop-hdfs/nm-local-dir/usercache/hdfs/appcache/application_1545288386209_0005/container_1545288386209_0005_01_000002/pyspark.zip/pyspark/__init__.py", line 51, in <module>
File "/tmp/hadoop-hdfs/nm-local-dir/usercache/hdfs/appcache/application_1545288386209_0005/container_1545288386209_0005_01_000002/pyspark.zip/pyspark/context.py", line 31, in <module>
File "/tmp/hadoop-hdfs/nm-local-dir/usercache/hdfs/appcache/application_1545288386209_0005/container_1545288386209_0005_01_000002/pyspark.zip/pyspark/accumulators.py", line 97, in <module>
File "/tmp/hadoop-hdfs/nm-local-dir/usercache/hdfs/appcache/application_1545288386209_0005/container_1545288386209_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 71, in <module>
File "/tmp/hadoop-hdfs/nm-local-dir/usercache/hdfs/appcache/application_1545288386209_0005/container_1545288386209_0005_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 246, in <module>
File "/tmp/hadoop-hdfs/nm-local-dir/usercache/hdfs/appcache/application_1545288386209_0005/container_1545288386209_0005_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 270, in CloudPickler
 NameError: name 'memoryview' is not defined
 PYTHONPATH was:
 /tmp/hadoop-hdfs/nm-local-dir/usercache/hdfs/filecache/21/__spark_libs__3793296165132209773.zip/spark-core_2.11-2.4.0.jar:    /tmp/hadoop-hdfs/nm-local-dir/usercache/hdfs/appcache/application_1545288386209_0005/container_1545288386209_0005_01_000002/pyspark.zip:/tmp/hadoop-hdfs/nm-local-dir/usercache/hdfs/appcache/application_1545288386209_0005/container_1545288386209_0005_01_000002/py4j-0.10.7-src.zip
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
at   org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at         org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)

I believe this issue occurs when the Python versions don't match. In your traceback the worker is being started with /usr/lib64/python2.6, and memoryview only exists in Python 2.7 and later, so the executors are picking up an interpreter that is too old for PySpark.
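
A quick way to confirm this is to check what the plain python command resolves to on each worker node; a minimal sketch, assuming SSH access to the hostnames seen in the log above:

# print the version and path of the default "python" on every worker node
for host in hadoop-slave1 hadoop-slave2; do
    ssh "$host" 'python -V; which python'
done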

Adding the following to my ~/.bash_profile worked for me:

alias spark-submit='PYSPARK_PYTHON=$(which python) spark-submit'

It should force Spark to use the same version of Python that you have loaded in your environment.
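
If you would rather not touch ~/.bash_profile, the same thing can be done per run by setting the variable only for that command; a sketch, where /usr/bin/python2.7 is an assumed path that must point to a 2.7+ interpreter installed on every node:

# one-off equivalent of the alias above; adjust the interpreter path to your cluster
PYSPARK_PYTHON=/usr/bin/python2.7 \
    spark-submit --master yarn /opt/spark/examples/src/main/python/pi.py 1000

On Spark 2.1 and later the same thing can also be expressed as a configuration property, e.g. --conf spark.pyspark.python=/usr/bin/python2.7.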

This fixed the irritatingly obscure "invalid port number" error hidden in pyspark.daemon's stdout.

Setting PYSPARK_PYTHON=$(which python) in spark-env.sh should work as well; I guess we shouldn't have to resort to the alias.
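
For completeness, a minimal sketch of that spark-env.sh variant; the interpreter path is again an assumption, and conf/spark-env.sh has to point at an interpreter that exists on every node:

# $SPARK_HOME/conf/spark-env.sh
export PYSPARK_PYTHON=/usr/bin/python2.7           # interpreter the executors run
export PYSPARK_DRIVER_PYTHON=/usr/bin/python2.7    # interpreter the driver runs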