Running a PySpark script with dependencies (Python, Apache Spark, PySpark, YARN)

On a CDH 6.2.0 cluster with Spark 2.4.0 and YARN, I'm trying to submit a Python script with PySpark (in the future I'll need Oozie to do the submission). The cluster nodes are heterogeneous (some run CentOS, some Debian) and the Python installations differ, but Python 3.6 is installed on every node, even if at different paths on the different operating systems. The script has some external dependencies, and I'm not able to replicate the same setup on every node, not even by using Conda or by configuring an ad-hoc virtual environment on each node.

Looking for a solution to my problem, I found two main ways to handle this situation:

  • Instantiating an isolated virtual environment on the executor nodes when I submit the script, by providing a requirements.txt file and setting spark.pyspark.virtualenv.enabled and spark.pyspark.virtualenv.type
  • Distributing the dependencies with the --py-files or --archives options
I tried both solutions, and both failed.

A simple example of the script I'm trying to submit (it only imports the fieldclimate package, to test whether the import works):
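(The script body was lost from this page; what follows is a hypothetical minimal sketch consistent with the tracebacks further down, which place the pyspark import at line 3 and the fieldclimate import at line 12 of script_climate.py. Everything apart from those two imports is an assumption.)

# script_climate.py -- hypothetical reconstruction: only the two imports
# below are confirmed by the tracebacks later in this post
from pyspark import SparkConf
from pyspark.sql import SparkSession

from fieldclimate import FieldClimateClient

if __name__ == "__main__":
    # reaching this point at all proves that the import succeeded
    spark = SparkSession.builder.config(conf=SparkConf().setAppName("script_climate")).getOrCreate()
    print("fieldclimate import OK:", FieldClimateClient)
    spark.stop()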

This is an example of the requirements.txt file:

asks==2.3.6
async-generator==1.10
contextvars==2.4
h11==0.9.0
immutables==0.11
numpy==1.17.4
pandas==0.25.3
pycryptodome==3.9.4
python-dateutil==2.8.1
python-fieldclimate==1.3
pytz==2019.3
six==1.13.0
sniffio==1.1.0
As a first approach, I tried to use a virtual environment. This is the command I used:
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.path=/home/nsantolini/test_pyspark/sensors/virtual_env \
  --conf spark.pyspark.python=`which python3.6` \
  script_climate.py

I tried different combinations of the configuration options, but the import does not work:

Traceback (most recent call last):
  File "/home/nsantolini/test_pyspark/sensors/script_climate.py", line 12, in <module>
    from fieldclimate import FieldClimateClient
ModuleNotFoundError: No module named 'fieldclimate'
19/11/22 15:51:10 INFO util.ShutdownHookManager: Shutdown hook called
19/11/22 15:51:10 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-7caa321f-ebe3-42b4-98de-3984946a2534
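
As a second approach, I tried to create a virtual environment locally from the same requirements.txt, zip it, and distribute it to the executors through the --archives option.

(The commands used to build the archive are not shown in the post; one plausible sequence, written so that the paths match the spark-submit invocations below, would be the sketch that follows. YARN unpacks virtual_env.zip into a directory named after the #virtual_env alias, so the interpreter has to end up at ./virtual_env/bin/python3.6.)

# build a Python 3.6 environment with the pinned dependencies
python3.6 -m venv virtual_env
source virtual_env/bin/activate
pip install -r requirements.txt
deactivate
# zip the environment's contents so that bin/ sits at the archive root
cd virtual_env && zip -r ../virtual_env.zip . && cd ..

In client mode I submitted with: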

PYSPARK_DRIVER_PYTHON=`which python3.6` \
PYSPARK_PYTHON=./virtual_env/bin/python3.6 \
spark-submit \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./virtual_env/bin/python3.6 \
  --master yarn \
  --deploy-mode client \
  --archives virtual_env.zip#virtual_env \
  script_climate.py

Producing:

Traceback (most recent call last):
  File "/home/nsantolini/test_pyspark/script_climate.py", line 12, in <module>
    from fieldclimate import FieldClimateClient
ModuleNotFoundError: No module named 'fieldclimate'
19/11/22 16:29:16 INFO util.ShutdownHookManager: Shutdown hook called
19/11/22 16:29:16 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a8f2c3d8-1440-4741-81b9-8480316ea0e6

While in cluster mode, with:

PYSPARK_PYTHON=./virtual_env/bin/python3.6 \
spark-submit \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./virtual_env/bin/python3.6 \
  --master yarn \
  --deploy-mode cluster \
  --archives virtual_env.zip#virtual_env \
  script_climate.py

it produces (from the yarn logs):

******************************************************************************

Container: container_1572511978601_0446_01_000001 on my_host_8041
LogAggregationType: AGGREGATED
======================================================================================
LogType:stderr
LogLastModifiedTime:Fri Nov 22 16:24:41 +0100 2019
LogLength:2971
LogContents:
19/11/22 16:24:37 INFO util.SignalUtils: Registered signal handler for TERM
19/11/22 16:24:37 INFO util.SignalUtils: Registered signal handler for HUP
19/11/22 16:24:37 INFO util.SignalUtils: Registered signal handler for INT
19/11/22 16:24:38 INFO spark.SecurityManager: Changing view acls to: yarn,nsantolini
19/11/22 16:24:38 INFO spark.SecurityManager: Changing modify acls to: yarn,nsantolini
19/11/22 16:24:38 INFO spark.SecurityManager: Changing view acls groups to: 
19/11/22 16:24:38 INFO spark.SecurityManager: Changing modify acls groups to: 
19/11/22 16:24:38 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, nsantolini); groups with view permissions: Set(); users  with modify permissions: Set(yarn, nsantolini); groups with modify permissions: Set()
19/11/22 16:24:38 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1572511978601_0446_000001
19/11/22 16:24:38 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
19/11/22 16:24:38 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
19/11/22 16:24:38 ERROR yarn.ApplicationMaster: User application exited with status 1
19/11/22 16:24:38 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User application exited with status 1)
19/11/22 16:24:38 ERROR yarn.ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Exception thrown in awaitResult: 
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
        at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:447)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:275)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:799)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:798)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:798)
        at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: org.apache.spark.SparkUserAppException: User application exited with 1
        at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:106)
        at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:667)
19/11/22 16:24:38 INFO util.ShutdownHookManager: Shutdown hook called

End of LogType:stderr
***********************************************************************

Container: container_1572511978601_0446_01_000001 on my_host_8041
LogAggregationType: AGGREGATED
======================================================================================
LogType:stdout
LogLastModifiedTime:Fri Nov 22 16:24:41 +0100 2019
LogLength:185
LogContents:
Traceback (most recent call last):
  File "script_climate.py", line 3, in <module>
    from pyspark import SparkConf
zipimport.ZipImportError: can't decompress data; zlib not available

End of LogType:stdout
***********************************************************************