Apache Spark: passing a file to the application jar in yarn-cluster mode

I am deploying a Spark application in cluster mode with the following command:

spark-submit --master yarn --deploy-mode cluster --class com.rocai.controller.Controller --jars <absolute-path-to-ojdbc6.jar> --driver-memory 1g --executor-memory 1g --num-executors 2 --executor-cores 2 <absolute-path-to-app.jar> <absolute-path-to-controller.xml>

controller.xml is an argument to app.jar. I always get a "file not found" exception for the controller.xml file. I even tried passing controller.xml with the --files flag, as follows:

spark-submit --master yarn --deploy-mode cluster --class com.rocai.controller.Controller --jars <absolute-path-to-ojdbc6.jar> --driver-memory 1g --executor-memory 1g --num-executors 2 --executor-cores 2 <absolute-path-to-app.jar> <absolute-path-to-controller.xml> --files <absolute-path-to-controller.xml>

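This is probably because the controller.xml file is not uploaded to the application container. As I understand it, in yarn-cluster mode the driver process is launched on an arbitrary node in the cluster. Looking at the logs, I can see app.jar, ojdbc6.jar, hadoop_conf.zip and spark-assembly.jar being uploaded to the container. How can I make sure that controller.xml is also uploaded to the YARN container?

I may be misunderstanding something here, so any help would be much appreciated.

Thanks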

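According to the Spark documentation, your application can open the file locally as long as every node has a copy of it at the same absolute path.

As for uploading the file when you submit the application, I think the --files parameter has to be passed before the jar. spark-submit treats everything after the application jar as arguments to the application itself, so a --files placed after the jar never reaches spark-submit. The command would look like this: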

spark-submit \
--master yarn \
--deploy-mode cluster \
--class com.rocai.controller.Controller \
--jars <absolute-path-to-ojdbc6.jar> \
--driver-memory 1g \
--executor-memory 1g \
--num-executors 2 \
--executor-cores 2 \
--files <absolute-path-to-controller.xml> \
<absolute-path-to-app.jar> <absolute-path-to-controller.xml>

Spark also has a class for accessing distributed files, although I have not used it myself.
Another workaround is to upload the controller XML manually to a fixed path on HDFS.

For reference, here is the command I ended up using to run the Spark job:
spark-submit \
--class com.rocai.controller.Controller \
--master yarn \
--deploy-mode cluster \
--jars /usr/hdp/current/spark-client/ojdbc6.jar,\
/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,\
/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,\
/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar \
--files controller.xml,/usr/hdp/current/spark-client/conf/hive-site.xml \
--driver-memory 1g --executor-memory 1G --num-executors 2 --executor-cores 1 \
app.jar \
controller.xml

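It seems to be necessary to include the datanucleus jars and hive-site.xml to avoid "class not found" exceptions. Also make sure there are no spaces between the comma-separated values.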
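The "no spaces between comma-separated values" point is easy to get wrong when the list is typed by hand. A minimal sketch of one way to build the list is shown here; `join_by_comma` is a hypothetical helper name, not part of Spark, and the jar paths are simply the ones from the command above.

```shell
# Hypothetical helper: join all arguments with commas and no whitespace.
join_by_comma() {
  # "$*" joins the positional parameters with the first character of IFS.
  # The subshell keeps the IFS change from leaking into the caller.
  ( IFS=','; printf '%s\n' "$*" )
}

# Build the value for --jars from individual paths.
jars_arg=$(join_by_comma \
  /usr/hdp/current/spark-client/ojdbc6.jar \
  /usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar)

# Print the resulting option for inspection instead of running spark-submit.
echo "--jars $jars_arg"
```

The same helper could be used for the --files value, since that option takes a comma-separated list as well.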

As a possible workaround, you can package the .xml file as a jar and refer to it by its resource location.

That would mean rebuilding app.jar every time I have a new controller file, which is not feasible for me. yarn-client mode works just fine, though. I wonder whether this is a feature that spark-submit does not support in yarn-cluster mode.

No need to rebuild. You can add it to the classpath as a separate jar.

Thanks for pointing out the order in which arguments are passed to spark-submit! I was already using the --files parameter, but after the app.jar argument. Now the required files are uploaded to the application container. However, I no longer pass the absolute path of controller.xml to the application jar, only the file name. SparkContext picks up the required file, since all resources are in the same container as the application jar.

Glad to help. I ran into this problem before as well. In my case Spark could not find my Hive metastore, and the problem was that I was passing hive-site.xml through the --files parameter, but after the jar.
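The two fixes reported in this thread (options before the application jar, and only the bare file name as the application argument) can be sketched as a small shell function. This is an illustration under assumptions rather than a definitive recipe: `build_submit_args` is a hypothetical name, the paths in the example call are placeholders, and the function only prints the argument list, one argument per line, instead of invoking spark-submit.

```shell
# Hypothetical helper: print spark-submit arguments in a safe order.
# $1 = path to the application jar, $2 = local path to controller.xml
build_submit_args() {
  app_jar=$1
  controller_xml=$2
  # Options come first. Everything after the application jar is handed
  # to the application as its own arguments, so --files must precede it.
  # The application receives only the base name of the file, matching
  # what the asker reported as working in yarn-cluster mode.
  printf '%s\n' \
    --master yarn \
    --deploy-mode cluster \
    --class com.rocai.controller.Controller \
    --files "$controller_xml" \
    "$app_jar" \
    "$(basename "$controller_xml")"
}

# Example call with placeholder paths; prints one argument per line.
build_submit_args /path/to/app.jar /path/to/controller.xml
```

Printing the arguments first makes it easy to check the ordering before running the real command.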