
Apache Spark: Livy finds no YARN application with tag livy-batch-10-hg3po7kp within 120 seconds


I'm using Livy to execute a script stored in S3 via a POST request launched from EMR. The script runs, but it times out very quickly. I've tried editing the livy.conf configuration, but none of my changes persist. This is the error that comes back:

java.lang.Exception: No YARN application is found with tag livy-batch-10-hg3po7kp in 120 seconds. Please check your cluster status, it is may be very busy.
org.apache.livy.utils.SparkYarnApp.org$apache$livy$utils$SparkYarnApp$$getAppIdFromTag(SparkYarnApp.scala:182)
org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:239)
org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:236)
scala.Option.getOrElse(Option.scala:121)
org.apache.livy.utils.SparkYarnApp$$anonfun$1.apply$mcV$sp(SparkYarnApp.scala:236)
org.apache.livy.Utils$$anon$1.run(Utils.scala:94)
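(As an aside: the 120-second window in that error corresponds to Livy's YARN application lookup timeout. The key below is taken from Livy's livy.conf.template and may vary by version; raising it is one knob to try, assuming your livy.conf edits actually take effect.)

```
# livy.conf (default is 120s)
livy.server.yarn.app-lookup-timeout = 600s
```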

This was a tricky one to figure out, but I was able to get it working with the following command:

curl -X POST --data '{
  "proxyUser": "hadoop",
  "file": "s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/hello.py",
  "jars": ["s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/NQjc.jar"],
  "pyFiles": ["s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/application.zip"],
  "archives": ["s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/venv.zip#venv"],
  "driverMemory": "10g",
  "executorMemory": "10g",
  "name": "Name of Import Job here",
  "conf": {
    "spark.yarn.appMasterEnv.SPARK_HOME": "/usr/lib/spark",
    "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
    "livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
    "spark.yarn.executorEnv.PYSPARK_PYTHON": "./venv/bin/python",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.requirements": "requirements.pip",
    "spark.pyspark.virtualenv.bin.path": "virtualenv",
    "spark.master": "yarn",
    "spark.submit.deployMode": "cluster"
  }
}' -H "Content-Type: application/json" http://MY-PATH--TO-MY--EMRCLUSTER:8998/batches
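To confirm the batch actually registered with YARN, you can poll the same Livy REST API for its state. A quick sketch (the batch id 10 and the host are placeholders; the `state` field is part of Livy's standard batch response):

```shell
# Substitute the id returned by the POST above for 10.
curl -s http://MY-PATH--TO-MY--EMRCLUSTER:8998/batches/10 |
  python3 -c 'import json, sys; print(json.load(sys.stdin)["state"])'
```

While the application is waiting on YARN the state stays `starting`; once it's accepted you should see `running`, then `success` or `dead`.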
Then, after cloning the repository containing the application files, run this script on the EMR cluster's master node to set up the dependencies:

set -e
set -x

export HADOOP_CONF_DIR="/etc/hadoop/conf"
export PYTHON="/usr/bin/python3"
export SPARK_HOME="/usr/lib/spark"
export PATH="$SPARK_HOME/bin:$PATH"


# Set $PYTHON to the Python executable you want to create
# your virtual environment with. It could just be something
# like `python3`, if that's already on your $PATH, or it could
# be a /fully/qualified/path/to/python.
test -n "$PYTHON"

# Make sure $SPARK_HOME is on your $PATH so that `spark-submit`
# runs from the correct location.
test -n "$SPARK_HOME"

"$PYTHON" -m venv venv --copies
source venv/bin/activate
pip install -U pip
pip install -r requirements.pip
deactivate

# Here we package up an isolated environment that we'll ship to YARN.
# The awkward zip invocation for venv just creates nicer relative
# paths.
pushd venv/
zip -rq ../venv.zip *
popd

# Here it's important that application/ be zipped in this way so that
# Python knows how to load the module inside.
zip -rq application.zip application/
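A quick sanity check on the artifact layout (my own addition, not part of the original script): the zip should contain entries under an application/ prefix, since that's what lets Python import the package from the YARN container's working directory:

```shell
# Entries should look like application/__init__.py, application/foo.py, ...
python3 - <<'EOF'
import zipfile
names = zipfile.ZipFile("application.zip").namelist()
assert any(n.startswith("application/") for n in names), names
print("layout ok")
EOF
```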
following the instructions I provided here:

If you run into any problems, check the Livy logs here:

/var/log/livy/livy-livy-server.out

as well as the logs shown in the Hadoop Resource Manager UI, which you can reach from the link in the EMR console once you've tunneled into the EMR master node and set up a web browser proxy.

A key aspect of this solution is that, due to the issue mentioned here, Livy cannot upload files from the local master node when they are supplied via the file, jars, pyFiles, or archives parameters:

So I was able to work around that by referencing files uploaded to S3 via EMRFS. Also, for the virtualenv (if you're using PySpark), it's very important to use the --copies flag; otherwise you end up with symlinks that can't be used from HDFS.
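You can verify that --copies did its job with a quick check (my own sketch, not part of the original workflow): the interpreter inside the venv must be a regular file, not a symlink, for the archive to be usable once unpacked on YARN:

```shell
# Fails if the venv interpreter is still a symlink.
if [ -L venv/bin/python ]; then
  echo "venv/bin/python is a symlink; recreate the venv with --copies" >&2
  exit 1
fi
echo "venv ok: interpreter is a real file"
```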

There are also some known issues with using virtualenv, reported here: these are PySpark-related (and may not apply to you), so I had to work around them by adding extra parameters. Some of them are also mentioned here:


In any case, because of Livy's problems uploading local files, the job kept failing until I worked around the issue by referencing the files in S3 via EMRFS, since Livy couldn't upload them to its staging directory. Also, when I tried supplying absolute paths in HDFS instead of using S3, the HDFS resources were owned by the hadoop user rather than the livy user, so livy couldn't access them or copy them into the staging directory to run the job. Hence it was necessary to reference the files in S3 via EMRFS.

The solution is that you have to check the code in SparkUtil.scala.

The getOrCreate configuration should be active. Otherwise, Livy can't track and close the YARN connection.

For example:

val spark = SparkSession.builder().appName(appName).getOrCreate()

Also, in my case some lines were commented out, and that turned out to be the problem.


It would be great if you could tweak the formatting of your answer a bit (code, links, etc.).