
Apache Spark: how do I pass Spark parameters to a Dataproc workflow template?

Tags: apache-spark, google-cloud-platform, pyspark, google-cloud-dataproc

Here is what I have:

gcloud dataproc workflow-templates create $TEMPLATE_ID --region $REGION

gcloud beta dataproc workflow-templates set-managed-cluster $TEMPLATE_ID --region $REGION --cluster-name dailyhourlygtp --image-version 1.5 \
--master-machine-type=n1-standard-8 --worker-machine-type=n1-standard-16 --num-workers=10 --master-boot-disk-size=500 \
--worker-boot-disk-size=500 --zone=europe-west1-b


export STEP_ID=step_pyspark1

gcloud dataproc workflow-templates add-job pyspark \
gs://$BUCKET_NAME/my_pyscript.py \
--step-id $STEP_ID \
--workflow-template $TEMPLATE_ID \
--region $REGION \
--jar=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
--initialization-actions gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
--properties spark.jars.packages=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar

gcloud dataproc workflow-templates instantiate $TEMPLATE_ID --region=$REGION
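
As an optional sanity check, the template these commands build, including the job and any properties attached to it, can be printed (by default as YAML) with describe:

gcloud dataproc workflow-templates describe $TEMPLATE_ID --region $REGION
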
So the question here is: how do I pass the following Spark parameters to my my_pyscript.py?

--master yarn     --deploy-mode cluster     --conf "spark.sql.shuffle.partitions=900" 
--conf "spark.sql.autoBroadcastJoinThreshold=10485760" --conf "spark.executor.memoryOverhead=8192" 
--conf "spark.dynamicAllocation.enabled=true" --conf "spark.shuffle.service.enabled=true" 
--executor-cores 5 --executor-memory 15g --driver-memory 16g

This is described in the documentation of the --properties flag:

--properties=[PROPERTY=VALUE,…]
List of key-value pairs to configure PySpark. For a list of available properties, see Spark's configuration documentation.

So, just as you would with a Dataproc cluster without a template, the Spark properties of the submitted job are passed through the --properties flag as a list of key=value pairs.
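
For comparison, this is roughly what the non-template submission looks like; the cluster name my-cluster is a placeholder, not a value from the question:

gcloud dataproc jobs submit pyspark gs://$BUCKET_NAME/my_pyscript.py \
--cluster=my-cluster \
--region=$REGION \
--properties=spark.executor.memory=15g,spark.driver.memory=16g \
-- arg1 arg2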

If your Python job needs arguments, you specify them to the right of the positional arguments (after the -- separator, as in the example below), separated by spaces.

For example, you could do the following:

gcloud dataproc workflow-templates add-job pyspark \
gs://$BUCKET_NAME/my_pyscript.py \
--step-id $STEP_ID \
--workflow-template $TEMPLATE_ID \
--region $REGION \
--properties=spark.submit.deployMode=cluster,\
spark.sql.shuffle.partitions=900,\
spark.sql.autoBroadcastJoinThreshold=10485760,\
spark.executor.memoryOverhead=8192,\
spark.dynamicAllocation.enabled=true,\
spark.shuffle.service.enabled=true,\
spark.executor.memory=15g,\
spark.driver.memory=16g,\
spark.executor.cores=5 \
-- arg1 arg2    # for named args: --arg1 arg1 --arg2 arg2
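
On the job side, nothing extra is needed to pick these values up: the properties passed with --properties are already part of the job's Spark configuration when the SparkSession is created, and arg1/arg2 arrive as ordinary command-line arguments. A minimal sketch of what my_pyscript.py could look like (the printed property and the way the two arguments are read are illustrative assumptions, not code from the question):

import sys

from pyspark.sql import SparkSession

# getOrCreate() picks up the configuration the job was submitted with,
# including everything passed through --properties in the template.
spark = SparkSession.builder.appName("my_pyscript").getOrCreate()

# Arguments given after "--" in add-job arrive as normal command-line args.
arg1, arg2 = sys.argv[1], sys.argv[2]

# Illustrative check that a property from the template is in effect.
print("shuffle partitions:", spark.conf.get("spark.sql.shuffle.partitions"))
print("received args:", arg1, arg2)

spark.stop()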