Executor heartbeat timed out after 125009 ms when running a Spark job on a Dataproc cluster

Below is how I create my Dataproc cluster. When setting the properties I tried to handle network timeouts by assigning 3600 (spark.network.timeout=3600s), yet the executor heartbeat still timed out after 125009 ms. Why is this happening, and what can be done to avoid it?

default_parallelism=512

PROPERTIES="\
spark:spark.executor.cores=2,\
spark:spark.executor.memory=8g,\
spark:spark.executor.memoryOverhead=2g,\
spark:spark.driver.memory=6g,\
spark:spark.driver.maxResultSize=6g,\
spark:spark.kryoserializer.buffer=128m,\
spark:spark.kryoserializer.buffer.max=1024m,\
spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,\
spark:spark.default.parallelism=${default_parallelism},\
spark:spark.rdd.compress=true,\
spark:spark.network.timeout=3600s,\
spark:spark.rpc.message.maxSize=256,\
spark:spark.io.compression.codec=snappy,\
spark:spark.shuffle.service.enabled=true,\
spark:spark.sql.shuffle.partitions=256,\
spark:spark.sql.files.ignoreCorruptFiles=true,\
yarn:yarn.nodemanager.resource.cpu-vcores=8,\
yarn:yarn.scheduler.minimum-allocation-vcores=2,\
yarn:yarn.scheduler.maximum-allocation-vcores=4,\
yarn:yarn.nodemanager.vmem-check-enabled=false,\
capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
  "

gcloud beta dataproc clusters create $CLUSTER_NAME  \
    --zone $ZONE \
    --region $REGION \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-size 500 \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-size 500 \
    --num-workers 3 \
    --bucket $GCS_BUCKET \
    --image-version 1.4-ubuntu18 \
    --optional-components=ANACONDA,JUPYTER \
    --subnet=default \
    --enable-component-gateway \
    --properties "${PROPERTIES}" \
    --scopes 'https://www.googleapis.com/auth/cloud-platform'
Here is the error I get:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 11, cluster-abc-z-2.c.project_name.internal, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 125009 ms

You should set spark.executor.heartbeatInterval; its default value is 10s. Executors use these heartbeats to tell the driver they are still alive and to report metrics for in-progress tasks, and the Spark documentation notes that this interval should be kept significantly less than spark.network.timeout.
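
For example, the interval could be raised at job-submission time. This is a minimal sketch, assuming a PySpark job; the file name my_job.py and the 60s value are placeholders, not from the question. Note that at submission time the property keys are passed without the spark: prefix used at cluster creation:

# Sketch: raise the heartbeat interval when submitting the job.
# "my_job.py" and the 60s value are illustrative placeholders.
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster "${CLUSTER_NAME}" \
    --region "${REGION}" \
    --properties spark.executor.heartbeatInterval=60s,spark.network.timeout=3600s

Alternatively, the same setting could be baked into the cluster by adding spark:spark.executor.heartbeatInterval=60s to the PROPERTIES string above; in either case, keep the interval well below spark.network.timeout.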