Google Cloud Platform Dataproc: overriding executor memory
We used to run Spark jobs on a Hadoop cluster with the following parameters:
{
    'conn_id': 'spark_default',
    'num_executors': 10,
    'executor_cores': 4,
    'executor_memory': '15G',
    'driver_memory': '8G',
    'conf': {
        'spark.yarn.executor.memoryOverhead': '10G'
    }
}
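For reference, these shorthand keys match the keyword arguments of Airflow's SparkSubmitOperator, which is presumably how the job was submitted on Hadoop (the conn_id of 'spark_default' suggests as much). A minimal sketch of that task under this assumption, reusing FOLDER and dag from the Dataproc tasks further down:

from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

submit_job = SparkSubmitOperator(
    task_id='submit-spark-job',
    conn_id='spark_default',
    application='%s/dist/main.py' % FOLDER,  # same entry point as the Dataproc task below
    num_executors=10,
    executor_cores=4,
    executor_memory='15G',
    driver_memory='8G',
    # extra Spark properties, including the YARN memory overhead, go through conf
    conf={'spark.yarn.executor.memoryOverhead': '10G'},
    dag=dag,
)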
We are now moving the jobs to Dataproc and cannot replicate the same configuration.
We set up a cluster with enough vCPUs and memory:
create_cluster = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create-%s' % CLUSTER_NAME,
    cluster_name=CLUSTER_NAME,
    project_id=PROJECT_ID,
    num_workers=2,
    num_preemptible_workers=3,
    num_masters=1,
    master_machine_type='n1-highmem-8',
    worker_machine_type='n1-highmem-8',
    subnetwork_uri='projects/#####/regions/europe-west1/subnetworks/prod',
    custom_image='spark-instance',
    master_disk_size=50,
    worker_disk_size=50,
    storage_bucket='#####-dataproc-tmp',
    region='europe-west1',
    zone='europe-west1-b',
    auto_delete_ttl=7200,
    dag=dag
)
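(For scale: each n1-highmem-8 machine has 8 vCPUs and 52 GB of RAM, so the 2 workers plus 3 preemptible workers provide 5 × 8 = 40 vCPUs and 5 × 52 = 260 GB, which nominally covers the requested 10 executors × 4 cores = 40 cores and 10 × (15G + 10G) = 250 GB, before YARN reserves memory for itself.)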
job = dataproc_operator.DataProcPySparkOperator(
    task_id=TASK_ID,
    project_id=PROJECT_ID,
    cluster_name=CLUSTER_NAME,
    region='europe-west1',
    main='%s/dist/main.py' % FOLDER,
    pyfiles='%s/dist/jobs.zip' % FOLDER,
    dataproc_pyspark_properties=spark_args,
    arguments=JOBS_ARGS,
    dag=dag
)
using
spark_args_powerplus = {
    'num_executors': '10',
    'executor_cores': '4',
    'executor_memory': '15G',
    'executor_memoryoverhead': '10G'
}
The executor memory overhead does not seem to be taken into account, and the job fails. Is there a default in Dataproc that we are missing?

Dataproc does not understand these shorthand properties. For example, num_executors should instead be spark.executor.instances.
Could you try passing the following as dataproc_pyspark_properties:
spark_args_powerplus = {
    'spark.executor.instances': '10',
    'spark.executor.cores': '4',
    'spark.executor.memory': '15G',
    'spark.executor.memoryOverhead': '10G'
}
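By the same mapping, the driver_memory setting from the original Hadoop config would become spark.driver.memory. A sketch of the complete dictionary, passed exactly as before through dataproc_pyspark_properties (the last entry is added here by analogy, not taken from the answer above):

spark_args_powerplus = {
    'spark.executor.instances': '10',
    'spark.executor.cores': '4',
    'spark.executor.memory': '15G',
    'spark.executor.memoryOverhead': '10G',
    'spark.driver.memory': '8G',  # assumed carry-over of driver_memory from the original config
}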