Google Cloud Platform: How to include a jar URI in the submit-job function on Dataproc
I'm trying to run a PySpark job through Jupyter, and I need to create a function that submits the job. I need to pass a jar file along with it, and I'm trying to figure out how to do that. I did find some documentation about it, but I couldn't work out exactly how to add the URI to the function. My function currently looks like this:
from google.cloud import dataproc_v1

def submit_pyspark_job(dataproc_cluster_client, project, region, cluster_name, bucket_name,
                       filename):
    """Submit the PySpark job to the cluster (assumes `filename` was uploaded
    to `bucket_name`)."""
    job_details = {
        'placement': {
            'cluster_name': cluster_name
        },
        'pyspark_job': {
            'jar_file_uris': 'gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar',  # PROBLEM HERE!
            'main_python_file_uri': 'gs://{}/{}'.format(bucket_name, filename)
        }
    }
    result = dataproc_cluster_client.submit_job(
        project_id=project, region=region, job=job_details)
    job_id = result.reference.job_id
    print('Submitted job ID {}.'.format(job_id))
    return job_id
The problem is in the jar_file_uris part of the job_details parameter; as written, it raises an error. I found the fix: the function should be declared like this instead:
def submit_pyspark_job(dataproc_cluster_client, project, region, cluster_name, bucket_name,
                       filename):
    """Submit the PySpark job to the cluster (assumes `filename` was uploaded
    to `bucket_name`)."""
    job_details = {
        'placement': {
            'cluster_name': cluster_name
        },
        'pyspark_job': {
            'main_python_file_uri': 'gs://{}/{}'.format(bucket_name, filename),
            'jar_file_uris': ['gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar']
        }
    }
    result = dataproc_cluster_client.submit_job(
        project_id=project, region=region, job=job_details)
    job_id = result.reference.job_id
    print('Submitted job ID {}.'.format(job_id))
    return job_id
The URI needs to be passed as a list (array) rather than a string. That solved the problem.
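Because jar_file_uris is a list, the same field also accepts several jars at once. A minimal, offline sketch of building just the job payload (the cluster and bucket names below are hypothetical, for illustration only):

```python
def build_job_details(cluster_name, bucket_name, filename, jar_uris):
    # jar_uris must be an iterable of gs:// URI strings; it is coerced to a
    # list here so a single-jar caller can't accidentally pass a bare string.
    return {
        'placement': {'cluster_name': cluster_name},
        'pyspark_job': {
            'main_python_file_uri': 'gs://{}/{}'.format(bucket_name, filename),
            'jar_file_uris': list(jar_uris),
        },
    }

job_details = build_job_details(
    'my-cluster',   # hypothetical cluster name
    'my-bucket',    # hypothetical bucket name
    'job.py',
    ['gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar'],
)
print(job_details['pyspark_job']['jar_file_uris'])
```

The resulting dict can then be handed to submit_job exactly as in the fixed function above; adding a second connector jar is just another element in the list.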