Google Cloud Platform: Problem trying to read a BigQuery table in a Dataproc workflow using PySpark

I am trying to automate a process using GCP + Dataproc + PySpark. To do so, I created the following script:

from pyspark.sql import SparkSession

# Placeholders from the original post -- substitute the real identifiers
data_project = 'project_name'                # GCP project that owns the BigQuery table
data_pop_table = 'dataset_name.table_name'   # dataset.table to read

spark = SparkSession\
    .builder\
    .master('local[*]')\
    .appName('workflow_segmentation')\
    .config('spark.local.dir', '/dev/spark')\
    .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.2")\
    .getOrCreate()

data = spark.read\
    .format('com.google.cloud.spark.bigquery')\
    .option("project", data_project)\
    .option("table", data_pop_table)\
    .load()
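A quick way to confirm the read actually works is to inspect the DataFrame returned above (a minimal sanity-check sketch using the data variable from the script; output goes to the driver logs):

data.printSchema()              # columns and types pulled from the BigQuery table
print(data.count())             # forces an actual read through the connector
data.show(5, truncate=False)    # preview a few rows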
This script is used by a Dataproc workflow created with the following bash script:

#Creating the job
gcloud dataproc workflow-templates create dataproc_job_name \
    --region=us-central1

#Setting up the job (selecting Python version & the source code to run)
gcloud dataproc workflow-templates add-job pyspark file:///root/folder/main.py \
    --workflow-template=dataproc_job_name \
    --step-id=id_1 \
    --region=us-central1

#Setting up the VM
gcloud dataproc workflow-templates set-managed-cluster dataproc_job_name \
    --cluster-name=automatic-dataproc-job \
    --single-node \
    --master-machine-type=n1-standard-32 \
    --image-version=1.4 \
    --region=us-central1 \
    --scopes cloud-platform \
    --metadata='PIP_PACKAGES=pandas numpy matplotlib google-cloud-storage' \
    --initialization-actions=gs://datastudio_ds/automations-prod/config_files/pip_install.sh
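The template only runs once it is instantiated (with gcloud dataproc workflow-templates instantiate, or programmatically). A minimal sketch of instantiating it from Python with the google-cloud-dataproc client library; the project ID below is a placeholder, not a value from the original post:

# Sketch: instantiate the workflow template from Python (project ID is a placeholder)
from google.cloud import dataproc_v1

project_id = 'my-project'            # placeholder: the GCP project hosting the template
region = 'us-central1'
template_id = 'dataproc_job_name'    # template created by the script above

client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'}
)

name = f'projects/{project_id}/regions/{region}/workflowTemplates/{template_id}'
operation = client.instantiate_workflow_template(request={'name': name})
operation.result()   # waits for cluster creation, the PySpark step, and cluster teardown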
However, when I run the Dataproc job, I get the following error:

Traceback (most recent call last):
  File "/root/folder/main.py", line 16, in <module>
    fill_as_preprocessing=True)
  File "/root/folder/main.py", line 760, in data_adecuation
    .option("table",self.data_pop_table)\
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o643.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.google.cloud.spark.bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
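To see which jars and packages the Spark session actually picked up, the jar-related properties can be printed from the running session (a diagnostic sketch using the spark variable from the script above):

# Diagnostic sketch: print the jar-related properties of the running session
conf = spark.sparkContext.getConf()
print(conf.get('spark.jars', '<not set>'))
print(conf.get('spark.jars.packages', '<not set>'))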

I don't know why this error occurs. In fact, I ran the same script directly on a Dataproc cluster and it worked fine. If anyone has run into this problem before, or knows how to fix it, I would be very grateful.

This can be solved by setting the --jars flag in the add-job command. The --jars flag must point to the .jar file that contains the Spark BigQuery connector. The corrected bash script for creating the Dataproc job is:

#Creating the job
gcloud dataproc workflow-templates create dataproc_job_name \
    --region=us-central1

#Setting up the job (selecting Python version & the source code to run)
gcloud dataproc workflow-templates add-job pyspark file:///root/folder/main.py \
    --workflow-template=dataproc_job_name \
    --step-id=id_1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar \
    --region=us-central1

#Setting up the VM
gcloud dataproc workflow-templates set-managed-cluster dataproc_job_name \
    --cluster-name=automatic-dataproc-job \
    --single-node \
    --master-machine-type=n1-standard-32 \
    --image-version=1.4 \
    --region=us-central1 \
    --scopes cloud-platform \
    --metadata='PIP_PACKAGES=pandas numpy matplotlib google-cloud-storage' \
    --initialization-actions=gs://datastudio_ds/automations-prod/config_files/pip_install.sh
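With the connector jar supplied through --jars, the spark.jars.packages line in the PySpark script is no longer needed, and the data source can also be referenced by the connector's short format name. A minimal sketch of the simplified read (the table identifier is a placeholder):

# Sketch: simplified read once the connector jar is supplied via --jars
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('workflow_segmentation') \
    .getOrCreate()

data = spark.read \
    .format('bigquery') \
    .option('table', 'project_name.dataset_name.table_name') \
    .load()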
Please explain your problem more clearly in your question title. "Problem with" followed by a list of tags is not useful or descriptive in any way. Your title should be clear enough that future site users scanning a list of search results for a solution to their own problem can tell whether it applies to them, and your current title does not contain any detail that would help with that. Thanks.