BigQuery TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'

Environment details: OS type and version: 1.5.29-debian10; Python version: 3.7; google-cloud-bigquery version: 2.8.0

Tags: python, pandas, google-bigquery

I am setting up a Dataproc cluster that fetches data from BigQuery into a dataframe. As the data grows, I want to improve performance, and I have heard about using the BigQuery Storage client.
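
For context, with google-cloud-bigquery 2.x the Storage-API-backed download is opt-in per call; a minimal sketch of what that looks like (trivial query, just for illustration):

    from google.cloud import bigquery

    # create_bqstorage_client=True routes the result download through the
    # BigQuery Storage API instead of the slower paginated REST endpoint
    df = bigquery.Client().query("SELECT 1 AS x").to_dataframe(create_bqstorage_client=True)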

I ran into the same problem in the past and solved it back then by pinning google-cloud-bigquery to version 1.26.1. If I use that version, I get the following message:

/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/client.py:407: UserWarning: Cannot create BigQuery Storage client, the dependency google-cloud-bigquery-storage is not installed.
 "Cannot create BigQuery Storage client, the dependency " 
and the snippet executes more slowly. If I do not pin the pip version, I get the error shown further below.
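
The warning text itself names the missing package, so presumably (my assumption, not verified on Dataproc) installing it next to the pinned client would restore the fast download path:

    pip install google-cloud-bigquery==1.26.1 google-cloud-bigquery-storage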

Steps to reproduce:
  • Create a cluster on Dataproc (the gcloud command used is shown below)
  • Execute the following script on the cluster (script and full tracebacks below)
  • 2021-02-11 09:21:19532-预处理记录器已初始化
    2021-02-11 09:21:19532-参数=[文件,arg1,arg2,arg3,arg4,项目id,arg5,arg6]
    起动
    
    下载:100%|██████████| 3107858/3107858[00:14Dataproc默认安装pyarrow 0.15.0,而bigquery存储api需要更新版本。安装时手动将pyarrow设置为3.0.0解决了此问题。 也就是说,PySpark具有Pyarrow>=0.15.0的兼容性设置
    我已经查看了dataproc的发行说明,这个env变量自2020年5月起被设置为默认值。
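
To check whether a given cluster is affected, it is enough to print the installed PyArrow version (and, while at it, that Spark compatibility variable); a minimal sketch:

    import os
    import pyarrow

    # timestamp_as_object was only added to to_pandas() in pyarrow 1.0,
    # so Dataproc's default pyarrow 0.15.0 rejects it with a TypeError
    print("pyarrow version:", pyarrow.__version__)
    print("ARROW_PRE_0_15_IPC_FORMAT =", os.environ.get("ARROW_PRE_0_15_IPC_FORMAT"))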

@Sam answered the question, but I thought I would just mention the actionable command:

In a Jupyter notebook:

    !pip install pyarrow==3.0.0

In your virtual environment:


    pip install pyarrow==3.0.0

Can confirm,

    !pip install pyarrow==3.0.0

solved it for me.
The cluster creation command:

    gcloud dataproc clusters create testing-cluster  --region=europe-west1  --zone=europe-west1-b  --master-machine-type n1-standard-16  --single-node  --image-version 1.5-debian10  --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh  --metadata 'PIP_PACKAGES=elasticsearch google-cloud-bigquery google-cloud-bigquery-storage pandas pandas_gbq'
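
Applying the accepted answer to this command only means pinning pyarrow inside PIP_PACKAGES; a sketch with everything else unchanged (I have not re-tested this exact line):

    gcloud dataproc clusters create testing-cluster  --region=europe-west1  --zone=europe-west1-b  --master-machine-type n1-standard-16  --single-node  --image-version 1.5-debian10  --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh  --metadata 'PIP_PACKAGES=elasticsearch google-cloud-bigquery google-cloud-bigquery-storage pandas pandas_gbq pyarrow==3.0.0'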
    
The relevant part of the script:

    from google.cloud import bigquery  # import assumed at the top of the script

    # Parameterized query; the result download uses the BigQuery Storage API
    bqclient = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("query_start", "STRING", "2021-02-09 00:00:00"),
            bigquery.ScalarQueryParameter("query_end", "STRING", "2021-02-09 23:59:59.99"),
        ]
    )
    df = bqclient.query(query, job_config=job_config).to_dataframe(create_bqstorage_client=True)
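
The query itself is not included in the question; with the named parameters above it would look roughly like this (table and column names are invented for illustration):

    query = """
        SELECT *
        FROM `my-project.my_dataset.pageviews`  -- hypothetical table
        WHERE created_at BETWEEN TIMESTAMP(@query_start) AND TIMESTAMP(@query_end)
    """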
    
    2021-02-11 10:10:14,069 - preprocessing logger initialized
    2021-02-11 10:10:14,069 - arguments = [file, arg1, arg2, arg3, arg4, project_id, arg5, arg6]
    Traceback (most recent call last):
      File "/tmp/782503bcc80246258560a07d2179891f/immo_preprocessing-pageviews_kyero.py", line 104, in <module>
        df = bqclient.query(base_query, job_config=job_config).to_dataframe(create_bqstorage_client=True)
      File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/job/query.py", line 1333, in to_dataframe
        date_as_object=date_as_object,
      File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/table.py", line 1793, in to_dataframe
        df = record_batch.to_pandas(date_as_object=date_as_object, **extra_kwargs)
      File "pyarrow/array.pxi", line 414, in pyarrow.lib._PandasConvertible.to_pandas
    TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'
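
The last frame shows the failure happens inside PyArrow rather than in the BigQuery client: the client forwards timestamp_as_object to RecordBatch.to_pandas(), a keyword PyArrow only accepts from 1.0 on. The same TypeError can be reproduced without BigQuery at all (a sketch, assuming pyarrow 0.15.x is installed):

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["x"])
    # pyarrow 0.15.0 does not know this keyword; pyarrow >= 1.0 accepts it
    batch.to_pandas(timestamp_as_object=True)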
    
The same query through pandas_gbq fails identically:

    import pandas as pd  # import assumed at the top of the script

    # Equivalent parameterized query via pandas-gbq, again downloading the
    # result through the BigQuery Storage API
    query_config = {
        'query': {
            'parameterMode': 'NAMED',
            'queryParameters': [
                {
                    'name': 'query_start',
                    'parameterType': {'type': 'STRING'},
                    'parameterValue': {'value': '2021-02-09 00:00:00'}
                },
                {
                    'name': 'query_end',
                    'parameterType': {'type': 'STRING'},
                    'parameterValue': {'value': '2021-02-09 23:59:59.99'}
                },
            ]
        }
    }
    df = pd.read_gbq(base_query, configuration=query_config, progress_bar_type='tqdm',
                     use_bqstorage_api=True)
    
    2021-02-11 09:21:19,532 - preprocessing logger initialized
    2021-02-11 09:21:19,532 - arguments = [file, arg1, arg2, arg3, arg4, project_id, arg5, arg6]
    started
    Downloading: 100%|██████████| 3107858/3107858 [00:14<00:00, 207656.33rows/s]
    Traceback (most recent call last):
      File "/tmp/1830d5bcf198440e9e030c8e42a1b870/immo_preprocessing-pageviews.py", line 98, in <module>
        use_bqstorage_api=True)
      File "/opt/conda/default/lib/python3.7/site-packages/pandas/io/gbq.py", line 193, in read_gbq
        **kwargs,
      File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 977, in read_gbq
        dtypes=dtypes,
      File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 536, in run_query
        user_dtypes=dtypes,
      File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 590, in _download_results
        **to_dataframe_kwargs
      File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/table.py", line 1793, in to_dataframe
        df = record_batch.to_pandas(date_as_object=date_as_object, **extra_kwargs)
      File "pyarrow/array.pxi", line 414, in pyarrow.lib._PandasConvertible.to_pandas
    TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'