Python 3.x "pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data" when sending data to BigQuery without a schema


I am writing a script that sends a DataFrame to BigQuery:

load_job = bq_client.load_table_from_dataframe(
    df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE])
)

# Wait for the load job to complete
return load_job.result() 
This works fine, but only if the schema is already defined in BigQuery, or if I define the job's schema in my script. If no schema is defined, I get the following error:

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1661, in load_table_from_dataframe
    dataframe.to_parquet(tmppath, compression=parquet_compression)
  File "/env/local/lib/python3.7/site-packages/pandas/core/frame.py", line 2237, in to_parquet
    **kwargs
  File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 254, in to_parquet
    **kwargs
  File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 117, in write
    **kwargs
  File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 1270, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 426, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 1311, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1578661876547574000

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 151, in main
    df = df(param1, param2)
  File "/user_code/main.py", line 141, in get_df
    df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE])
  File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1677, in load_table_from_dataframe
    os.remove(tmppath)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp_ps5xji9_job_634ff274.parquet'

Why does pyarrow raise this error, and how can I work around it other than defining the schema in advance?

The default behavior when converting from pandas to Arrow or Parquet is to not allow silent data loss. There are options that can be set when performing the conversion to allow unsafe casts, which permit a loss of timestamp precision or other forms of data loss. The BigQuery Python API would need to set these options, so it may be a bug in the BigQuery library; I would suggest reporting it on their issue tracker.

I think these errors arise because the pyarrow.parquet module used by the BigQuery library converts Python's built-in datetime or time types into types BigQuery recognizes by default, but the BigQuery library has its own method for converting pandas types.

I was able to get it to upload timestamps by changing all instances of datetime.datetime or time.time to pandas.Timestamp. For example:

my_df['timestamp'] = datetime.utcnow()

would need to be changed to:

my_df['timestamp'] = pd.Timestamp.now()
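Another workaround along the same lines (a sketch with a made-up DataFrame, not from the original post) is to truncate the column to microsecond precision before calling load_table_from_dataframe, so the Parquet writer's downcast loses nothing:

```python
import pandas as pd

# Hypothetical DataFrame; pd.Timestamp.now() carries nanosecond precision
my_df = pd.DataFrame({"timestamp": [pd.Timestamp.now()]})

# Drop the sub-microsecond part so pyarrow's safe cast to the
# microsecond precision used by Parquet/BigQuery cannot lose data
my_df["timestamp"] = my_df["timestamp"].dt.floor("us")
```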

I also got a similar error when trying a query's to_dataframe from BigQuery, and documentation on this is scarce: ArrowInvalid: Casting from timestamp[us, tz=UTC] to timestamp[ns] would result in out of bounds timestamp: -6153548800000
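That out-of-bounds variant is the mirror image of the question's error: pandas timestamps are nanosecond-based, so to_dataframe must upcast BigQuery's microsecond values to timestamp[ns], and values outside roughly the years 1677 to 2262 cannot be represented. A quick check of the bounds (nothing here is specific to BigQuery):

```python
import pandas as pd

# The representable range of a nanosecond-resolution pandas Timestamp
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```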