Python BigQuery ASCII 0 error even though the data contains only ASCII characters

Tags: python, pandas, google-bigquery, ascii

I am trying to load a pandas dataframe into a BigQuery table, but I get an ASCII 0 error. I know for a fact that the text contains only ASCII characters, and the problem is confined to a single column (it is not clear which row). I have tried the following:

  • Loading the dataframe from the API call straight into the BQ table results in the ASCII 0 error
  • Saving the dataframe (from the API call) as a csv in Google Cloud Storage and manually uploading it to the BQ table results in the ASCII 0 error
  • Downloading the csv from Google Cloud Storage to my local drive and manually uploading it to the BQ table results in the ASCII 0 error
  • However, if I take the file from step 3, re-save it, and then manually upload it to the BQ table, it uploads without any problem
  • I am not sure why step 4 works when the others do not. The dataframe only contains two columns, so there is no chance of some other column containing non-ASCII characters. I have narrowed it down to the offending column (a quick way to inspect it is sketched right after this list)
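One way to confirm what the offending column actually contains is to scan it for non-printable characters before calling to_gbq. A minimal sketch, reusing the res_df / 'clean' names from the snippet at the end of this post (note that a NUL byte, ASCII 0, is flagged even though it is technically an ASCII character):

    import pandas as pd

    # Small example frame; in the real code this would be res_df with its
    # 'clean' column (names taken from the snippet at the end of this post).
    res_df = pd.DataFrame({'clean': ['plain ascii', 'has a nul\x00byte', 'fine']})

    # Flag rows containing a NUL byte or anything else outside the
    # printable ASCII range (0x20-0x7E).
    mask = res_df['clean'].astype(str).str.contains(r'[^\x20-\x7e]', regex=True, na=False)
    for idx, value in res_df.loc[mask, 'clean'].items():
        print(idx, repr(value))  # repr() makes a NUL byte visible as '\x00'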

    Error:

    0it [00:00, ?it/s]Traceback (most recent call last):
      File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 629, in load_data
        for remaining_rows in chunks:
      File "/opt/conda/default/lib/python3.7/site-packages/tqdm/std.py", line 1081, in __iter__
        for obj in iterable:
      File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/load.py", line 82, in load_chunks
        location=location,
      File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/job.py", line 734, in result
        return super(_AsyncJob, self).result(timeout=timeout)
      File "/opt/conda/default/lib/python3.7/site-packages/google/api_core/future/polling.py", line 134, in result
        raise self._exception
    google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Error detected while parsing row starting at position: 39540. Error: Bad character (ASCII 0) encountered.
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/tmp/document_download_20201223_dd87eec0/wd_download.py", line 268, in <module>
        obj.clean_data()
      File "/tmp/document_download_20201223_dd87eec0/wd_download.py", line 248, in clean_data
        self.res_df.to_gbq(datatable,project_id,if_exists='replace')
      File "/opt/conda/default/lib/python3.7/site-packages/pandas/core/frame.py", line 1657, in to_gbq
        credentials=credentials,
      File "/opt/conda/default/lib/python3.7/site-packages/pandas/io/gbq.py", line 228, in to_gbq
        private_key=private_key,
      File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 1210, in to_gbq
        progress_bar=progress_bar,
      File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 636, in load_data
        self.process_http_error(ex)
      File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 435, in process_http_error
        raise GenericGBQException("Reason: {0}".format(ex))
    pandas_gbq.gbq.GenericGBQException: Reason: 400 Error while reading data, error message: Error detected while parsing row starting at position: 39540. Error: Bad character (ASCII 0) encountered.
    
    0it [00:02, ?it/s]
    Job output is complete 
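The error message reports the byte offset at which parsing failed (position 39540 here), so one way to look at the raw bytes around that point in the exported csv is to seek to it directly. A minimal sketch (the file name is a placeholder):

    # Read a window of raw bytes around the offset reported by BigQuery.
    offset = 39540  # position quoted in the error message above
    with open('exported.csv', 'rb') as f:  # placeholder file name
        f.seek(max(0, offset - 40))
        chunk = f.read(120)
    print(chunk)  # the bytes repr shows any NUL byte explicitly as \x00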
    

If you have two files - a correct one and a wrong one - then you can use some program to compare them. Or you can write a script that compares the byte values and displays them - that may show which value is causing the problem. Possibly when the wrong file was saved, a BOM (Byte Order Mark) was added, which uses the values

    \xff
    \xfe
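A minimal sketch of that byte-comparison idea (the file names are placeholders):

    # Compare the two files byte by byte and print the first differing
    # positions, so that a BOM or a NUL byte would show up clearly.
    def diff_bytes(path_a, path_b, limit=20):
        with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
            a, b = fa.read(), fb.read()
        shown = 0
        for i in range(max(len(a), len(b))):
            if a[i:i + 1] != b[i:i + 1]:
                print(f"offset {i}: {a[i:i + 1]!r} vs {b[i:i + 1]!r}")
                shown += 1
                if shown >= limit:
                    break

    diff_bytes('file_wrong.csv', 'file_correct.csv')  # placeholder file names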
Thanks, I tried this but the two files are exactly the same. I don't think the BOM is the problem, because the issue is also present when loading the in-memory DF into BQ (i.e. it is never saved to disk at any point).
    # Strip any characters outside the ASCII range from the offending column
    self.res_df['clean'] = self.res_df['clean'].apply(lambda x: re.sub(r'[^\x00-\x7f]', '', x))
    # Round-trip through UTF-8, dropping any bytes that fail to decode
    self.res_df['clean'] = self.res_df['clean'].apply(lambda x: bytes(x, 'utf-8').decode('utf-8', 'ignore'))
    # Load the cleaned dataframe into the BigQuery table
    self.res_df.to_gbq(datatable, project_id, if_exists='replace')
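Note that the character class in the first line above, [^\x00-\x7f], only matches characters outside the ASCII range, so it leaves \x00 itself in place (NUL is ASCII code point 0, which is exactly what BigQuery rejects). A hedged variant that strips NUL bytes as well, using a hypothetical dataframe in place of self.res_df, could look like this:

    import re
    import pandas as pd

    # Hypothetical dataframe standing in for self.res_df above.
    res_df = pd.DataFrame({'clean': ['ok text', 'bad\x00text', 'caf\u00e9']})

    # [^\x01-\x7f] excludes NUL (\x00) from the allowed range, so it is removed
    # together with any non-ASCII characters; the original pattern [^\x00-\x7f]
    # keeps NUL because \x00 falls inside the allowed range.
    res_df['clean'] = res_df['clean'].apply(
        lambda x: re.sub(r'[^\x01-\x7f]', '', x) if isinstance(x, str) else x
    )
    print(res_df['clean'].tolist())  # ['ok text', 'badtext', 'caf']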