Python: 400 "Error while reading data" when creating a table in Google BigQuery with pandas to_gbq


I am trying to query data from a MySQL server and write it to Google BigQuery using the pandas.to_gbq API:

import pandas as pd

def production_to_gbq(table_name_prod, prefix, table_name_gbq, dataset, project):
    # Extract data from Production (con is an existing MySQL connection)
    q = """
        SELECT *
        FROM
            {}
        """.format(table_name_prod)

    df = pd.read_sql(q, con)

    # Write to gbq
    df.to_gbq(dataset + table_name_gbq, project, chunksize=1000, verbose=True,
              reauth=False, if_exists='replace', private_key=None)

    return df
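For reference, pandas-gbq expects destination_table in the form "dataset.tablename", so dataset + table_name_gbq only resolves correctly when dataset already carries a trailing dot. A minimal sketch of a more explicit call, under that assumption:

    # Sketch: build the "dataset.tablename" destination explicitly instead of
    # relying on `dataset` carrying a trailing dot.
    destination = '{}.{}'.format(dataset.rstrip('.'), table_name_gbq)
    df.to_gbq(destination, project, chunksize=1000, if_exists='replace')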
I keep getting a 400 error indicating invalid input:

Load is 100.0% Complete
---------------------------------------------------------------------------
BadRequest                                Traceback (most recent call last)
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize, schema)
    569                     self.client, dataframe, dataset_id, table_id,
--> 570                     chunksize=chunksize):
    571                 self._print("\rLoad is {0}% Complete".format(

/usr/local/lib/python3.6/site-packages/pandas_gbq/_load.py in load_chunks(client, dataframe, dataset_id, table_id, chunksize, schema)
     73             destination_table,
---> 74             job_config=job_config).result()

/usr/local/lib/python3.6/site-packages/google/cloud/bigquery/job.py in result(self, timeout)
    527         # TODO: modify PollingFuture so it can pass a retry argument to done().
--> 528         return super(_AsyncJob, self).result(timeout=timeout)
    529 

/usr/local/lib/python3.6/site-packages/google/api_core/future/polling.py in result(self, timeout)
    110             # Pylint doesn't recognize that this is valid in this case.
--> 111             raise self._exception
    112 

BadRequest: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 10; errors: 1. Please look into the error stream for more details.

During handling of the above exception, another exception occurred:

GenericGBQException                       Traceback (most recent call last)
<ipython-input-73-ef9c7cec0104> in <module>()
----> 1 departments.to_gbq(dataset + table_name_gbq, project, chunksize=1000, verbose=True, reauth=False, if_exists='replace', private_key=None)
      2 

/usr/local/lib/python3.6/site-packages/pandas/core/frame.py in to_gbq(self, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
   1058         return gbq.to_gbq(self, destination_table, project_id=project_id,
   1059                           chunksize=chunksize, verbose=verbose, reauth=reauth,
-> 1060                           if_exists=if_exists, private_key=private_key)
   1061 
   1062     @classmethod

/usr/local/lib/python3.6/site-packages/pandas/io/gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
    107                       chunksize=chunksize,
    108                       verbose=verbose, reauth=reauth,
--> 109                       if_exists=if_exists, private_key=private_key)

/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key, auth_local_webserver, table_schema)
    980     connector.load_data(
    981         dataframe, dataset_id, table_id, chunksize=chunksize,
--> 982         schema=table_schema)
    983 
    984 

/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize, schema)
    572                     ((total_rows - remaining_rows) * 100) / total_rows))
    573         except self.http_error as ex:
--> 574             self.process_http_error(ex)
    575 
    576         self._print("\n")

/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in process_http_error(ex)
    453         # <https://cloud.google.com/bigquery/troubleshooting-errors>`__
    454 
--> 455         raise GenericGBQException("Reason: {0}".format(ex))
    456 
    457     def run_query(self, query, **kwargs):

GenericGBQException: Reason: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 10; errors: 1. Please look into the error stream for more details.
Here are the dtypes of the dataframe:

id                        int64
name                     object
description              object
created_at                int64
modified_at             float64
The table is created in GBQ but remains empty.

I read up a little on the pandas.to_gbq API and didn't find much, apart from this post, which seemed relevant but has no replies:

I did find one potential solution concerning numbers in object-dtype columns being passed into the GBQ table without quotes, to be fixed by setting the column dtype to string.

I tried that fix:

# Cast all object columns to string, filling missing values first
for col in df.columns:
    if df[col].dtypes == object:
        df[col] = df[col].fillna('')
        df[col] = df[col].astype(str)
Unfortunately I still get the same error. Likewise, formatting the missing data and setting the dtypes for the int and float columns produces the same error.
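One avenue not shown above: the pandas_gbq.to_gbq signature visible in the traceback also accepts a table_schema argument, so the schema can be pinned explicitly instead of inferred. A minimal sketch, assuming the dtypes listed above:

    import pandas_gbq

    # Sketch: pass an explicit BigQuery schema matching the dataframe dtypes
    # listed above, rather than relying on type inference during the load.
    schema = [
        {'name': 'id', 'type': 'INTEGER'},
        {'name': 'name', 'type': 'STRING'},
        {'name': 'description', 'type': 'STRING'},
        {'name': 'created_at', 'type': 'INTEGER'},
        {'name': 'modified_at', 'type': 'FLOAT'},
    ]
    pandas_gbq.to_gbq(df, dataset + table_name_gbq, project_id=project,
                      chunksize=1000, if_exists='replace', table_schema=schema)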


Is there something I'm missing?

Found that bigquery cannot handle \r (and sometimes \n) properly. Had the same problem, localized the issue, and I was really surprised when simply replacing \r with a space fixed it:

for col in list(df.columns):
    # Replace carriage returns with a space in every string cell
    df[col] = df[col].apply(lambda x: x.replace(u'\r', u' ') if isinstance(x, str) else x)
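A more compact variant (a sketch, not from the original answer) handles both \r and \n across all string cells in a single vectorized call:

    # Sketch: regex-replace CR and LF inside every string cell in one pass;
    # non-string values are left untouched.
    df = df.replace({u'\r': u' ', u'\n': u' '}, regex=True)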

I have run into similar problems several times when importing parquet files from cloud storage into bigquery. However, each time I had forgotten how I solved it, so I hope it's not too much of a breach of protocol to leave my findings here.

I realized that I had columns which were all null; in pandas they appear to have a dtype, but if you use pyarrow.parquet.read_schema(parquet_file), you will see that the dtype is null.


After removing those columns, the upload works fine.
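A minimal sketch of that check, assuming parquet_file points at the file and df is the frame about to be uploaded:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Columns whose parquet-level type is null hold no recoverable data.
    schema = pq.read_schema(parquet_file)
    null_cols = [field.name for field in schema if pa.types.is_null(field.type)]
    df = df.drop(columns=null_cols)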

I had some invalid characters in my string columns (object in pandas). I used the @Echochi approach and it worked well:

import re

for col in list(parsed_data.select_dtypes(include='object').columns):
    parsed_data[col] = parsed_data[col].apply(lambda x: re.sub('[^A-Za-z0-9]+', '', str(x)))
It was a bit restrictive in terms of accepted characters, so I moved to a more general approach, given that bigquery is compatible with UTF-8:

for col in list(parsed_data.select_dtypes(include='object').columns):
    parsed_data[col] = parsed_data[col].apply(lambda x: re.sub(r"[^\u0900-\u097F]+", '?', str(x)))

With r"[^\u0900-\u097F]+" you accept all UTF-8 compatible charsets.

I would suggest breaking the steps down to isolate the problem, e.g. saving the dataframe (or a sample of it) to csv and trying to import it with the UI, or trying to write some control dataframes with to_gbq.

There were problems on the unicode side, so I tried bytes, but unfortunately that fix did not work for me.

@Echochi it should be string, not bytes. I believe that if you have some unexpected symbols crashing the script, you need something like unicodedata.normalize.

@SirJ unfortunately, for some reason unicodedata.normalize did not work no matter which form I used. I ended up removing all the special characters, even though it mangles the text a little: df[col] = df[col].apply(lambda x: re.sub('[^A-Za-z0-9]+', '', x)), but that did the trick.

@Echochi if df[col] = df[col].apply(lambda x: re.sub('[^A-Za-z0-9]+', '', x)) works, then: 1. Look at the resulting bigquery table and how many items it holds. By default to_gbq uploads in chunks of 1000 items, so if the table holds only 20k out of 1M rows, the "bad data" sits somewhere between rows 20k and 21k. 2. Try applying the function to each column separately to determine which column contains the "bad" data. That way you can detect what kind of data is crashing the script.
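Since unicodedata.normalize is suggested in these comments without code, here is a minimal sketch (an assumption about what such a clean-up could look like, not from the thread) that keeps legitimate UTF-8 text and strips only control characters such as \r and \n:

    import unicodedata

    def clean_cell(x):
        # Normalize, then drop control characters (Unicode category "C..."),
        # which covers \r and \n while leaving readable text intact.
        if not isinstance(x, str):
            return x
        x = unicodedata.normalize('NFKC', x)
        return ''.join(ch for ch in x if unicodedata.category(ch)[0] != 'C')

    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].apply(clean_cell)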