Python: fastest way to load data into a pd.DataFrame using different formats (csv, json, avro)

python pandas google-bigquery

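We are loading large amounts of data from google bigquery into a pandas dataframe (to be consumed directly as pandas, and also as an xgbMatrix).

The BQ export formats are CSV, JSON and AVRO. Our data has dates, integers, floats and strings, and is typically "wide" (many columns). Our first approach was to import the data as CSV, but the parsing time is extremely long:

(32 GB, 126 files, CSV) -> 25 minutes

Parsing code:
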
import pandas as pd
import psutil
from google.cloud import bigquery

# Map BigQuery field types to pandas dtypes. Integers are read as float64,
# as in the original mapping; float64 can represent missing values as NaN.
PD_DTYPES = {'string': 'object',
             'date': 'object',
             'float': 'float64',
             'integer': 'float64'}

def load_table_files_to_pandas(all_files, table_ref):
    # look up the table schema to build the dtype map and the date columns
    client = bigquery.Client()  # create a bq client
    table = client.get_table(table_ref)

    dict_dtype = {}
    date_cols = []
    for field in table.schema:
        field_type = field.field_type.lower()
        dict_dtype[field.name] = PD_DTYPES[field_type]
        if field_type == 'date':
            date_cols.append(field.name)

    print('start reading data')
    df_from_each_file = []
    for f in all_files:
        # parse one exported file at a time
        df_from_each_file.append(pd.read_csv(f,
                                             dtype=dict_dtype,
                                             parse_dates=date_cols))
        print('memory in use = {}'.format(psutil.virtual_memory().percent))

    # concatenate the per-file frames into a single DataFrame
    df = pd.concat(df_from_each_file, ignore_index=True)
    print('end reading data')
    return df

Which format parses faster in pandas: Avro, CSV or JSON? Is there perhaps another option that we have not considered?

Additional: we also experimented with dask | csv, reading directly from storage and from local disk, but the parsing time was almost the same.
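One way to get a first answer for a particular schema is to time the parsers on a small sample before committing to a full export. The sketch below is only an illustration: the column names and `make_sample` are made up, the sample is synthetic, and it only covers CSV and newline-delimited JSON, since pandas has no built-in Avro reader (a separate library would be needed for that).

```python
import io
import time

import numpy as np
import pandas as pd


def make_sample(n_rows=1000, seed=0):
    """Build a synthetic frame with date, integer, float and string columns."""
    rng = np.random.default_rng(seed)
    days = pd.date_range("2018-01-01", periods=365, freq="D").strftime("%Y-%m-%d")
    return pd.DataFrame({
        "day": rng.choice(days.to_numpy(), n_rows),
        "count": rng.integers(0, 100, n_rows),
        "score": rng.random(n_rows),
        "label": rng.choice(["a", "b", "c"], n_rows),
    })


def time_parsers(sample):
    """Serialize the sample to CSV and JSON lines, then time each parser."""
    csv_text = sample.to_csv(index=False)
    json_text = sample.to_json(orient="records", lines=True)

    timings, frames = {}, {}

    start = time.perf_counter()
    frames["csv"] = pd.read_csv(io.StringIO(csv_text), parse_dates=["day"])
    timings["csv"] = time.perf_counter() - start

    start = time.perf_counter()
    frames["json"] = pd.read_json(io.StringIO(json_text), lines=True,
                                  convert_dates=["day"])
    timings["json"] = time.perf_counter() - start

    return timings, frames


if __name__ == "__main__":
    timings, _ = time_parsers(make_sample(100_000))
    for fmt, seconds in timings.items():
        print("{}: {:.3f} s".format(fmt, seconds))
```

Timings on synthetic data will not match a wide 32 GB table, so it is probably worth repeating the measurement on one real exported file per format.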

Use the pandas functionality that is designed specifically for google bigquery.

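Instead, you may want to export it in chunks and then build a dask pipeline that parses the chunks and loads them in parallel (and in a larger-than-RAM fashion).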

According to the linked source, BigQuery can export data in chunks, and you can request as many chunks as you want.
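A chunked export could look roughly like the sketch below, assuming that "chunks" refers to BigQuery's wildcard destination URIs, where a single `*` in the URI makes the extract job write several sharded files. The bucket name, prefix and both helper names are hypothetical, and the job needs valid credentials to run.

```python
def sharded_uri(bucket, prefix, extension):
    """Build a wildcard Cloud Storage URI; BigQuery replaces '*' with a shard number."""
    return "gs://{}/{}-*.{}".format(bucket, prefix, extension)


def export_table_sharded(table_ref, destination_uri, destination_format="CSV"):
    """Start an extract job that writes the table as sharded files and wait for it.

    destination_format can be "CSV", "NEWLINE_DELIMITED_JSON" or "AVRO".
    """
    # Imported here so that sharded_uri can be used without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.ExtractJobConfig()
    job_config.destination_format = destination_format

    extract_job = client.extract_table(table_ref, destination_uri,
                                       job_config=job_config)
    extract_job.result()  # block until the export has finished
    return extract_job


# Hypothetical usage; "my-bucket" and "exports/my_table" are placeholders:
# export_table_sharded(table_ref, sharded_uri("my-bucket", "exports/my_table", "csv"))
```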

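If the data has no nested or repeated fields (take note of this, since CSV export does not support them), you can export to csv and use dask's method for it to make your life easier.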

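When handling files this large, I would use Spark with the Parquet format. That way you can scale out both your reading and your computation. Pandas was not designed for files this large.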

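Would you mind writing a few lines of code to show how to do the parsing? Regarding the file format, I am very happy with .parquet. You can read parquet with the latest version or with dask. You should keep in mind Wes McKinney's words.

@user32185 added ;)

@user32185 Actually, parquet files larger than 2 GB currently have a lot of problems in Python, regardless of whether the library used is pyarrow or fastparquet. You will run into some limitations: not every file created with spark can be read with any of the currently available libs.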