Python pandas: merging parquet files with different column data types - writing parquet with a predefined schema?
I need to export a very large DB table to S3. I do this by parallelizing pandas read_sql (with a process pool), using my table's primary key id to generate a range for each worker to select. This makes the export very fast:
process 1: id between 1 and 9 -> 1.pq
process 2: id between 10 and 19 -> 2.pq
process 3: id between 20 and 29 -> 3.pq
Each worker process writes its resulting dataframe as parquet to the same folder.
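The per-worker id ranges above can be generated with a small helper; a minimal sketch (the chunk size and key bounds are illustrative, and the SQL in the comment is hypothetical):

```python
def id_ranges(min_id, max_id, chunk_size):
    """Split the inclusive key space [min_id, max_id] into per-worker ranges."""
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        # each worker would run: SELECT ... WHERE id BETWEEN start AND end
        yield (start, end)
        start = end + 1

# Three workers covering ids 1..29, at most ten ids each
print(list(id_ranges(1, 29, 10)))  # -> [(1, 10), (11, 20), (21, 29)]
```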
The problem lies in my data: some of my columns are not always populated (e.g. Deleted? null vs 1), so some of my parquet files end up with the Deleted column's data type set to null, while others have it set to Int64.
When I try to read the dataset back with pyarrow, fastparquet, or pyspark, I get various errors about the schema.
I looked into Arrow tables, but so far I have only found a way to define a schema for validation, not for writing.
To reproduce:
import pandas as pd
import pyarrow.parquet as pq

data = pd.DataFrame([[1, None], [1, None]])
data2 = pd.DataFrame([[1, 1], [1, 1]])

data.columns = data.columns.astype(str)   # Parquet requires string column names
data2.columns = data2.columns.astype(str)

data.to_parquet('./outputs/1.pq')
data2.to_parquet('./outputs/2.pq')

pq.ParquetDataset('./outputs')
I expected it to infer that my column "1" is int, but instead there is a conflict. I tried disabling schema validation, but that only hides the problem until I actually process the data:
ValueError: Schema in ../outputs/2.pq was different.
0: int64
1: int64
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 2, "step": 1}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "0", "f'
b'ield_name": "0", "pandas_type": "int64", "numpy_type": "int64", '
b'"metadata": null}, {"name": "1", "field_name": "1", "pandas_type'
b'": "int64", "numpy_type": "int64", "metadata": null}], "creator"'
b': {"library": "pyarrow", "version": "0.14.0"}, "pandas_version":'
b' "0.24.2"}'}
vs
0: int64
1: null
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 2, "step": 1}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "0", "f'
b'ield_name": "0", "pandas_type": "int64", "numpy_type": "int64", '
b'"metadata": null}, {"name": "1", "field_name": "1", "pandas_type'
b'": "empty", "numpy_type": "object", "metadata": null}], "creator'
b'": {"library": "pyarrow", "version": "0.14.0"}, "pandas_version"'
b': "0.24.2"}'}
You can create your own custom pyarrow schema and cast each pyarrow table to it:
import pyarrow as pa
import pyarrow.parquet as pq


def merge_small_parquet_files(small_files, result_file):
    pqwriter = None
    pyarrow_schema = get_pyarrow_schema()
    for small_file in small_files:
        table = pq.read_table(small_file)
        if not pqwriter:
            pqwriter = pq.ParquetWriter(result_file,
                                        schema=pyarrow_schema,
                                        compression='GZIP',
                                        coerce_timestamps='ms',
                                        allow_truncated_timestamps=True)
        # Cast every table to the common schema before writing
        table = table.cast(pyarrow_schema)
        pqwriter.write_table(table)
        del table
    if pqwriter:
        pqwriter.close()


def get_pyarrow_schema():
    fields = []
    fields.append(pa.field('first_name', pa.string()))
    fields.append(pa.field('last_name', pa.string()))
    fields.append(pa.field('Id', pa.float64()))
    fields.append(pa.field('Salary', pa.float64()))
    fields.append(pa.field('Time', pa.timestamp('ms')))
    return pa.schema(fields)


if __name__ == '__main__':
    small_files = ['file1.parquet', 'file2.parquet', 'file3.parquet', 'file4.parquet']
    result_file = 'large.parquet'
    merge_small_parquet_files(small_files, result_file)