Python pandas: inconsistent dtypes when reading JSON in chunks


TL;DR
How can I force pd.read_json to use specific dtypes when reading data in chunks?
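
For context, pd.read_json does accept a dtype argument mapping column names to dtypes; the sketch below, with purely hypothetical column names, shows what forcing the dtypes would look like, though as the answer further down notes, it did not behave as expected for chunked, line-delimited input here.

import pandas as pd

# Hypothetical dtype mapping; the real one depends on the schema of data.json.
_column_types = {'count': 'int64', 'active': 'bool', 'name': 'object'}

# read_json takes a dtype mapping, but whether it is honored consistently for
# every chunk of line-delimited input is exactly what this question is about.
_chunks = pd.read_json('data.json', lines=True, chunksize=5000,
                       dtype=_column_types)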

Background
I need to read a large dataset, currently stored as line-delimited JSON, with roughly 3 million rows. I am trying to split it into small Parquet files so that I can stream the full dataset with dask.

My basic idea is:

import pandas as pd

_chunks = pd.read_json('data.json', lines=True, chunksize=5000)
i = 0
for c in _chunks:
    c.to_parquet('parquet/data.%s.pqt' % i)
    i = i + 1

from dask import dataframe

ddf = dataframe.read_parquet('parquet/*', index='_id')
ddf.compute()
But because of some inconsistencies in the dtypes, I get errors for some partitions only:

>>> ddf.get_partition(8).compute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/base.py", line 135, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/base.py", line 333, in compute
    results = get(dsk, keys, **kwargs)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/compatibility.py", line 67, in reraise
    raise exc
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/local.py", line 290, in execute_task
    result = _execute_task(task, data)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/local.py", line 271, in _execute_task
    return func(*args2)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/dataframe/io/parquet.py", line 335, in _read_parquet_row_group
    open=open, assign=views, scheme=scheme)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/fastparquet/core.py", line 284, in read_row_group_file
    scheme=scheme)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/fastparquet/core.py", line 334, in read_row_group
    cats, selfmade, assign=assign)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/fastparquet/core.py", line 311, in read_row_group_arrays
    catdef=out[name+'-catdef'] if use else None)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/fastparquet/core.py", line 266, in read_col
    piece[:] = dic[val]
ValueError: invalid literal for int() with base 10: ''
The problem is that when I inspect the chunks, their dtypes are not all the same:

for c in _chunks:
    c.dtypes
    # some columns print as bool, int64 or object depending on the chunk
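
One way to see exactly which columns drift is to collect the dtypes of every chunk and compare them. A minimal sketch, re-reading the file with the same parameters as the question (column handling is illustrative):

import pandas as pd

chunks = pd.read_json('data.json', lines=True, chunksize=5000)

seen = {}  # column name -> set of dtypes observed across chunks
for c in chunks:
    for col, dt in c.dtypes.items():
        seen.setdefault(col, set()).add(str(dt))

# Columns whose dtype is not stable across chunks
unstable = {col: dts for col, dts in seen.items() if len(dts) > 1}
print(unstable)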

It seems to me that the simplest thing to do here is to enforce the dtypes before writing. Since this does not seem to work properly in the
read_json
function, you can apply it like this:

i = 0
for c in _chunks:
    # _column_types is a dict mapping column names to the desired dtypes
    c.astype(_column_types).to_parquet('parquet/data.%s.pqt' % i)
    i = i + 1
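
For completeness, _column_types here would just be a plain dict of column names to target dtypes, something like the following (the column names are purely hypothetical, the real schema of data.json decides):

_column_types = {
    '_id': 'object',    # hypothetical columns; use the real schema of data.json
    'count': 'int64',
    'active': 'bool',
}

If some chunks contain empty strings in columns that should be numeric, as the traceback suggests, those values may need to be cleaned up or cast to a nullable dtype before astype will succeed.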

Note that I would consider 5,000 records per Parquet file too small to take good advantage of the format; typical sizes for each component Parquet file are usually well above 10 MB.
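
If the chunked approach is kept, one option is simply to raise chunksize so each Parquet file ends up comfortably above that size. A sketch, assuming a larger chunk still fits in memory and with a placeholder dtype mapping:

import pandas as pd

_column_types = {'count': 'int64'}  # hypothetical; use the real dtype mapping

# Larger chunks mean fewer, bigger Parquet files; 100_000 is an arbitrary example.
_chunks = pd.read_json('data.json', lines=True, chunksize=100_000)
for i, c in enumerate(_chunks):
    c.astype(_column_types).to_parquet('parquet/data.%s.pqt' % i)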

Seems like a problem in data.json itself.
thx @RomainJouin, data.json was dumped from ElasticSearch, so I would guess it is fairly standard...