Reading HDF5 files concurrently in Pandas

Tags: pandas, concurrency, pytables, h5py

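I have a data.h5 file organized into multiple chunks; the whole file is several hundred gigabytes. I need to work with a filtered subset of the file in memory, in the form of a Pandas DataFrame.

The goal of the following routine is to distribute the filtering work across several processes and then concatenate the filtered results into the final DataFrame.

Since reading from the file takes a significant amount of time, I am also trying to make each process read its own chunk concurrently.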

import multiprocessing as mp, pandas as pd

store = pd.HDFStore('data.h5')
min_dset, max_dset = 0, len(store.keys()) - 1
dset_list = list(range(min_dset, max_dset))

frames = []

def read_and_return_subset(dset):
    # each process is intended to read its own chunk in a concurrent manner
    chunk = store.select('batch_{:03}'.format(dset))

    # and then process the chunk, do the filtering, and return the result
    output = chunk[chunk.some_condition == True]
    return output


with mp.Pool(processes=32) as pool:
    for frame in pool.map(read_and_return_subset, dset_list):
        frames.append(frame)

df = pd.concat(frames)

However, the code above triggers this error:

HDF5ExtError                              Traceback (most recent call last)
<ipython-input-174-867671c5a58f> in <module>()
     53 
     54     with mp.Pool(processes=32) as pool:
---> 55         for frame in pool.map(read_and_return_subset, dset_list):
     56             frames.append(frame)
     57 

/usr/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    258         in a list that is returned.
    259         '''
--> 260         return self._map_async(func, iterable, mapstar, chunksize).get()
    261 
    262     def starmap(self, func, iterable, chunksize=None):

/usr/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
    606             return self._value
    607         else:
--> 608             raise self._value
    609 
    610     def _set(self, i, obj):

HDF5ExtError: HDF5 error back trace

  File "H5Dio.c", line 173, in H5Dread
    can't read data
  File "H5Dio.c", line 554, in H5D__read
    can't read data
  File "H5Dchunk.c", line 1856, in H5D__chunk_read
    error looking up chunk address
  File "H5Dchunk.c", line 2441, in H5D__chunk_lookup
    can't query chunk address
  File "H5Dbtree.c", line 998, in H5D__btree_idx_get_addr
    can't get chunk info
  File "H5B.c", line 340, in H5B_find
    unable to load B-tree node
  File "H5AC.c", line 1262, in H5AC_protect
    H5C_protect() failed.
  File "H5C.c", line 3574, in H5C_protect
    can't load entry
  File "H5C.c", line 7954, in H5C_load_entry
    unable to load entry
  File "H5Bcache.c", line 143, in H5B__load
    wrong B-tree signature

End of HDF5 error back trace

Problems reading the array data.


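Pandas/PyTables seems to have trouble accessing the same file concurrently, even when it is only for reading.

Is there a way to make each process read its own chunk concurrently?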

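IIUC, you can index the columns that are used for filtering the data (chunk.some_condition == True in your sample code) and then read only the subset of data that satisfies the needed conditions.

In order to be able to do that, you need to:

1. Save the HDF5 file in table format, using the parameter format='table'.

2. Index the columns that will be used for filtering, using the parameter data_columns=['col_name1', 'col_name2', etc.].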
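As a rough sketch of those two steps, assuming a small made-up DataFrame and a temporary file (the column names, the key key_name, and the path are illustrative and not taken from the question):

```python
import os
import tempfile

import pandas as pd

# Hypothetical data; col1 and col2 stand in for the real filter columns.
df = pd.DataFrame({
    "col1": [11, 12, 13, 14],
    "col2": ["AAA", "BBB", "AAA", "AAA"],
    "value": [0.1, 0.2, 0.3, 0.4],
})

# Write to a throwaway location so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

# format="table" stores the data in the queryable PyTables layout;
# data_columns lists the columns that may appear in a `where` expression.
df.to_hdf(path, key="key_name", format="table", data_columns=["col1", "col2"])

# Read back only the rows that match the condition on the data columns.
subset = pd.read_hdf(path, key="key_name", where="col1 >= 12 & col2 == 'AAA'")
print(subset)
```

Columns that are not listed in data_columns (value here) are still stored and returned, but they cannot be used inside where.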

After that, you can filter the data simply by reading it:

store = pd.HDFStore(filename)
df = store.select('key_name', where="col1 in [11,13] & col2 == 'AAA'")
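On the concurrency part of the question: the traceback looks consistent with an HDFStore handle that was opened in the parent process being shared by the forked workers, although that is an assumption and not something confirmed here. A minimal sketch of the usual workaround follows, in which every worker opens its own read-only handle. The helper name read_filtered, the path argument, and the default of 4 processes are hypothetical; the batch_{:03} key pattern and the some_condition column come from the question.

```python
import multiprocessing as mp
from functools import partial

import pandas as pd


def read_and_return_subset(path, dset):
    """Read one chunk through a handle private to the calling process."""
    key = "batch_{:03}".format(dset)
    # Open the file read-only inside the worker, so no handle created in
    # the parent process is reused after the fork.
    with pd.HDFStore(path, mode="r") as store:
        chunk = store.select(key)
    # Same in-memory filter as in the question.
    return chunk[chunk.some_condition == True]


def read_filtered(path, dset_list, processes=4):
    """Hypothetical helper: filter every chunk in a pool and concatenate."""
    worker = partial(read_and_return_subset, path)
    with mp.Pool(processes=processes) as pool:
        frames = pool.map(worker, dset_list)
    return pd.concat(frames)
```

It would be called as read_filtered('data.h5', dset_list) from under an if __name__ == '__main__': guard, with no store opened at module level. Whether this actually speeds things up depends on where the time goes, so it is worth timing against the single-process version.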

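I'm not sure you can improve IO speed by parallelizing: you still need to read the same amount of data from disk, plus the multiprocessing overhead. What are you going to do with the chunks you read? Do you need to process all of the data or only part of it?

@MaxU I dug a little deeper after reading your comment, and it looks like most of the time is spent by the h5 library processing the file, not transferring it from disk. The evidence is that putting the same file on a RAM-based mounted disk gives no speedup at all. That is why I thought reading the file concurrently might speed up the whole process. However, tests show that writing and reading a file saved in table format is much slower than with the fixed format.

@Jivan, writing will definitely be much slower (because of the overhead and the additional indexing). Reading is usually faster when a where clause is used. If you often need to read everything from disk, you might want to stay with the fixed format...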