How do I lazily read PyTables HDF5 files into Dask with a where clause?

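I have data stored in HDF5 files created by pytables. I need to read these files with read_where() because I need to apply some filter conditions. I saw this neat h5py example in the dask documentation: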

dsets = [h5py.File(fn)['/data'] for fn in sorted(glob('myfiles.*.hdf5'))]
arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]

Normally, I can do something similar with pytables instead of h5py:

dsets = [tables.File(fn).root.data for fn in sorted(glob('myfiles.*.hdf5'))]

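However, I can't find a way in pytables to hand dask a table-like object that has the filter applied and can still be read lazily. read_where() reads the whole array into memory, so I can't do the following, because my data is larger than memory: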

dsets = [tables.File(fn).root.data.read_where('color == "blue"') for fn in sorted(glob('myfiles.*.hdf5'))]

Is there a way around this in pytables? If not, is there a way to wrap my read function in a generator like this and have dask use it to create the array?

def array_generator():
    for fn in sorted(glob('myfiles.*.hdf5')):
        yield tables.File(fn).root.data.read_where('color == "blue"')

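Are you reading tabular data here, i.e., do you want to produce a dataframe rather than an array?

@mdurant It's a numpy structured array, not a dataframe. (I suspect this may have something to do with dask delayed.)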
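
Following the dask delayed hint in the last comment, one possible approach is to wrap each filtered read in dask.delayed and turn the results into dask array blocks of unknown length. This is only a minimal sketch under several assumptions: every file holds a table at /data with identical columns, the helper names read_filtered and lazy_where and the rows_per_block parameter are made up for illustration, and the condition uses a bytes literal (b"blue"), which PyTables expects for string columns on Python 3. It relies on read_where() accepting start and stop, so each task only filters one range of rows.

```python
import numpy as np
import tables
import dask
import dask.array as da


@dask.delayed
def read_filtered(fn, condition, start, stop):
    """Open one file and return the rows in [start, stop) matching `condition`."""
    with tables.open_file(fn, mode="r") as h5:
        return h5.root.data.read_where(condition, start=start, stop=stop)


def lazy_where(filenames, condition, rows_per_block=100_000):
    """Build a lazy 1-D dask array of the matching rows across all files.

    Hypothetical helper: assumes each file has a table at /data and that
    all tables share the same dtype.
    """
    blocks = []
    for fn in filenames:
        # Only metadata (row count and dtype) is read here, not the rows.
        with tables.open_file(fn, mode="r") as h5:
            table = h5.root.data
            nrows, dtype = table.nrows, table.dtype
        for start in range(0, nrows, rows_per_block):
            stop = min(start + rows_per_block, nrows)
            part = read_filtered(fn, condition, start, stop)
            # The number of matching rows is unknown until the block is
            # actually read, so the block length is declared as NaN.
            blocks.append(da.from_delayed(part, shape=(np.nan,), dtype=dtype))
    # Requires at least one non-empty table among the input files.
    return da.concatenate(blocks)
```

With this, something like lazy_where(sorted(glob('myfiles.*.hdf5')), 'color == b"blue"') would return a dask array without reading any rows; the filtering happens block by block at compute time. Because the block lengths are unknown, operations that need exact chunk sizes (such as slicing by position) will not work until the lengths are computed. As far as I know PyTables is not guaranteed to be thread-safe, so it may be safer to compute with scheduler="synchronous" or scheduler="processes" rather than the default threaded scheduler.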