Python h5py: how to read selected rows of an HDF5 file? (tags: python, numpy, dataset, h5py)


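Is it possible to read a given set of rows from an HDF5 file without loading the whole file? I have quite large HDF5 files with many datasets. Here is an example of what I would like to do with less time and memory: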

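#! /usr/bin/env python

import numpy as np
import h5py

infile = 'field1.87.hdf5'
f = h5py.File(infile, 'r')
group = f['Data']

# .value reads the whole 'mdisk' dataset into memory
# (removed in h5py 3.0; group['mdisk'][()] is the replacement)
mdisk = group['mdisk'].value

val = 2.*pow(10.,10.)
# integer indices of the rows where mdisk exceeds val
ind = np.where(mdisk>val)[0]

# this is the line that raises the TypeError shown below
m = group['mcold'][ind]
print(m)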

ind does not select contiguous rows, but scattered ones.

The code above fails, although it follows the standard way of slicing an HDF5 dataset. The error message I get is:

Traceback (most recent call last):
  File "./read_rows.py", line 17, in <module>
    m = group['mcold'][ind]
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/dataset.py", line 425, in __getitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 71, in select
    sel[arg]
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 209, in __getitem__
    raise TypeError("PointSelection __getitem__ only works with bool arrays")
TypeError: PointSelection __getitem__ only works with bool arrays


I have a sample h5py file with:

data = f['data']
#  <HDF5 dataset "data": shape (3, 6), type "<i4">
# is arange(18).reshape(3,6)
ind=np.where(data[:]%2)[0]
# array([0, 0, 0, 1, 1, 1, 2, 2, 2], dtype=int32)
data[ind]  # getitem only works with boolean arrays error
data[ind.tolist()] # can't read data (Dataset: Read failed) error

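An index array works when its values are unique and it is combined with the proper slicing along the other dimension: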

In [157]: data[ind[[0,3,6]],:]
Out[157]: 
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])
In [165]: f['data'][:2,np.array([0,3,5])]
Out[165]: 
array([[ 0,  3,  5],
       [ 6,  9, 11]])
In [166]: f['data'][[0,1],np.array([0,3,5])]  
# error about only one indexing array allowed

So if the indexing is right, with unique values and a shape that matches the array dimensions, it should work.
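A minimal sketch of that rule, assuming a small in-memory file that mirrors the sample dataset (the file name and the threshold of 9 are made up for illustration). np.where returns one row index per matching element, so the indices are reduced to unique, sorted values before they are handed to the dataset:

```python
import numpy as np
import h5py

# In-memory HDF5 file (nothing is written to disk); the name is a placeholder.
with h5py.File("sketch_rows.h5", "w", driver="core", backing_store=False) as f:
    data = f.create_dataset("data", data=np.arange(18).reshape(3, 6))

    # np.where yields one row index per matching element, so rows repeat.
    # Note that data[:] still loads the whole dataset to evaluate the condition.
    raw = np.where(data[:] > 9)[0]

    # np.unique drops the repeats and returns the indices in increasing order.
    rows = np.unique(raw)

    # A plain list of unique, increasing indices plus a full slice selects whole rows.
    selected = data[rows.tolist(), :]

print(rows)
print(selected)
```

Passing a plain list rather than the NumPy array is a conservative choice here, since the traceback in the question comes from an h5py version that rejected integer arrays.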

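My simple example does not test how much of the array is loaded. The documentation sounds as if elements are selected from the file without loading the whole array into memory.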
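If memory is the concern, one option is to evaluate the condition block by block, so that only a slice of the dataset is held in memory at any time. The sketch below assumes one-dimensional datasets named as in the question; the helper matching_rows, the block size and the sample values are hypothetical:

```python
import numpy as np
import h5py

def matching_rows(dset, threshold, block=4):
    """Return sorted indices i with dset[i] > threshold, reading `block` rows at a time."""
    hits = []
    for start in range(0, dset.shape[0], block):
        chunk = dset[start:start + block]  # only this slice is read from the file
        hits.append(start + np.nonzero(chunk > threshold)[0])
    return np.concatenate(hits) if hits else np.empty(0, dtype=np.intp)

# In-memory stand-in for the real file; values are invented for the example.
with h5py.File("sketch_blocks.h5", "w", driver="core", backing_store=False) as f:
    mdisk = f.create_dataset("mdisk", data=np.array([1., 5., 2., 8., 3., 9., 0., 7., 6., 4.]))
    mcold = f.create_dataset("mcold", data=np.arange(10) * 10.0)

    idx = matching_rows(mdisk, 5.5)
    # The indices are unique and increasing, so they can be used as a selection.
    m = mcold[idx.tolist()]

print(idx)
print(m)
```

In practice the block size would be much larger than 4. The case where no row matches is not handled in this sketch and would need its own check before indexing.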

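Saying it "fails" without showing the error message, or what goes wrong, is a big no-no. You are loading the whole mdisk array into memory. I would have to dig into the documentation to determine how much of mcold is loaded. That may depend on whether ind is a compact slice or values scattered through the array.

Yes! Thanks. It was indeed a matter of matching the array dimensions. In the example code above, it is enough to change the where statement to: ind=(mdisk>val).

Of course, selecting rows is easy once the h5 file has been converted to an array, but the question is: can we remove rows without creating an array? In my case that would be very useful, because I cannot load the whole array into memory. So I would like to extract rows directly from the h5 file. Thanks a lot.

@Tbertin, my data is the dataset, not a loaded array. So I did demonstrate how to load selected rows. Slice indexing works as well.

Even if data is a dataset, as soon as you write data[index] you create an array and load all the selected rows into memory.

@Tbertin, so by "remove" and "extract" do you mean changing the data in the file itself? If so, you need to look at the underlying HDF5 code, not the Python interface.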
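The fix mentioned in the comments, replacing the np.where call with the boolean mask itself, can be sketched as follows. This assumes mdisk and mcold are one-dimensional datasets of equal length; the in-memory file and its values are invented for the example:

```python
import numpy as np
import h5py

# In-memory stand-in for 'field1.87.hdf5'; the contents are made up.
with h5py.File("sketch_mask.h5", "w", driver="core", backing_store=False) as f:
    group = f.create_group("Data")
    group.create_dataset("mdisk", data=np.array([1e9, 3e10, 5e9, 4e10, 2.5e10]))
    group.create_dataset("mcold", data=np.array([10., 20., 30., 40., 50.]))

    val = 2e10
    # Boolean mask with the same shape as the dataset (mdisk is still read in full).
    mask = group["mdisk"][:] > val
    # h5py accepts a boolean mask as a selection and returns the matching elements.
    m = group["mcold"][mask]

print(m)
```

The result is an ordinary NumPy array holding only the selected values of mcold.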
