Releasing memory when reading h5 files in Python

I have a bunch of h5 files, each around 200 GB. The files are structured like this:
file1.h5
├image [float64: 3341 × 126 × 256 × 256]
├pulse [uint64: 126]
└train [uint64: 3341]
I wrote the following code to read these files:
def images_from_disk(file_name, pulse_avg=True, train_idx=0, pulse_idx=0):
    """
    Read image data from an h5 file into an xarray
    """
    hf = h5py.File(file_name, 'r')
    if not pulse_avg:
        coords = {'train': np.array(hf.get('train')),
                  'pulse': np.array(hf.get('pulse'))}
        dims = ['train', 'pulse', 'slow_scan', 'fast_scan']
        xarr = xr.DataArray(np.array(hf.get('image')),
                            dims=dims, coords=coords)
        del hf
        return xarr.isel(train=train_idx).isel(pulse=pulse_idx)
    else:
        coords = {'train': np.array(hf.get('train'))}
        dims = ['train', 'slow_scan', 'fast_scan']
        xarr = xr.DataArray(np.array(hf.get('image')),
                            dims=dims, coords=coords)
        del hf
        return xarr
Note that I explicitly delete the hf object used to read the file.
When I read the whole file, memory usage is as expected, since the object is large:
dummy = images_from_disk('file1.h5', pulse_avg=False, train_idx=slice(None),
                         pulse_idx=slice(None))
dummy.nbytes * (2 ** -30)
205.2626953125
Memory in use before reading:
total used free shared buff/cache available
Mem: 754G 41G 697G 18M 15G 710G
Swap: 4.0G 132M 3.9G
Memory in use after reading:
total used free shared buff/cache available
Mem: 754G 247G 491G 18M 15G 504G
Swap: 4.0G 132M 3.9G
However, if I read the same file but keep only a smaller version (just two pulses instead of 126), the resulting object is much smaller, yet the memory is still not released:
dummy_reduced = images_from_disk('file1.h5', pulse_avg=False, train_idx=slice(None),
                                 pulse_idx=slice(None, 2))
dummy_reduced.nbytes * (2 ** -30)
3.2626953125
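(As a sanity check on this figure, which is not part of the original post: the reported size matches what the reduced array shape implies, 3341 trains × 2 pulses × 256 × 256 pixels of float64, i.e. 8 bytes per element.)

```python
# Expected size of the reduced array in GiB (float64 = 8 bytes per element)
nbytes = 3341 * 2 * 256 * 256 * 8
print(nbytes * 2**-30)  # 3.2626953125
```

So the small object really is only ~3.3 GiB; the ~200 GB of resident memory comes from the full read that happened before slicing.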
Memory in use before reading:
total used free shared buff/cache available
Mem: 754G 41G 697G 18M 15G 710G
Swap: 4.0G 132M 3.9G
Memory in use after reading:
total used free shared buff/cache available
Mem: 754G 247G 491G 18M 15G 504G
Swap: 4.0G 132M 3.9G
How can I release the memory so that I can concatenate more than 3 h5 files?
The code for that task looks something like:
test = xr.concat([images_from_disk(file, pulse_avg=False,
                                   train_idx=slice(None, 10),
                                   pulse_idx=slice(None, 2)) for file in my_files],
                 pd.Index([int(file.stem[-2:]) for file in my_files], name='module'))
hf is an open file. You can close it rather than del it; the file object itself does not take much memory. The memory usage comes from creating a numpy array out of the dataset: np.array(hf.get('train')). According to the h5py docs, the preferred way to load an array is arr = hf['train'][:]. That way you can also load just a part of it: arr = hf['train'][:n]
Thanks @hpaulj! It solved my problem :D
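Putting the answer into practice, a revised reader might look like the sketch below (my own reworking, not code from the original post; it assumes the same file layout as above and that train_idx and pulse_idx are slices, so that the result stays 4-dimensional). Slicing the h5py dataset itself means only the requested region is ever copied into memory, and the with block closes the file deterministically:

```python
import h5py
import numpy as np
import xarray as xr


def images_from_disk(file_name, train_idx=slice(None), pulse_idx=slice(None)):
    """Read only the requested slice of 'image' into an xarray.DataArray."""
    with h5py.File(file_name, 'r') as hf:
        # Slicing the h5py Dataset reads just this region from disk;
        # np.array(hf['image']) would first load the full ~200 GB array.
        data = hf['image'][train_idx, pulse_idx]
        coords = {'train': hf['train'][train_idx],
                  'pulse': hf['pulse'][pulse_idx]}
    # The file is closed here; only the sliced data remains in memory.
    return xr.DataArray(data,
                        dims=['train', 'pulse', 'slow_scan', 'fast_scan'],
                        coords=coords)
```

With this version, the xr.concat loop above only ever holds the small per-file slices, so concatenating many files no longer exhausts memory.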