Python: releasing memory when reading h5 files

Tags: python, memory-management, python-xarray, h5py

I have a bunch of h5 files, each about 200 GB. The files are structured like this:

file1.h5
├image  [float64: 3341 × 126 × 256 × 256]
├pulse  [uint64: 126]
└train  [uint64: 3341]
I wrote the following code to read these files:

import h5py
import numpy as np
import xarray as xr


def images_from_disk(file_name, pulse_avg=True, train_idx=0, pulse_idx=0):
    """
    Read image data from an h5 file into an xarray DataArray.
    """
    hf = h5py.File(file_name, 'r')

    if not pulse_avg:
        coords = {'train': np.array(hf.get('train')),
                  'pulse': np.array(hf.get('pulse'))}
        dims = ['train', 'pulse', 'slow_scan', 'fast_scan']
        # np.array(...) materializes the full 'image' dataset in memory
        xarr = xr.DataArray(np.array(hf.get('image')),
                            dims=dims, coords=coords)
        del hf
        return xarr.isel(train=train_idx).isel(pulse=pulse_idx)
    else:
        coords = {'train': np.array(hf.get('train'))}
        dims = ['train', 'slow_scan', 'fast_scan']
        xarr = xr.DataArray(np.array(hf.get('image')),
                            dims=dims, coords=coords)
        del hf
        return xarr
Note that I am explicitly deleting the hf object used to read the file.

When reading the entire file, the memory usage is as expected, since the object is large:

dummy = images_from_disk('file1.h5', pulse_avg=False, train_idx=slice(None),
                         pulse_idx=slice(None))
dummy.nbytes * (2 ** -30)
205.2626953125
Memory used before reading:

              total        used        free      shared  buff/cache   available
Mem:           754G         41G        697G         18M         15G        710G
Swap:          4.0G        132M        3.9G
Memory used after reading:

              total        used        free      shared  buff/cache   available
Mem:           754G        247G        491G         18M         15G        504G
Swap:          4.0G        132M        3.9G
However, if I read the same file but keep only a reduced version (just two pulses instead of 126), the resulting object is much smaller, yet the memory is not released:

dummy_reduced = images_from_disk('file1.h5', pulse_avg=False, train_idx=slice(None),
                                 pulse_idx=slice(None, 2))
dummy_reduced.nbytes * (2 ** -30)
3.2626953125
Memory used before reading:

              total        used        free      shared  buff/cache   available
Mem:           754G         41G        697G         18M         15G        710G
Swap:          4.0G        132M        3.9G
Memory used after reading:

              total        used        free      shared  buff/cache   available
Mem:           754G        247G        491G         18M         15G        504G
Swap:          4.0G        132M        3.9G
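
This looks like ordinary numpy view semantics: basic slicing returns a view, and a view keeps the entire base buffer alive even after the original name is deleted. A minimal sketch with a small toy array (hypothetical shapes, not the real data):

import numpy as np

big = np.zeros((334, 126, 64, 64))   # toy stand-in for the 'image' array
small = big[:, :2]                   # basic slicing returns a view, not a copy
del big                              # the buffer survives: 'small' still references it

print(small.base is not None)        # True: the full base array is still alive
print(small.base.nbytes * 2 ** -30)  # ~1.3 GiB held for a view that reports far less

small = small.copy()                 # an explicit copy breaks the link to the base
print(small.base is None)            # True: only the small copy remains in memory

If xarray's isel slices the numpy-backed data the same way, that would explain why dummy_reduced reports 3 GB while the process keeps holding the full 205 GB.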
How can I free this memory, so that I can concatenate more than 3 h5 files? The code that performs this task looks something like:

test = xr.concat([images_from_disk(file, pulse_avg=False,
                                   train_idx=slice(None, 10),
                                   pulse_idx=slice(None, 2)) for file in my_files],
                 pd.Index([int(file.stem[-2:]) for file in my_files], name='module'))

hf is an open file; you can close it rather than del it. It does not take much memory itself. The memory usage comes from creating a numpy array out of the dataset: np.array(hf.get('train')). According to the h5py docs, the preferred way to load an array is arr = hf['train'][:]. That way you can also load only a part of it: arr = hf['train'][:n].
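
Based on that, a minimal sketch of the reader rewritten to slice inside h5py before anything is materialized (a hypothetical rewrite with the same signature, assuming slice indices and that pulse-averaged files store a 3D image):

import h5py
import xarray as xr


def images_from_disk(file_name, pulse_avg=True,
                     train_idx=slice(None), pulse_idx=slice(None)):
    """Read only the requested slice of 'image' into an xarray DataArray."""
    with h5py.File(file_name, 'r') as hf:  # the file is closed on exit
        if not pulse_avg:
            # Slicing the h5py dataset reads only the requested region from disk
            data = hf['image'][train_idx, pulse_idx]
            coords = {'train': hf['train'][train_idx],
                      'pulse': hf['pulse'][pulse_idx]}
            dims = ['train', 'pulse', 'slow_scan', 'fast_scan']
        else:
            # Assumes pulse-averaged files store a 3D image [train, slow, fast]
            data = hf['image'][train_idx]
            coords = {'train': hf['train'][train_idx]}
            dims = ['train', 'slow_scan', 'fast_scan']
    return xr.DataArray(data, dims=dims, coords=coords)

With this, the xr.concat call above should only ever hold the sliced pieces in memory rather than the full 205 GB per file.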
Thanks @hpaulj! That solved my problem :D