Releasing memory when reading h5 files in Python

I have a bunch of h5 files, each around 200 GB. The files are structured like this:
file1.h5
├image [float64: 3341 × 126 × 256 × 256]
├pulse [uint64: 126]
└train [uint64: 3341]
I wrote the following code to read these files:
def images_from_disk(file_name, pulse_avg=True, train_idx=0, pulse_idx=0):
    """
    Read image data from an h5 file into an xarray
    """
    hf = h5py.File(file_name, 'r')
    if not pulse_avg:
        coords = {'train': np.array(hf.get('train')),
                  'pulse': np.array(hf.get('pulse'))}
        dims = ['train', 'pulse', 'slow_scan', 'fast_scan']
        xarr = xr.DataArray(np.array(hf.get('image')),
                            dims=dims, coords=coords)
        del hf
        return xarr.isel(train=train_idx).isel(pulse=pulse_idx)
    else:
        coords = {'train': np.array(hf.get('train'))}
        dims = ['train', 'slow_scan', 'fast_scan']
        xarr = xr.DataArray(np.array(hf.get('image')),
                            dims=dims, coords=coords)
        del hf
        return xarr
Note that I explicitly delete the hf object used to read the file.
When I read the whole file, memory usage is as expected, since the object is large:
dummy = images_from_disk('file1.h5', pulse_avg=False, train_idx=slice(None),
                         pulse_idx=slice(None))
dummy.nbytes * (2 ** -30)
205.2626953125
Memory in use before reading:
total used free shared buff/cache available
Mem: 754G 41G 697G 18M 15G 710G
Swap: 4.0G 132M 3.9G
Memory in use after reading:
total used free shared buff/cache available
Mem: 754G 247G 491G 18M 15G 504G
Swap: 4.0G 132M 3.9G
However, if I read the same file but keep only a smaller version (just two pulses instead of 126), the resulting object is much smaller, yet the memory is still not released:
dummy_reduced = images_from_disk('file1.h5', pulse_avg=False, train_idx=slice(None),
                                 pulse_idx=slice(None, 2))
dummy_reduced.nbytes * (2 ** -30)
3.2626953125
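(As a sanity check on this figure, which is not part of the original post: the reported size matches what the reduced array shape implies, 3341 trains × 2 pulses × 256 × 256 pixels of float64, i.e. 8 bytes per element.)

```python
# Expected size of the reduced array in GiB (float64 = 8 bytes per element)
nbytes = 3341 * 2 * 256 * 256 * 8
print(nbytes * 2**-30)  # 3.2626953125
```

So the small object really is only ~3.3 GiB; the ~200 GB of resident memory comes from the full read that happened before slicing.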
Memory in use before reading:
total used free shared buff/cache available
Mem: 754G 41G 697G 18M 15G 710G
Swap: 4.0G 132M 3.9G
Memory in use after reading:
total used free shared buff/cache available
Mem: 754G 247G 491G 18M 15G 504G
Swap: 4.0G 132M 3.9G
How can I release the memory so that I can concatenate more than 3 h5 files?
The code for that task looks something like:
test = xr.concat([images_from_disk(file, pulse_avg=False,
                                   train_idx=slice(None, 10),
                                   pulse_idx=slice(None, 2)) for file in my_files],
                 pd.Index([int(file.stem[-2:]) for file in my_files], name='module'))
hf is an open file. You can close it rather than del it; the file object itself does not take much memory. The memory usage comes from creating a numpy array out of the dataset: np.array(hf.get('train')). According to the h5py docs, the preferred way to load an array is arr = hf['train'][:]. That way you can also load just a part of it: arr = hf['train'][:n]
Thanks @hpaulj! It solved my problem :D
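Putting the answer into practice, a revised reader might look like the sketch below (my own reworking, not code from the original post; it assumes the same file layout as above and that train_idx and pulse_idx are slices, so that the result stays 4-dimensional). Slicing the h5py dataset itself means only the requested region is ever copied into memory, and the with block closes the file deterministically:

```python
import h5py
import numpy as np
import xarray as xr


def images_from_disk(file_name, train_idx=slice(None), pulse_idx=slice(None)):
    """Read only the requested slice of 'image' into an xarray.DataArray."""
    with h5py.File(file_name, 'r') as hf:
        # Slicing the h5py Dataset reads just this region from disk;
        # np.array(hf['image']) would first load the full ~200 GB array.
        data = hf['image'][train_idx, pulse_idx]
        coords = {'train': hf['train'][train_idx],
                  'pulse': hf['pulse'][pulse_idx]}
    # The file is closed here; only the sliced data remains in memory.
    return xr.DataArray(data,
                        dims=['train', 'pulse', 'slow_scan', 'fast_scan'],
                        coords=coords)
```

With this version, the xr.concat loop above only ever holds the small per-file slices, so concatenating many files no longer exhausts memory.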