Python: recovering data from a corrupted HDF5 file
I have an HDF5 file that has somehow been corrupted. I am trying to retrieve the parts of the file that are still intact. I can read all datasets from groups that do not contain a corrupted field. However, I cannot read any of the intact datasets from a group that also contains a corrupted dataset.

Interestingly, I can read those same datasets without any problem using HDFView. That is, I can open them and inspect all the values. With HDFView, the only datasets I cannot read are the corrupted ones.

My question is: how can I exploit this to retrieve as much data as possible?

When reading with h5py:
Traceback (most recent call last):
  File "repair.py", line 44, in <module>
    print(data['/dt_yield/000000'][...])
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/usr/local/lib/python3.6/site-packages/h5py/_hl/group.py", line 167, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: 'Unable to open object (bad heap free list)'
Recovery script (using h5py)

So far I have managed to at least recover everything that h5py can read directly:
import numpy as np
import h5py

def getdatasets(key, archive):
    """Recursively collect the paths of all readable datasets under "key"."""
    if key[-1] != '/':
        key += '/'
    out = []
    for name in archive[key]:
        path = key + name
        if isinstance(archive[path], h5py.Dataset):
            out += [path]
        else:
            # a corrupted sub-group raises here; skip it and keep going
            try:
                out += getdatasets(path, archive)
            except Exception:
                pass
    return out

data  = h5py.File('data.hdf5', 'r')
fixed = h5py.File('fixed.hdf5', 'w')

datasets = getdatasets('/', data)

# parent groups of all readable datasets, sorted from shallow to deep
groups = list(set([i[::-1].split('/', 1)[1][::-1] for i in datasets]))
groups = [i for i in groups if len(i) > 0]
idx    = np.argsort(np.array([len(i.split('/')) for i in groups]))
groups = [groups[i] for i in idx]

for group in groups:
    fixed.create_group(group)

for path in datasets:
    # - check path
    if path not in data:
        continue
    # - try reading
    try:
        data[path]
    except Exception:
        continue
    # - get group name
    group = path[::-1].split('/', 1)[1][::-1]
    # - minimum group name
    if len(group) == 0:
        group = '/'
    # - copy data
    data.copy(path, fixed[group])

data.close()
fixed.close()
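As a quick sanity check after running the script, the recovered file can be re-opened and its contents listed. A minimal sketch, assuming the output file name 'fixed.hdf5' used above:

import h5py

with h5py.File('fixed.hdf5', 'r') as fixed:
    paths = []
    fixed.visit(paths.append)            # collect every group/dataset path
    for path in paths:
        if isinstance(fixed[path], h5py.Dataset):
            print(path, fixed[path].shape, fixed[path].dtype)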
I found a simple way to recover everything in all top-level groups that do not contain broken nodes. It can easily be extended to lower-level groups through recursive calls:
import h5py as h5

def RecoverFile(f1, f2):
    """recover read-open HDF5 file f1 to write-open HDF5 file f2"""
    names = []
    f1.visit(names.append)
    for n in names:
        try:
            f2.create_dataset(n, data=f1[n][()])
        except Exception:
            pass

with h5.File(file_broken, 'r') as fb:
    with h5.File(file_recover, 'w') as fr:
        for key in fb.keys():
            try:
                # top-level datasets can be copied directly
                fr.create_dataset(key, data=fb[key][()])
            except Exception:
                try:
                    # otherwise treat the entry as a group and recover its contents
                    fr.create_group(key)
                    RecoverFile(fb[key], fr[key])
                except Exception:
                    # the group itself is broken: drop it from the output
                    del fr[key]
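Note that file_broken and file_recover in the snippet above are placeholders for the actual file paths; for instance (hypothetical example values):

file_broken  = 'data.hdf5'    # path of the corrupted input file (example value)
file_recover = 'fixed.hdf5'   # path of the recovered output file (example value)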
Maybe this is silly, but you can export data from HDFView (right-click a dataset > Export). Depending on the number of datasets it may be tedious, but it is an option.
@pablo_worker Thanks, yes, good point. I was looking for an automated tool, though.
Thanks for your answer. I think it may need to be generalized: what if a dataset is stored at the root, i.e. what if /a is a dataset? Also, the .value method seems to be deprecated.
Thanks for the suggestion! I changed the answer accordingly.
Great! FYI, there is now also a command-line script that does this.
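On the deprecated .value: in current h5py a full read is written with an empty-tuple index, which is what the updated answer uses. A minimal sketch, where 'fixed.hdf5' and the dataset path '/a' are assumed example names:

import h5py

with h5py.File('fixed.hdf5', 'r') as f:
    # old, deprecated API (removed in h5py 3.0): arr = f['/a'].value
    arr = f['/a'][()]   # current way to read the whole dataset into memory
    print(arr)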