Can I convert a directory path into something that can be fed into a Python HDF5 data table?


I would like to know how to convert a string or path into something that can be fed into an HDF5 table. For example, I return a numpy img array, a label, and a path for each image from a PyTorch dataloader, where the path to an image looks like this:

('mults/train/0/5678.ndpi/40x/40x-236247-16634-80384-8704.png',)
I basically want to feed it into the hdf5 table like so:

hdf5_file = h5py.File(path, mode='w')
hdf5_file.create_dataset(str(phase) + '_img_paths', (len(dataloaders_dict[phase]),))
I'm not sure whether what I'm trying to do is even possible. Maybe I'm going about feeding this data into the table the wrong way.

I tried:

hdf5_file.create_dataset(str(phase) + '_img_paths', (len(dataloaders_dict[phase]),), dtype="S10")
But I get this error:

 hdf5_file[str(phase) + '_img_paths'][i] = str(paths40x)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/anaconda3/lib/python3.6/site-packages/h5py/_hl/dataset.py", line 708, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 211, in h5py.h5d.DatasetID.write
  File "h5py/h5t.pyx", line 1652, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1713, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U64')

You have two choices when saving string data:

  • You can create a standard dataset in h5py or PyTables and use an arbitrarily large string size. This is the simplest approach, but it carries the risk that your arbitrarily large string size isn't large enough. :) (A short sketch of this option follows right after this list.)
  • Or, you can create variable-length datasets. PyTables calls this dataset type a VLArray, and the object it uses is the class VLStringAtom(). h5py uses a standard dataset, but with a dtype that references the special datatype special_dtype(vlen=str) (note: if you are using h5py 2.10, you can also use string_dtype()).
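
As a minimal sketch of the first option (the file name and the 100-byte cap are made up for illustration): strings must be encoded to bytes before they go into an 'S'-type dataset, which is also why the dtype="S10" attempt above raised No conversion path for dtype: dtype('<U64').

    import h5py

    # Hypothetical stand-ins for the paths returned by the dataloader in the question
    paths = ['mults/train/0/a.png', 'mults/train/1/b.png']

    with h5py.File('paths_fixed.h5', 'w') as h5f:
        # 'S100' = fixed-size 100-byte strings; longer paths would be truncated
        ds = h5f.create_dataset('img_paths', (len(paths),), dtype='S100')
        for i, p in enumerate(paths):
            ds[i] = p.encode('utf-8')  # h5py will not convert unicode to an 'S' dtype for you

    with h5py.File('paths_fixed.h5', 'r') as h5f:
        print(h5f['img_paths'][0].decode('utf-8'))  # -> mults/train/0/a.png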
I created an example that shows how to do this with both PyTables and h5py. It is built around the program referenced in your comment. I did not copy all of the code, just what is needed to retrieve the file names and shuffle them. Also, the kaggle dataset I found has a different directory structure, so I modified the cat_dog_train_path variable to match:

    from random import shuffle
    import glob
    shuffle_data = True  # shuffle the addresses before saving
    cat_dog_train_path = r'.\PetImages\*\*.jpg'  # raw string, so backslashes aren't escapes
    
    # read addresses and labels from the 'train' folder
    addrs = glob.glob(cat_dog_train_path, recursive=True)
    print (len(addrs))
    labels = [0 if 'cat' in addr else 1 for addr in addrs]  # 0 = Cat, 1 = Dog
    
    # to shuffle data
    if shuffle_data:
        c = list(zip(addrs, labels))
        shuffle(c)
        addrs, labels = zip(*c)
    
    # Divide the data into 10% train only, no validation or test
    train_addrs = addrs[0:int(0.1*len(addrs))]
    train_labels = labels[0:int(0.1*len(labels))]
    
    print ('Check glob list data:')
    print (train_addrs[0])
    print (train_addrs[-1])
    
    import tables as tb
    
    # Create a hdf5 file with PyTables and create VLArrays
    # filename to save the hdf5 file
    hdf5_path = 'PetImages_data_1.h5'  
    with tb.open_file(hdf5_path, mode='w') as h5f:
        train_files_ds = h5f.create_vlarray('/', 'train_files', 
                                            atom=tb.VLStringAtom() )
        # loop over train addresses
        for i in range(len(train_addrs)):
            # print how many images are saved every 500 images
            if i % 500 == 0 and i > 1:
                print ('Train data: {}/{}'.format(i, len(train_addrs)) )
            addr = train_addrs[i]
            train_files_ds.append(addr.encode('utf-8'))
    
    with tb.open_file(hdf5_path, mode='r') as h5f:
        train_files_ds = h5f.root.train_files
        print ('Check PyTables data:')
        print (train_files_ds[0].decode('utf-8'))
        print (train_files_ds[-1].decode('utf-8'))
    
    import h5py
    
    # Create a hdf5 file with h5py and create a variable-length string dataset
    # filename to save the hdf5 file
    hdf5_path = 'PetImages_data_2.h5'  
    with h5py.File(hdf5_path, mode='w') as h5f:
        dt = h5py.special_dtype(vlen=str) # can use string_dtype() with h5py 2.10
        train_files_ds = h5f.create_dataset('/train_files', (len(train_addrs),), 
                                            dtype=dt )
    
        # loop over train addresses
        for i in range(len(train_addrs)):
            # print how many images are saved every 500 images
            if i % 500 == 0 and i > 1:
                print ('Train data: {}/{}'.format(i, len(train_addrs)) )
            addr = train_addrs[i]
            train_files_ds[i] = addr
    
    with h5py.File(hdf5_path, mode='r') as h5f:
        train_files_ds = h5f['train_files']
        print ('Check h5py data:')
        print (train_files_ds[0])
        print (train_files_ds[-1])
    

Comments:

Please clarify what you want to do with this dataset. What do you want to name it? Do you want to replicate the folder structure as groups in HDF5? What data will be written to the dataset (image data?)

Hi, I want the dataset to contain numpy arrays of features extracted from the images with a CNN, along with the corresponding labels and, in this example, the paths to the images. I used create_dataset instead of create_array... If it helps, I followed this tutorial for the h5py side:

Awesome, thank you for the reply. I looked at dt = h5py.special_dtype(vlen=str) and found what I was looking for. In my case I used: `if sys.version_info >= (3, 0): string_type = h5py.special_dtype(vlen=str) else: string_type = h5py.special_dtype(vlen=unicode)`
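
Pulling the comment thread together, here is a minimal sketch (not the asker's actual code; the dataset names, feature size, and sample values are hypothetical) of one HDF5 file holding CNN feature arrays, labels, and variable-length image paths, using special_dtype(vlen=str) as discussed above (h5py.string_dtype() is the 2.10+ equivalent):

    import h5py
    import numpy as np

    # Hypothetical sample data standing in for a dataloader's output
    n_samples, n_features = 4, 512  # e.g. a 512-d CNN feature vector per image
    features = np.random.rand(n_samples, n_features).astype('float32')
    labels = np.array([0, 1, 0, 1], dtype='int64')
    paths = ['mults/train/0/a.png', 'mults/train/0/b.png',
             'mults/train/1/c.png', 'mults/train/1/d.png']

    str_dt = h5py.special_dtype(vlen=str)  # h5py.string_dtype() on 2.10+

    with h5py.File('train_data.h5', 'w') as h5f:
        h5f.create_dataset('train_features', data=features)
        h5f.create_dataset('train_labels', data=labels)
        path_ds = h5f.create_dataset('train_img_paths', (n_samples,), dtype=str_dt)
        for i, p in enumerate(paths):
            path_ds[i] = p  # variable-length strings: no encoding or size cap needed

    with h5py.File('train_data.h5', 'r') as h5f:
        print(h5f['train_features'].shape)
        print(h5f['train_labels'][:])
        print(h5f['train_img_paths'][0])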