Storing a dictionary in HDF5, with random indices as keys and simulated values as values, possibly using PyTorch?

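Updated question:

Each entry in an n-dimensional array (say, Sim_nDArray) corresponds to one parameter combination chosen from an 8D search space. I have flattened it to its 1D equivalent with Sim_nDArray.ravel(). Because I cannot search all ~100 million entries, I decided to pick ~1 million random entries. I have the corresponding ~1 million simulated values.

I have been able to run the simulation and save the results. However, I cannot seem to load the data correctly. I get an error from the overloaded "len" when I instantiate the "Dataset" object.

I plan to use HDF5 to store and read the data. Can someone guide me on how to do this?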

def add_trace(arrInd, arr):
    """ Add one trace to the dataset, keeping count of the # of traces written """
    global ntraces
    dset1[ntraces, :] = arrInd
    dset2[ntraces, :] = arr
    ntraces += 1


def done():
    """ After all calls to add_trace_2, trim the dataset to size """
    dset1.resize((ntraces, 1000))
    dset2.resize((ntraces, 1000))


import torch
from torch.utils.data import Dataset, DataLoader

class Dataset(torch.utils.data.Dataset):
    # Characterizes a dataset for PyTorch
    def __init__(self, dset1, dset2):
        'Initialization'
        self.dset1 = dset1
        self.dset2 = dset2
        self._data_len = len(dset1)

def __len__(self):
    # Denotes the total number of samples
    return len(self._data_len)

def __getitem__(self, index):
    # Generates one sample of data
    # Select sample
    ID = self.dset1[index]
    SimData = self.dset2[index]
    return ID, SimData


# Running the main.
if __name__ == '__main__':
    import h5py
    import numpy as np
    import timeit

    """ Re-initialize both datasets for the tests """
    global data, N, dset1, dset2, ntraces
    N = 1000
    ################ WRITE #############################################################################################
    ## Creating two datasets
    f = h5py.File("randomDataset2.hdf5", 'w')
    dset1 = f.create_dataset('dataset1', (5000, 1000), maxshape=(None, 1000), dtype="float32", chunks=(1, 1000))
    dset2 = f.create_dataset('dataset2', (5000, 1000), maxshape=(None, 1000),
                             dtype="float32")  # DK: why faster if I do not define chunk

    dset1.resize((10001, 1000))  # Allocating extra space
    dset2.resize((10001, 1000))  # Allocating extra space

    ## TEST 1: Less efficient way of writing to hdf5
    ntraces = 0
    start1 = timeit.default_timer()
    for idx in range(N):
        IndxVec1 = np.random.randint(low=0, high=1000, size=1000);
        DataVec1 = np.random.random(1000)
        add_trace(IndxVec1, DataVec1)
    done()

    # All the program statements
    stop1 = timeit.default_timer()
    execution_time = stop1 - start1
    print("Program Executed in " + str(execution_time))  # It returns time in seconds
    f.close()
    ##################################
    ## READING HDF files
    fr = h5py.File("randomDataset2.hdf5", 'r')
    dset10 = fr['dataset1']
    dset20 = fr['dataset2']
    fr.close()

    # Parameters
    params = {'batch_size': 64, 'shuffle': True, 'num_workers': 6}
    max_epochs = 100

    # Generators
    training_set = Dataset(dset10, dset20)
    training_generator = torch.utils.data.DataLoader(training_set, batch_size= 64, shuffle=True, num_workers= 6)

Error:

(PipInConda_DKU) dushyant20@DESKTOP-U96RKFC:/mnt/c/PyImageSearch/Sim_Write_n_Read$ python3 main.py
Program Executed in 2.265893899995717
Traceback (most recent call last):
  File "main.py", line 89, in <module>
    training_set = Dataset(dset10, dset20)
  File "main.py", line 30, in __init__
    self._data_len = len(dset1)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/dushyant20/miniconda3/envs/PipInConda_DKU/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 447, in __len__
    size = self.len()
  File "/home/dushyant20/miniconda3/envs/PipInConda_DKU/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 459, in len
    shape = self.shape
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/dushyant20/miniconda3/envs/PipInConda_DKU/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 286, in shape
    return self.id.shape
  File "h5py/h5d.pyx", line 132, in h5py.h5d.DatasetID.shape.__get__
  File "h5py/h5d.pyx", line 133, in h5py.h5d.DatasetID.shape.__get__
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 289, in h5py.h5d.DatasetID.get_space
ValueError: Not a dataset (not a dataset)

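I don't use PyTorch, so I can't comment on that part (or run the complete code). Observations:

I noticed that two methods of `class Dataset` are not indented correctly: `def __len__(self):` and `def __getitem__(self, index):`. I assume that is a cut-and-paste error in the SO post, but you should double-check.
I ran your code (after commenting out the PyTorch parts). It ran to completion and created `randomDataset2.hdf5` with two 1000x1000 datasets. So the problem is not in the HDF5 file creation.
You are passing h5py dataset objects to the PyTorch generator. However, you close the HDF5 file before that call, so the data is no longer available at that point. That is probably the problem. Also, does the generator expect NumPy arrays or dataset objects? That could also cause a problem (once the file-close issue is fixed).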
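
To illustrate the file-close point, here is a minimal sketch. The file name, dataset contents, and variable names are made up for the demonstration and are not from the question. It compares an h5py dataset handle with an in-memory copy once the file has been closed.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical throwaway file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("dataset1", data=np.arange(12, dtype="float32").reshape(4, 3))

with h5py.File(path, "r") as fr:
    dset = fr["dataset1"]       # handle that is tied to the open file
    arr = fr["dataset1"][:]     # NumPy copy that lives in memory
    handle_valid_while_open = bool(dset)

# The file is closed here, so the handle no longer refers to anything.
handle_valid_after_close = bool(dset)

try:
    dset[0]                     # reading through the stale handle
except Exception as exc:        # exact exception type may vary by h5py version
    print("read through closed handle failed:", type(exc).__name__)

print(handle_valid_while_open, handle_valid_after_close)
print(arr.shape)                # the in-memory copy is still usable
```

The in-memory copy stays valid because `[:]` reads the data while the file is still open.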

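Other comments:

When working with files, use the `with ... as:` context manager to avoid open/close problems.
It is recommended practice to put all imports at the top of the file.
If you want a NumPy array instead of an h5py dataset, use this call: `arr10 = fr['dataset1'][:]`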

The modified code below reflects the points above. I don't know whether it will solve your problem, but it may point you in the right direction.

import h5py
import numpy as np
import timeit
import torch
from torch.utils.data import Dataset, DataLoader   
    
def add_trace(arrInd, arr):
    """ Add one trace to the dataset, keeping count of the # of traces written """
    global ntraces
    dset1[ntraces, :] = arrInd
    dset2[ntraces, :] = arr
    ntraces += 1


def done():
    """ After all calls to add_trace, trim the dataset to size """
    dset1.resize((ntraces, 1000))
    dset2.resize((ntraces, 1000))

class Dataset(torch.utils.data.Dataset):
    # Characterizes a dataset for PyTorch
    def __init__(self, dset1, dset2):
        'Initialization'
        self.dset1 = dset1
        self.dset2 = dset2
        self._data_len = len(dset1)

    def __len__(self):
        # Denotes the total number of samples
        return self._data_len  # already an int; len() of an int raises TypeError
    
    def __getitem__(self, index):
        # Generates one sample of data
        # Select sample
        ID = self.dset1[index]
        SimData = self.dset2[index]
        return ID, SimData


# Running the main.
if __name__ == '__main__':

    """ Re-initialize both datasets for the tests """
    global data, N, dset1, dset2, ntraces
    N = 1000
    ################ WRITE #############################################################################################
    ## Creating two datasets
    with h5py.File("randomDataset2.hdf5", 'w') as f:
        dset1 = f.create_dataset('dataset1', (5000, 1000), maxshape=(None, 1000), dtype="float32", chunks=(1, 1000))
        dset2 = f.create_dataset('dataset2', (5000, 1000), maxshape=(None, 1000),
                                 dtype="float32")  # DK: why faster if I do not define chunk
    
        dset1.resize((10001, 1000))  # Allocating extra space
        dset2.resize((10001, 1000))  # Allocating extra space
    
        ## TEST 1: Less efficient way of writing to hdf5
        ntraces = 0
        start1 = timeit.default_timer()
        for idx in range(N):
            IndxVec1 = np.random.randint(low=0, high=1000, size=1000)
            DataVec1 = np.random.random(1000)
            add_trace(IndxVec1, DataVec1)
        done()
    
        # All the program statements
        stop1 = timeit.default_timer()
        execution_time = stop1 - start1
        print("Program Executed in " + str(execution_time))  # It returns time in seconds

    ##################################
    ## READING HDF files
    with h5py.File("randomDataset2.hdf5", 'r') as fr:
        dset10 = fr['dataset1']
        arr10 = fr['dataset1'][:]
        dset20 = fr['dataset2']
        arr20 = fr['dataset2'][:]

    # Parameters
    params = {'batch_size': 64, 'shuffle': True, 'num_workers': 6}
    max_epochs = 100

    # Generators
    # Pass the in-memory NumPy copies; dset10/dset20 are invalid once the file is closed
    training_set = Dataset(arr10, arr20)
    training_generator = torch.utils.data.DataLoader(training_set, batch_size= 64, shuffle=True, num_workers= 6)
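
The code above reads both datasets fully into memory with `[:]`. If the data were too large for that, one possible alternative (not part of the original answer) is a map-style dataset that stores only the file path and opens the HDF5 file on first access. The sketch below is a rough illustration under assumptions: the class name `LazyH5Pairs` and the demo file are hypothetical, and the class is written without a torch import so it runs on its own. In real use you would probably subclass `torch.utils.data.Dataset` and hand an instance to `DataLoader`.

```python
import os
import tempfile

import h5py
import numpy as np


class LazyH5Pairs:
    """Hypothetical map-style dataset: implements __len__ and __getitem__."""

    def __init__(self, path, ids_key="dataset1", values_key="dataset2"):
        self.path = path
        self.ids_key = ids_key
        self.values_key = values_key
        self._file = None  # opened lazily, on the first __getitem__ call
        # Open briefly just to record the number of rows, then close again.
        with h5py.File(path, "r") as f:
            self._length = f[ids_key].shape[0]

    def __len__(self):
        return self._length  # a plain int

    def __getitem__(self, index):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        ids = self._file[self.ids_key][index]        # one row as a NumPy array
        values = self._file[self.values_key][index]
        return ids, values

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None


# Small demo file (made-up data) so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "pairs.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("dataset1", data=np.arange(20, dtype="float32").reshape(10, 2))
    f.create_dataset("dataset2", data=np.ones((10, 2), dtype="float32"))

pairs = LazyH5Pairs(path)
ids, values = pairs[3]
print(len(pairs), ids, values)
pairs.close()
```

Because the file is opened only on first access, an instance that has not yet been indexed holds no open h5py handle. That is likely to matter when `num_workers` is greater than zero, since each worker process would then open its own handle.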

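h5py and NumPy map onto each other naturally, so writing an ndarray to HDF5 is simple (2 calls): 1) create a file, then 2) create a dataset using the data=your_array argument. Please add more details about the indices and simulated values. What do your array indices point to? Or do you have an array of simulated values and want to sample randomly from it? Also, please share the code you have written so far (for generating the keys and arrays).

@kcw78 Thanks for your reply. I updated the question.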
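
As a sketch of the two-call pattern from the first comment, applied to the setup described in the question: the flat indices (keys) and simulated values can be stored as two parallel datasets rather than as a literal dictionary. Everything below is an assumption for illustration: the small 8D grid shape, the dataset names `flat_index` and `sim_value`, and the random stand-in values.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)
grid_shape = (4,) * 8                       # hypothetical small 8D search space
n_total = int(np.prod(grid_shape))          # number of entries after ravel()
n_samples = 1000

# Random, unique positions in the flattened array, plus stand-in simulated values.
flat_idx = np.sort(rng.choice(n_total, size=n_samples, replace=False))
sim_values = rng.random(n_samples).astype("float32")

path = os.path.join(tempfile.mkdtemp(), "samples.hdf5")
with h5py.File(path, "w") as f:                        # call 1: create the file
    f.create_dataset("flat_index", data=flat_idx)      # call 2: dataset from an array
    f.create_dataset("sim_value", data=sim_values)
    f.attrs["grid_shape"] = grid_shape                 # keep the shape for later

with h5py.File(path, "r") as f:
    idx_back = f["flat_index"][:]
    val_back = f["sim_value"][:]
    shape_back = tuple(int(n) for n in f.attrs["grid_shape"])

# Recover the 8D coordinates of each sampled entry from its flat index.
coords = np.unravel_index(idx_back, shape_back)

# If a dictionary is really wanted, it can be rebuilt in memory after loading.
lookup = dict(zip(idx_back.tolist(), val_back.tolist()))
print(len(lookup), len(coords))
```

Keeping the keys and values in two aligned arrays means a row `i` always pairs `flat_index[i]` with `sim_value[i]`, which also fits the index-based access a PyTorch dataset needs.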