
Store a dictionary in HDF5 with random indices as keys and simulated values as values, possibly using PyTorch?


Updated question:

Each entry in the nd-array (e.g. Sim_nDArray) corresponds to a parameter combination selected from an 8-D search space. I have converted it to its 1-D equivalent using Sim_nDArray.ravel(). Since I cannot search over ~100 million entries, I decided to select ~1 million random entries. I have the corresponding ~1 million simulated values.

I have been able to run the simulation and save the results. However, I cannot seem to load the data correctly: I get an error from the overloaded 'len' when declaring the object 'dataset'.

I plan to use HDF5 to store and read the data. Can someone guide me on how to do this?

def add_trace(arrInd, arr):
    """ Add one trace to the dataset, keeping count of the # of traces written """
    global ntraces
    dset1[ntraces, :] = arrInd
    dset2[ntraces, :] = arr
    ntraces += 1


def done():
    """ After all calls to add_trace_2, trim the dataset to size """
    dset1.resize((ntraces, 1000))
    dset2.resize((ntraces, 1000))


import torch
from torch.utils.data import Dataset, DataLoader

class Dataset(torch.utils.data.Dataset):
    # Characterizes a dataset for PyTorch
    def __init__(self, dset1, dset2):
        'Initialization'
        self.dset1 = dset1
        self.dset2 = dset2
        self._data_len = len(dset1)

def __len__(self):
    # Denotes the total number of samples
    return len(self._data_len)

def __getitem__(self, index):
    # Generates one sample of data
    # Select sample
    ID = self.dset1[index]
    SimData = self.dset2[index]
    return ID, SimData


# Running the main.
if __name__ == '__main__':
    import h5py
    import numpy as np
    import timeit

    """ Re-initialize both datasets for the tests """
    global data, N, dset1, dset2, ntraces
    N = 1000
    ################ WRITE #############################################################################################
    ## Creating two datasets
    f = h5py.File("randomDataset2.hdf5", 'w')
    dset1 = f.create_dataset('dataset1', (5000, 1000), maxshape=(None, 1000), dtype="float32", chunks=(1, 1000))
    dset2 = f.create_dataset('dataset2', (5000, 1000), maxshape=(None, 1000),
                             dtype="float32")  # DK: why faster if I do not define chunk

    dset1.resize((10001, 1000))  # Allocating extra space
    dset2.resize((10001, 1000))  # Allocating extra space

    ## TEST 1: Less efficient way of writing to hdf5
    ntraces = 0
    start1 = timeit.default_timer()
    for idx in range(N):
        IndxVec1 = np.random.randint(low=0, high=1000, size=1000);
        DataVec1 = np.random.random(1000)
        add_trace(IndxVec1, DataVec1)
    done()

    # All the program statements
    stop1 = timeit.default_timer()
    execution_time = stop1 - start1
    print("Program Executed in " + str(execution_time))  # It returns time in seconds
    f.close()
    ##################################
    ## READING HDF files
    fr = h5py.File("randomDataset2.hdf5", 'r')
    dset10 = fr['dataset1']
    dset20 = fr['dataset2']
    fr.close()

    # Parameters
    params = {'batch_size': 64, 'shuffle': True, 'num_workers': 6}
    max_epochs = 100

    # Generators
    training_set = Dataset(dset10, dset20)
    training_generator = torch.utils.data.DataLoader(training_set, batch_size= 64, shuffle=True, num_workers= 6)
Error:

(PipInConda_DKU) dushyant20@DESKTOP-U96RKFC:/mnt/c/PyImageSearch/Sim_Write_n_Read$ python3 main.py
Program Executed in 2.265893899995717
Traceback (most recent call last):
  File "main.py", line 89, in <module>
    training_set = Dataset(dset10, dset20)
  File "main.py", line 30, in __init__
    self._data_len = len(dset1)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/dushyant20/miniconda3/envs/PipInConda_DKU/lib/python3.8/site-packages/h5py/_hl/dat
aset.py", line 447, in __len__
    size = self.len()
  File "/home/dushyant20/miniconda3/envs/PipInConda_DKU/lib/python3.8/site-packages/h5py/_hl/dat
aset.py", line 459, in len
    shape = self.shape
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/dushyant20/miniconda3/envs/PipInConda_DKU/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 286, in shape
    return self.id.shape
  File "h5py/h5d.pyx", line 132, in h5py.h5d.DatasetID.shape.__get__
  File "h5py/h5d.pyx", line 133, in h5py.h5d.DatasetID.shape.__get__
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 289, in h5py.h5d.DatasetID.get_space
ValueError: Not a dataset (not a dataset)

I don't use PyTorch, so I can't comment on that part (or run the complete code). Observations:

  • I noticed that the two methods of class Dataset are not indented correctly: def __len__(self): and def __getitem__(self, index):. I assume this is a cut-and-paste error in the SO post... but you should double check.
  • I ran your code (after commenting out the PyTorch parts) and it ran to completion, creating randomDataset2.hdf5 with two 1000x1000 datasets. So the problem is not in the HDF5 file creation.
  • You are passing h5py dataset objects to the PyTorch generator. However, you close the HDF5 file before that call, so the data is no longer available at that point. That is probably the problem. Also, does the generator expect NumPy arrays or dataset objects? That could also cause problems (once the file-close issue is fixed).

Other comments:

  • Use the with/as: context manager when working with files to avoid open/close problems.
  • Recommended practice is to put all imports at the top of the file.
  • If you want NumPy arrays instead of h5py dataset objects, use this call: arr10 = fr['dataset1'][:]

Modified code reflecting the above is shown below. I don't know if it will fix your problem... but it may point you in the right direction.

import h5py
import numpy as np
import timeit
import torch
from torch.utils.data import Dataset, DataLoader   
    
def add_trace(arrInd, arr):
    """ Add one trace to the dataset, keeping count of the # of traces written """
    global ntraces
    dset1[ntraces, :] = arrInd
    dset2[ntraces, :] = arr
    ntraces += 1


def done():
    """ After all calls to add_trace_2, trim the dataset to size """
    dset1.resize((ntraces, 1000))
    dset2.resize((ntraces, 1000))

class Dataset(torch.utils.data.Dataset):
    # Characterizes a dataset for PyTorch
    def __init__(self, dset1, dset2):
        'Initialization'
        self.dset1 = dset1
        self.dset2 = dset2
        self._data_len = len(dset1)

    def __len__(self):
        # Denotes the total number of samples
        return self._data_len  # already an int, so return it directly (len() on an int raises TypeError)
    
    def __getitem__(self, index):
        # Generates one sample of data
        # Select sample
        ID = self.dset1[index]
        SimData = self.dset2[index]
        return ID, SimData


# Running the main.
if __name__ == '__main__':

    """ Re-initialize both datasets for the tests """
    global data, N, dset1, dset2, ntraces
    N = 1000
    ################ WRITE #############################################################################################
    ## Creating two datasets
    with h5py.File("randomDataset2.hdf5", 'w') as f:
        dset1 = f.create_dataset('dataset1', (5000, 1000), maxshape=(None, 1000), dtype="float32", chunks=(1, 1000))
        dset2 = f.create_dataset('dataset2', (5000, 1000), maxshape=(None, 1000),
                                 dtype="float32")  # DK: why faster if I do not define chunk
    
        dset1.resize((10001, 1000))  # Allocating extra space
        dset2.resize((10001, 1000))  # Allocating extra space
    
        ## TEST 1: Less efficient way of writing to hdf5
        ntraces = 0
        start1 = timeit.default_timer()
        for idx in range(N):
            IndxVec1 = np.random.randint(low=0, high=1000, size=1000);
            DataVec1 = np.random.random(1000)
            add_trace(IndxVec1, DataVec1)
        done()
    
        # All the program statements
        stop1 = timeit.default_timer()
        execution_time = stop1 - start1
        print("Program Executed in " + str(execution_time))  # It returns time in seconds

    ##################################
    ## READING HDF files
    with h5py.File("randomDataset2.hdf5", 'r') as fr:
        dset10 = fr['dataset1']
        arr10 = fr['dataset1'][:]
        dset20 = fr['dataset2']
        arr20 = fr['dataset2'][:]

    # Parameters
    params = {'batch_size': 64, 'shuffle': True, 'num_workers': 6}
    max_epochs = 100

    # Generators -- pass the in-memory arrays, since the HDF5 file is already closed at this point
    training_set = Dataset(arr10, arr20)
    training_generator = torch.utils.data.DataLoader(training_set, batch_size=64, shuffle=True, num_workers=6)
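
If you would rather read from the h5py datasets directly instead of loading everything into memory with [:], one common pattern is to keep (or lazily re-open) the file inside the Dataset so it is still open while the DataLoader iterates. Below is a minimal, untested sketch of that idea; the class name H5Dataset and the one-epoch loop at the end are just illustrative, not part of the original code.

import h5py
import torch
from torch.utils.data import DataLoader

class H5Dataset(torch.utils.data.Dataset):
    """Hypothetical sketch: reads samples lazily from an HDF5 file."""
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self.file = None                      # opened lazily, once per worker process
        with h5py.File(h5_path, 'r') as f:    # only read the length up front
            self._data_len = f['dataset1'].shape[0]

    def __len__(self):
        return self._data_len                 # an int, returned directly

    def __getitem__(self, index):
        if self.file is None:                 # each DataLoader worker opens its own handle
            self.file = h5py.File(self.h5_path, 'r')
        ID = self.file['dataset1'][index]
        SimData = self.file['dataset2'][index]
        return ID, SimData

if __name__ == '__main__':
    training_set = H5Dataset("randomDataset2.hdf5")
    training_generator = DataLoader(training_set, batch_size=64, shuffle=True, num_workers=6)
    for ids, sims in training_generator:      # iterate one epoch to confirm loading works
        pass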

h5py and numpy have a natural mapping, and writing an ndarray to HDF5 is very simple (2 calls): 1) create a file, then 2) create a dataset with the data=your_array argument. Please add more details about the indices and simulation values. What do your array indices point to? Or do you have an array of simulation values and want to sample randomly from it? Also, please share the code you have written so far (to generate the keys and arrays).

@kcw78 Thanks for the reply. I have updated the question.
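
For reference, the two-call pattern mentioned in the comment above could look like the following minimal sketch (the file name "example.hdf5", the dataset name "sim_values", and the placeholder array are just assumptions for illustration):

import h5py
import numpy as np

your_array = np.random.random(1_000_000)   # placeholder for the ~1 million simulated values

# 1) create the file, 2) create a dataset directly from the array
with h5py.File("example.hdf5", "w") as f:
    f.create_dataset("sim_values", data=your_array)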