Storing a dictionary in HDF5 with random indices as keys and simulated values as values, possibly for use with PyTorch?
Updated question: Each entry of an nd-array (say Sim_nDArray) corresponds to one parameter combination chosen from an 8D search space. I have converted it to its 1D equivalent with Sim_nDArray.ravel(). Since I cannot search all ~100 million entries, I decided to pick ~1 million random entries, and I have the corresponding ~1 million simulated values.
I have been able to run the simulation and save the results. However, I cannot seem to load the data correctly: declaring the object "dataset" fails with an error in the overloaded "len".
I plan to use HDF5 to store and read the data. Can someone guide me on how to do this?
def add_trace(arrInd, arr):
    """ Add one trace to the dataset, keeping count of the # of traces written """
    global ntraces
    dset1[ntraces, :] = arrInd
    dset2[ntraces, :] = arr
    ntraces += 1

def done():
    """ After all calls to add_trace_2, trim the dataset to size """
    dset1.resize((ntraces, 1000))
    dset2.resize((ntraces, 1000))

import torch
from torch.utils.data import Dataset, DataLoader

class Dataset(torch.utils.data.Dataset):
    # Characterizes a dataset for PyTorch
    def __init__(self, dset1, dset2):
        'Initialization'
        self.dset1 = dset1
        self.dset2 = dset2
        self._data_len = len(dset1)

    def __len__(self):
        # Denotes the total number of samples
        return len(self._data_len)

    def __getitem__(self, index):
        # Generates one sample of data
        # Select sample
        ID = self.dset1[index]
        SimData = self.dset2[index]
        return ID, SimData

# Running the main.
if __name__ == '__main__':
    import h5py
    import numpy as np
    import timeit

    """ Re-initialize both datasets for the tests """
    global data, N, dset1, dset2, ntraces
    N = 1000

    ################ WRITE ################
    ## Creating two datasets
    f = h5py.File("randomDataset2.hdf5", 'w')
    dset1 = f.create_dataset('dataset1', (5000, 1000), maxshape=(None, 1000), dtype="float32", chunks=(1, 1000))
    dset2 = f.create_dataset('dataset2', (5000, 1000), maxshape=(None, 1000),
                             dtype="float32")  # DK: why faster if I do not define chunk
    dset1.resize((10001, 1000))  # Allocating extra space
    dset2.resize((10001, 1000))  # Allocating extra space

    ## TEST 1: Less efficient way of writing to hdf5
    ntraces = 0
    start1 = timeit.default_timer()
    for idx in range(N):
        IndxVec1 = np.random.randint(low=0, high=1000, size=1000)
        DataVec1 = np.random.random(1000)
        add_trace(IndxVec1, DataVec1)
    done()
    # All the program statements
    stop1 = timeit.default_timer()
    execution_time = stop1 - start1
    print("Program Executed in " + str(execution_time))  # It returns time in seconds
    f.close()

    ##################################
    ## READING HDF files
    fr = h5py.File("randomDataset2.hdf5", 'r')
    dset10 = fr['dataset1']
    dset20 = fr['dataset2']
    fr.close()

    # Parameters
    params = {'batch_size': 64, 'shuffle': True, 'num_workers': 6}
    max_epochs = 100

    # Generators
    training_set = Dataset(dset10, dset20)
    training_generator = torch.utils.data.DataLoader(training_set, batch_size= 64, shuffle=True, num_workers= 6)
Error:
(PipInConda_DKU) dushyant20@DESKTOP-U96RKFC:/mnt/c/PyImageSearch/Sim_Write_n_Read$ python3 main.py
Program Executed in 2.265893899995717
Traceback (most recent call last):
  File "main.py", line 89, in <module>
    training_set = Dataset(dset10, dset20)
  File "main.py", line 30, in __init__
    self._data_len = len(dset1)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/dushyant20/miniconda3/envs/PipInConda_DKU/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 447, in __len__
    size = self.len()
  File "/home/dushyant20/miniconda3/envs/PipInConda_DKU/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 459, in len
    shape = self.shape
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/dushyant20/miniconda3/envs/PipInConda_DKU/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 286, in shape
    return self.id.shape
  File "h5py/h5d.pyx", line 132, in h5py.h5d.DatasetID.shape.__get__
  File "h5py/h5d.pyx", line 133, in h5py.h5d.DatasetID.shape.__get__
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 289, in h5py.h5d.DatasetID.get_space
ValueError: Not a dataset (not a dataset)
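I don't use PyTorch, so I can't comment on that part (or run your entire code). Observations:
- I noticed that two methods of class Dataset are not indented correctly in the post: def __len__(self): and def __getitem__(self, index):. I assume that is a cut-and-paste error in the SO post, but you should double-check.
- I ran your code (after commenting out the PyTorch parts); it ran to completion and created randomDataset2.hdf5 with 2 1000x1000 datasets. So the problem is not in the HDF5 file creation.
- You are passing h5py datasets to the PyTorch generator. However, you close the HDF5 file before you call it, so the data is no longer available at that point. That could be the problem. Also, does the generator expect NumPy arrays or dataset objects? That could also cause problems (once you fix the file closing issue).
- When working with files, use Python's with/as: context manager to avoid open/close problems.
- Recommended practice is to put all imports at the top of the file.
- If you want NumPy arrays instead of h5py datasets, use this call:
  arr10=fr['dataset1'][:]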
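The difference between a dataset handle and an array copy, as described in the observations above, can be seen in a minimal sketch. It assumes a small scratch file (the name demo.hdf5 and its contents are made up for illustration) and relies on h5py objects evaluating as False once their file is closed.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")

with h5py.File(path, "w") as f:
    f.create_dataset("dataset1", data=np.arange(12, dtype="float32").reshape(3, 4))

with h5py.File(path, "r") as fr:
    dset = fr["dataset1"]     # h5py Dataset: a handle into the open file
    arr = fr["dataset1"][:]   # NumPy array: an in-memory copy of the data

# The with-block has closed the file at this point.
handle_is_valid = bool(dset)  # h5py objects evaluate as False once closed

try:
    dset[0]
    read_error = None
except Exception as exc:  # the exact exception type may vary between h5py versions
    read_error = exc

print("handle valid:", handle_is_valid)
print("read from closed handle raised:", read_error)
print("array copy still usable:", arr.shape, float(arr.sum()))
```

The array copy costs memory proportional to the dataset size, so it is only an option when the data fits in RAM; otherwise the file has to stay open for as long as the handle is used.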
import h5py
import numpy as np
import timeit
import torch
from torch.utils.data import Dataset, DataLoader

def add_trace(arrInd, arr):
    """ Add one trace to the dataset, keeping count of the # of traces written """
    global ntraces
    dset1[ntraces, :] = arrInd
    dset2[ntraces, :] = arr
    ntraces += 1

def done():
    """ After all calls to add_trace, trim the dataset to size """
    dset1.resize((ntraces, 1000))
    dset2.resize((ntraces, 1000))

class Dataset(torch.utils.data.Dataset):
    # Characterizes a dataset for PyTorch
    def __init__(self, dset1, dset2):
        'Initialization'
        self.dset1 = dset1
        self.dset2 = dset2
        self._data_len = len(dset1)

    def __len__(self):
        # Denotes the total number of samples
        # (_data_len is already an int, so no second len() call)
        return self._data_len

    def __getitem__(self, index):
        # Generates one sample of data
        # Select sample
        ID = self.dset1[index]
        SimData = self.dset2[index]
        return ID, SimData

# Running the main.
if __name__ == '__main__':
    """ Re-initialize both datasets for the tests """
    global data, N, dset1, dset2, ntraces
    N = 1000

    ################ WRITE ################
    ## Creating two datasets
    with h5py.File("randomDataset2.hdf5", 'w') as f:
        dset1 = f.create_dataset('dataset1', (5000, 1000), maxshape=(None, 1000), dtype="float32", chunks=(1, 1000))
        dset2 = f.create_dataset('dataset2', (5000, 1000), maxshape=(None, 1000),
                                 dtype="float32")  # DK: why faster if I do not define chunk
        dset1.resize((10001, 1000))  # Allocating extra space
        dset2.resize((10001, 1000))  # Allocating extra space

        ## TEST 1: Less efficient way of writing to hdf5
        ntraces = 0
        start1 = timeit.default_timer()
        for idx in range(N):
            IndxVec1 = np.random.randint(low=0, high=1000, size=1000)
            DataVec1 = np.random.random(1000)
            add_trace(IndxVec1, DataVec1)
        done()
        # All the program statements
        stop1 = timeit.default_timer()
        execution_time = stop1 - start1
        print("Program Executed in " + str(execution_time))  # It returns time in seconds

    ##################################
    ## READING HDF files
    with h5py.File("randomDataset2.hdf5", 'r') as fr:
        dset10 = fr['dataset1']     # h5py dataset (valid only while fr is open)
        arr10 = fr['dataset1'][:]   # NumPy array copy
        dset20 = fr['dataset2']
        arr20 = fr['dataset2'][:]

        # Parameters
        params = {'batch_size': 64, 'shuffle': True, 'num_workers': 6}
        max_epochs = 100

        # Generators (created while the file is still open)
        training_set = Dataset(dset10, dset20)
        training_generator = torch.utils.data.DataLoader(training_set, batch_size= 64, shuffle=True, num_workers= 6)
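One way to avoid passing open h5py handles around at all is to give the PyTorch dataset the file path and let it open the file on first access. The following is only a sketch under assumptions, not the answer's method: the class name H5PairDataset, the small 20-row stand-in file, and num_workers=0 are all choices made for this illustration.

```python
import os
import tempfile

import h5py
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class H5PairDataset(Dataset):
    """Hypothetical map-style dataset over two parallel HDF5 datasets."""

    def __init__(self, path, index_name="dataset1", value_name="dataset2"):
        self.path = path
        self.index_name = index_name
        self.value_name = value_name
        self._file = None  # opened lazily on first __getitem__
        # Read the number of rows once, then close the file again.
        with h5py.File(path, "r") as f:
            self._data_len = f[index_name].shape[0]

    def __len__(self):
        return self._data_len  # already an int

    def __getitem__(self, index):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        ids = torch.from_numpy(self._file[self.index_name][index])
        sim = torch.from_numpy(self._file[self.value_name][index])
        return ids, sim


# Small stand-in file (assumption: 20 rows of 1000 float32 values each).
path = os.path.join(tempfile.mkdtemp(), "randomDataset2.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("dataset1", data=np.random.randint(0, 1000, size=(20, 1000)).astype("float32"))
    f.create_dataset("dataset2", data=np.random.random((20, 1000)).astype("float32"))

training_set = H5PairDataset(path)
# num_workers=0 keeps the sketch in a single process.
training_generator = DataLoader(training_set, batch_size=8, shuffle=True, num_workers=0)

batch_shapes = [tuple(ids.shape) for ids, _ in training_generator]
print(len(training_set), batch_shapes)
```

Because the file is opened inside __getitem__, each process that reads samples gets its own handle, which is the usual motivation for this pattern when num_workers is greater than 0; that multi-worker case is not exercised here.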
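h5py and NumPy map onto each other naturally; writing an ndarray to HDF5 is very simple (2 calls): 1) create a file, then 2) create a dataset with the data=your_array parameter. Please add more details about the indices and simulated values. What do your array indices point to? Or do you have one array of simulated values and want to sample from it randomly? Also, please share the code you have written so far (to generate the keys and arrays).
@kcw78 Thanks for your reply. I have updated the question.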
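The two-call pattern from the comment above can be sketched for the "random index as key, simulated value as value" case. HDF5 has no native dictionary type, so this sketch stores the keys and values as two parallel 1D datasets. The scaled-down grid shape, the dataset names flat_index and sim_value, and the random stand-in values are assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, scaled-down 8D search space (the question has ~100 million entries).
shape = (4,) * 8                      # 4**8 = 65,536 parameter combinations
n_total = int(np.prod(shape))
n_samples = 1000

# Unique random flat indices into Sim_nDArray.ravel(), sorted for convenience.
flat_idx = np.sort(rng.choice(n_total, size=n_samples, replace=False))
sim_values = rng.random(n_samples).astype("float32")  # stand-in for simulation output

path = os.path.join(tempfile.mkdtemp(), "sampled_sims.hdf5")  # hypothetical file name

# 1) create a file, 2) create datasets with data=your_array
with h5py.File(path, "w") as f:
    f.create_dataset("flat_index", data=flat_idx)
    f.create_dataset("sim_value", data=sim_values)
    f.attrs["grid_shape"] = shape     # kept so the 8D coordinates can be recovered

with h5py.File(path, "r") as fr:
    keys = fr["flat_index"][:]
    values = fr["sim_value"][:]
    grid_shape = tuple(int(n) for n in fr.attrs["grid_shape"])

# Rebuild the dictionary in memory and map flat indices back to 8D coordinates.
lookup = dict(zip(keys.tolist(), values.tolist()))
coords = np.unravel_index(keys, grid_shape)  # tuple of 8 index arrays

print(len(lookup), len(coords))
```

np.unravel_index is the inverse of the flattening done by ravel() for a C-ordered array, so the stored flat index alone is enough to identify the parameter combination.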