Batch reading and batch writing: from text file to HDF5 in Python

python numpy

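The goal is to feed large datasets to Tensorflow. I arrived at the following implementation. However, while HDF5 I/O is supposed to be very fast, my implementation is slow. Is this because I am not using the chunks option? I can't seem to get the chunk dimensions right; I assumed I should treat the chunk as a third dimension, like (4096, 7, 1000) for a chunk size of 1000.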

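Note that I could simplify the code below by finding a solution for a single generator. However, I think the data/label combination is very common and could be useful to others.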

I use the following function to create two generators, one for the data and one for the corresponding labels:

def read_chunks(file, dim, batch_size=batch_size):
    chunk = np.empty(dim,)
    current_size = 1
    # read input file line by line
    for line in file:
        current_size += 1
        # build chunk
        chunk = np.vstack((chunk, np.genfromtxt(io.BytesIO(line.encode()))))
        # reaches batch size
        if current_size == batch_size:
            yield chunk
            # reset counters
            current_size = 1
            chunk = np.empty(dim,)

Next, I want to move the data and labels produced by these generators into HDF5:

def write_h5(data_gen, label_gen, out_file, batch_size, h5_batch_size, data_dtype, label_dtype):
    # remove existing file
    if os.path.isfile(out_file):
        os.remove(out_file)
    with h5py.File(out_file, 'a') as f:
        # create a dataset and labelset in the same file
        d = f.create_dataset('data', (batch_size,data_dim), maxshape=(None,data_dim), dtype=data_dtype)
        l = f.create_dataset('label', (batch_size,label_dim), maxshape=(None,label_dim), dtype=label_dtype)
        # use generators to fill both sets
        for data in data_gen:
            d.resize(d.shape[0]+batch_size, axis=0)
            d[-batch_size:] = data
            l.resize(l.shape[0]+batch_size, axis=0)
            l[-batch_size:] = next(label_gen)

With the following constants, I combine the two functions like this:

batch_size = 4096
h5_batch_size = 1000
data_dim = 7 #[NUM_POINT, 9]
label_dim = 1 #[NUM_POINT]
data_dtype = 'float32'
label_dtype = 'uint8'

for data_file, label_file in data_label_files:
    print(data_file)
    with open(data_file, 'r') as data_f, open(label_file, 'r') as label_f:
        data_gen = read_chunks(data_f, dim=data_dim)
        label_gen = read_chunks(label_f, dim=label_dim)
        out_file = data_file[:-4] + '.h5'
        write_h5(data_gen, label_gen, out_file, batch_size, h5_batch_size, data_dtype, label_dtype)

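The problem is not that HDF5 is slow. The problem is that you read one line at a time in a Python loop and call `genfromtxt()` once per line! That function is designed to read whole files. On top of that, you use the anti-pattern `array = vstack((array, newstuff))` in the same loop, which copies the entire accumulated array on every iteration.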

In short, your performance problem starts here:

    chunk = np.vstack((chunk, np.genfromtxt(io.BytesIO(line.encode()))))

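You should read the whole file at once. If you can't do that, read half of it (you can set a maximum number of lines to read each time, for example 1 million).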
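To illustrate the "maximum number of lines per read" idea, here is a minimal sketch, not a definitive implementation. It assumes whitespace-separated numeric columns; `read_blocks`, `n_cols` and `max_rows` are hypothetical names chosen for this example. `itertools.islice` pulls whole lines from the open file, so no line is ever cut in half, and `np.loadtxt` parses each block in a single call.

```python
import io
from itertools import islice

import numpy as np


def read_blocks(file, n_cols, max_rows=4096, dtype="float32"):
    """Yield 2-D arrays of at most `max_rows` rows from an open text file."""
    while True:
        # islice takes up to max_rows complete lines from the file iterator
        lines = list(islice(file, max_rows))
        if not lines:
            break
        # parse the whole block in one call instead of once per line
        block = np.loadtxt(lines, dtype=dtype, ndmin=2)
        # make the (rows, n_cols) layout explicit; raises if the size does not fit
        yield block.reshape(-1, n_cols)


# usage with an in-memory file standing in for a real text file
demo = io.StringIO("1 2 3\n4 5 6\n7 8 9\n")
for block in read_blocks(demo, n_cols=3, max_rows=2):
    print(block.shape)
```

Unlike the generator in the question, the last block may be shorter than `max_rows`, so the trailing lines of the file are not dropped.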

I tried reading in chunks, but that reads a specified number of bytes. This leaves me with half lines, which causes errors about differing numbers of columns. Do you have a suggestion for reading n lines? Another solution would be to make chunk a simple list and apply `np.genfromtxt()` to that list, right? However, this causes problems with `io.BytesIO()` and `.encode()`. Do you have any suggestions on how to implement these two aspects?

For now, I solved it by writing the lines to a list (called chunk) and running `np.genfromtxt()` on this list. The bytes-encoding problem was solved by reading the file in `'rb'` mode.
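On the HDF5 side, the question about chunks can be sketched as follows, assuming h5py is installed. `write_h5_blocks`, `append_rows` and `rows_per_chunk` are hypothetical names for this example. The chunk shape passed to `create_dataset` has the same number of dimensions as the dataset itself, so it is not a third dimension; when `maxshape` is given without `chunks`, h5py chooses a chunk shape automatically. The dataset also starts with zero rows, so no unwritten rows are left at the top of the file.

```python
import os
import tempfile

import h5py
import numpy as np


def append_rows(dset, block):
    """Grow `dset` along axis 0 and write `block` into the new rows."""
    old = dset.shape[0]
    dset.resize(old + block.shape[0], axis=0)
    dset[old:old + block.shape[0]] = block


def write_h5_blocks(blocks, out_file, data_dim, rows_per_chunk=1000, dtype="float32"):
    """Write an iterable of (rows, data_dim) arrays into one resizable dataset."""
    with h5py.File(out_file, "w") as f:  # "w" truncates an existing file
        d = f.create_dataset(
            "data",
            shape=(0, data_dim),                # start empty
            maxshape=(None, data_dim),          # unlimited along the row axis
            chunks=(rows_per_chunk, data_dim),  # 2-D chunk for a 2-D dataset
            dtype=dtype,
        )
        for block in blocks:
            append_rows(d, block)


# usage with two small in-memory blocks of different length
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
write_h5_blocks([np.ones((4, 7)), np.zeros((3, 7))], path, data_dim=7)
```

A label dataset could be filled the same way from a second iterable, calling `append_rows` on it inside the same loop.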