Appending to an h5 file in Python (hdf5, h5py)


I have an h5 file that contains a dataset like this:

col1.      col2.      col3
 1           3          5
 5           4          9
 6           8          0
 7           2          5
 2           1          2

I have another h5 file with the same columns:

col1.      col2.      col3
 6           1          9
 8           2          7

I want to concatenate the two to get the following h5 file:

col1.      col2.      col3
 1           3          5
 5           4          9
 6           8          0
 7           2          5
 2           1          2
 6           1          9
 8           2          7

What is the most efficient way to do this if the files are large, or if we have many such merges?

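I'm not very familiar with pandas, so I can't help there. This can be done with h5py or pytables. As @hpaulj mentioned, the process reads the dataset into a numpy array and then writes it to an HDF5 dataset with h5py. The exact procedure depends on the dataset's maxshape attribute, which controls whether the dataset can be resized.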
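
Since the choice of procedure hinges on maxshape, it can help to inspect it first. The following is a minimal sketch, not part of the original answer: the helper name is_resizable and the demo file and dataset names are made up for illustration. It only relies on the shape and maxshape properties of an h5py dataset.

```python
import os
import tempfile

import h5py
import numpy as np


def is_resizable(dset, axis=0):
    """Return True if `dset` can still grow along `axis` (hypothetical helper)."""
    limit = dset.maxshape[axis]
    # None means "unlimited"; otherwise the axis can grow only while the
    # limit is larger than the current length.
    return limit is None or limit > dset.shape[axis]


# Demo on a throwaway file (file and dataset names are illustrative only).
demo_path = os.path.join(tempfile.mkdtemp(), "maxshape_demo.h5")
with h5py.File(demo_path, "w") as h5f:
    # Without maxshape=, the dataset is fixed-size: maxshape equals shape.
    fixed = h5f.create_dataset("fixed", data=np.zeros((5, 3)))
    # With maxshape=(None, 3), axis 0 is unlimited.
    growable = h5f.create_dataset("growable", data=np.zeros((5, 3)),
                                  maxshape=(None, 3))
    fixed_ok = is_resizable(fixed)
    growable_ok = is_resizable(growable)
    print(fixed.maxshape, fixed_ok)
    print(growable.maxshape, growable_ok)
```

If the check fails for the dataset you want to extend, the data has to be copied into a new dataset, as in Method 1.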

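I created examples that show both approaches (fixed-size or resizable datasets). The first method creates a new file3 that combines the values from file1 and file2. The second method appends the values from file2 to file1e (which is resizable). Note: the code that creates the files used in the examples is at the end.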

I also have a longer answer that shows all of the ways to copy data.

Method 1: Combine the datasets into a new file
Required when the datasets were not created with the maxshape= parameter.

import h5py

with h5py.File('file1.h5','r') as h5f1,  \
     h5py.File('file2.h5','r') as h5f2,  \
     h5py.File('file3.h5','w') as h5f3 :

    print(h5f1['ds_1'].shape, h5f1['ds_1'].maxshape)
    print(h5f2['ds_2'].shape, h5f2['ds_2'].maxshape)

    # Row counts of the two sources and of the combined dataset
    arr1_a0 = h5f1['ds_1'].shape[0]
    arr2_a0 = h5f2['ds_2'].shape[0]
    arr3_a0 = arr1_a0 + arr2_a0
    h5f3.create_dataset('ds_3', dtype=h5f1['ds_1'].dtype,
                        shape=(arr3_a0,3), maxshape=(None,3))

    # Read file1's dataset into a numpy array and write it to the first rows
    xfer_arr1 = h5f1['ds_1'][:]
    h5f3['ds_3'][0:arr1_a0, :] = xfer_arr1

    # Read file2's dataset and write it to the remaining rows
    xfer_arr2 = h5f2['ds_2'][:]
    h5f3['ds_3'][arr1_a0:arr3_a0, :] = xfer_arr2

    print(h5f3['ds_3'].shape, h5f3['ds_3'].maxshape)
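
The question asks about large files, and the approach above holds each source dataset fully in memory while it is copied. One possible variation is to copy a bounded number of rows at a time. This is a sketch under assumptions, not the answer's own code: the function name concat_to_new_file, its sources argument (a list of file path and dataset name pairs), and the block_rows parameter are all hypothetical, and every source is assumed to be 2-D with the same column count and dtype.

```python
import os
import tempfile

import h5py
import numpy as np


def concat_to_new_file(sources, out_path, out_name, block_rows=100_000):
    """Concatenate 2-D datasets along axis 0 into a new file, block by block.

    sources: list of (file_path, dataset_name) pairs (hypothetical layout).
    Assumes all sources share the same column count and dtype.
    """
    # First pass: total row count; column count and dtype from the sources.
    total_rows = 0
    for path, name in sources:
        with h5py.File(path, "r") as h5in:
            total_rows += h5in[name].shape[0]
            ncols = h5in[name].shape[1]
            dtype = h5in[name].dtype

    with h5py.File(out_path, "w") as h5out:
        dst = h5out.create_dataset(out_name, shape=(total_rows, ncols),
                                   dtype=dtype, maxshape=(None, ncols))
        # Second pass: copy at most `block_rows` rows per read/write.
        offset = 0
        for path, name in sources:
            with h5py.File(path, "r") as h5in:
                src = h5in[name]
                for start in range(0, src.shape[0], block_rows):
                    stop = min(start + block_rows, src.shape[0])
                    nrows = stop - start
                    dst[offset:offset + nrows, :] = src[start:stop, :]
                    offset += nrows


# Demo with the sample values from the question, written to a temp directory.
tmp_dir = tempfile.mkdtemp()
f1 = os.path.join(tmp_dir, "file1.h5")
f2 = os.path.join(tmp_dir, "file2.h5")
f3 = os.path.join(tmp_dir, "file3.h5")
arr1 = np.array([[1, 3, 5], [5, 4, 9], [6, 8, 0], [7, 2, 5], [2, 1, 2]])
arr2 = np.array([[6, 1, 9], [8, 2, 7]])
with h5py.File(f1, "w") as h5f:
    h5f.create_dataset("ds_1", data=arr1)
with h5py.File(f2, "w") as h5f:
    h5f.create_dataset("ds_2", data=arr2)

# A tiny block size is used here only to exercise the block loop.
concat_to_new_file([(f1, "ds_1"), (f2, "ds_2")], f3, "ds_3", block_rows=2)
with h5py.File(f3, "r") as h5f:
    merged = h5f["ds_3"][:]
print(merged.shape)
```

A suitable block_rows depends on row width and available memory; the default here is arbitrary.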

Method 2: Append the file2 dataset to the file1 dataset
The dataset in file1e must have been created with the maxshape= parameter.

import h5py

with h5py.File('file1e.h5','r+') as h5f1, \
     h5py.File('file2.h5','r') as h5f2 :

    print(h5f1['ds_1e'].shape, h5f1['ds_1e'].maxshape)
    print(h5f2['ds_2'].shape, h5f2['ds_2'].maxshape)

    # Current row count, rows to add, and the new total
    arr1_a0 = h5f1['ds_1e'].shape[0]
    arr2_a0 = h5f2['ds_2'].shape[0]
    arr3_a0 = arr1_a0 + arr2_a0

    # Grow the existing dataset along axis 0 to make room for the new rows
    h5f1['ds_1e'].resize(arr3_a0,axis=0)

    # Read file2's dataset into a numpy array and write it to the new rows
    xfer_arr2 = h5f2['ds_2'][:]
    h5f1['ds_1e'][arr1_a0:arr3_a0, :] = xfer_arr2

    print(h5f1['ds_1e'].shape, h5f1['ds_1e'].maxshape)

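
For the "many such merges" case in the question, the resize-then-write step can be wrapped in a small function and called once per source file. This is only a sketch: append_rows is a hypothetical helper, and file2b.h5 is an extra made-up source file added to show the loop. It assumes the target dataset was created with maxshape= so that axis 0 can grow.

```python
import os
import tempfile

import h5py
import numpy as np


def append_rows(dst, src):
    """Append all rows of `src` to the resizable dataset `dst` along axis 0."""
    if dst.shape[1:] != src.shape[1:]:
        raise ValueError("source and target differ in their trailing dimensions")
    old_rows = dst.shape[0]
    new_rows = old_rows + src.shape[0]
    limit = dst.maxshape[0]
    if limit is not None and limit < new_rows:
        raise ValueError("target maxshape is too small for the appended rows")
    dst.resize(new_rows, axis=0)        # grow the dataset first
    dst[old_rows:new_rows, ...] = src[...]  # then fill the new rows


# Demo in a temp directory; file2b.h5 is a made-up second source.
tmp_dir = tempfile.mkdtemp()
target = os.path.join(tmp_dir, "file1e.h5")
src_paths = [os.path.join(tmp_dir, "file2.h5"),
             os.path.join(tmp_dir, "file2b.h5")]
arr1 = np.array([[1, 3, 5], [5, 4, 9], [6, 8, 0], [7, 2, 5], [2, 1, 2]])
src_arrays = [np.array([[6, 1, 9], [8, 2, 7]]), np.array([[4, 4, 4]])]

with h5py.File(target, "w") as h5f:
    h5f.create_dataset("ds_1e", data=arr1, maxshape=(None, 3))
for path, arr in zip(src_paths, src_arrays):
    with h5py.File(path, "w") as h5f:
        h5f.create_dataset("ds_2", data=arr)

# Open the target once in r+ mode and append each source in turn.
with h5py.File(target, "r+") as h5dst:
    for path in src_paths:
        with h5py.File(path, "r") as h5src:
            append_rows(h5dst["ds_1e"], h5src["ds_2"])
    appended = h5dst["ds_1e"][:]
print(appended.shape)
```

Opening the target file once and reusing the handle avoids reopening it for every merge.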
Code to create the example files used above:

import h5py
import numpy as np

arr1 = np.array([[ 1, 3, 5 ],
                 [ 5, 4, 9 ],
                 [ 6, 8, 0 ],
                 [ 7, 2, 5 ],
                 [ 2, 1, 2 ]] )

# file1.h5: fixed-size dataset (no maxshape=)
with h5py.File('file1.h5','w') as h5f:
    h5f.create_dataset('ds_1',data=arr1)
    print(h5f['ds_1'].maxshape)

# file1e.h5: resizable dataset (unlimited along axis 0)
with h5py.File('file1e.h5','w') as h5f:
    h5f.create_dataset('ds_1e',data=arr1, shape=(5,3), maxshape=(None,3))
    print(h5f['ds_1e'].maxshape)

arr2 = np.array([[ 6, 1, 9 ],
                 [ 8, 2, 7 ]] )

# file2.h5: the data to be appended
with h5py.File('file2.h5','w') as h5f:
    h5f.create_dataset('ds_2',data=arr2)

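h5_1.append(h5_2)? Are they pandas dataframes? If so, h5_concat = pandas.concat(h5_1, h5_2).

This isn't a merge, it's a concatenation, and they aren't dataframes. They are two h5 files.

pd.concat([h5_1, h5_2], axis=0)

@wwnde Are you suggesting converting the h5 files to pandas dataframes first?

h5 files store data in datasets. h5f1.keys() yields a list of the object names at the root level. In your case, are they datasets named 'col1', 'col2', 'col3'? Does h5f2.keys() yield the same names? If so, do you want to combine the data from h5f2['col1'] with h5f1['col1'], and likewise for 'col2' and 'col3'? If that's the case, it's the same process for each of the 3 datasets. Do I need to modify my example to show how to iterate over the keys/datasets? It would be "slightly more complicated".

Thanks for your answer. Please let me know if there is any way to append h5f2['col1'] directly to h5f1['col1'], rather than creating a new dataset as h5f3['col1'] and adding the two datasets to it one after the other?

The second part of the example does exactly that. It opens 'file1e.h5' in append mode (r+), resizes the dataset, then appends the data from 'file2.h5'. Appending to a dataset requires that it was defined as "resizable" when it was first created (using the maxshape= parameter shown in the example). The value for axis 0 must be either a) None, to allow unlimited size, or b) a value at least as large as the combined lengths of h5f1['col1'] and h5f2['col1']. You need to check this attribute for all 3 datasets in your file.

The second part is what I was looking for. Thanks very much for your help.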
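
The comments mention iterating over the keys when each column is stored as its own dataset. A minimal sketch of that idea follows. It assumes a layout the thread only hints at: root-level 1-D datasets named 'col1', 'col2', 'col3' in both files, with the target datasets created using maxshape=(None,). The function name append_matching_datasets and the file names are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def append_matching_datasets(dst_path, src_path):
    """Append each root-level dataset in src_path to the same-named one in dst_path."""
    with h5py.File(dst_path, "r+") as h5dst, h5py.File(src_path, "r") as h5src:
        for name in h5src.keys():
            src = h5src[name]
            # Skip groups and any dataset that has no counterpart in the target.
            if not isinstance(src, h5py.Dataset) or name not in h5dst:
                continue
            dst = h5dst[name]
            old_len = dst.shape[0]
            dst.resize(old_len + src.shape[0], axis=0)
            dst[old_len:] = src[...]


# Demo with the sample values from the question, one dataset per column.
tmp_dir = tempfile.mkdtemp()
dst_file = os.path.join(tmp_dir, "cols1.h5")
src_file = os.path.join(tmp_dir, "cols2.h5")
data1 = {"col1": [1, 5, 6, 7, 2], "col2": [3, 4, 8, 2, 1], "col3": [5, 9, 0, 5, 2]}
data2 = {"col1": [6, 8], "col2": [1, 2], "col3": [9, 7]}

with h5py.File(dst_file, "w") as h5f:
    for name, values in data1.items():
        # maxshape=(None,) makes each column resizable along axis 0.
        h5f.create_dataset(name, data=np.array(values), maxshape=(None,))
with h5py.File(src_file, "w") as h5f:
    for name, values in data2.items():
        h5f.create_dataset(name, data=np.array(values))

append_matching_datasets(dst_file, src_file)
with h5py.File(dst_file, "r") as h5f:
    result = {name: h5f[name][:].tolist() for name in h5f.keys()}
print(result)
```

If any of the target datasets was created without maxshape=, the resize call raises an error, so the maxshape check from the last comment applies to every column.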