Python: Creating a reference to an HDF dataset in h5py using astype


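From the h5py docs, I see that I can cast an HDF dataset to another type using the dataset's astype method. This returns a context manager that performs the conversion on the fly.

However, I would like to read in a dataset stored as uint16 and then cast it to float32. After that, I would like to extract various slices from this dataset in another function, as the cast type float32. The docs explain the usage as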

with dataset.astype('float32'):
   castdata = dataset[:]

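This causes the entire dataset to be read in and converted to float32, which is not what I want. I would like to have a reference to the dataset that is cast to float32, equivalent to what numpy.astype gives. How do I create a reference to the .astype('float32') object so that I can pass it to another function for use?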

An example:

import h5py as HDF
import numpy as np
intdata = (100*np.random.random(10)).astype('uint16')

# create the HDF dataset
def get_dataset_as_float():
    hf = HDF.File('data.h5', 'w')
    d = hf.create_dataset('data', data=intdata)
    print(d.dtype)
    # uint16

    with d.astype('float32'):
    # This won't work since the context expires. Returns a uint16 dataset reference
       return d

    # this works but causes the entire dataset to be read & converted
    # with d.astype('float32'):
    #   return d[:]

Moreover, the astype context seems to apply only when the data elements are accessed. This means that

def use_data():
   d = get_dataset_as_float()
   # this is a uint16 dataset

   # try to use it as a float32
   with d.astype('float32'):
       print(np.max(d))   # --> output is uint16
       print(np.max(d[:]))   # --> output is float32, but entire data is loaded

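So, is there no numpy-style way of using astype?

The docs for astype seem to imply that reading everything into a new location is its purpose. Returning d[:] is therefore the most reasonable option if you are going to reuse the float cast with many functions on separate occasions.

If you know what you need the cast for and only need it once, you could switch things around and do something like this: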

def get_dataset_as_float(intdata, *funcs):
    with HDF.File('data.h5', 'w') as hf:
        d = hf.create_dataset('data', data=intdata)
        with d.astype('float32'):
            d2 = d[...]
            return tuple(f(d2) for f in funcs)

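In any case, you need to make sure that hf is closed before leaving the function, otherwise you will run into problems later on.

In general, I would suggest separating the casting from the loading/creation of the dataset entirely, and passing the dataset in as one of the function's parameters.

The function above can be called as follows: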

In [16]: get_dataset_as_float(intdata, np.min, np.max, np.mean)
Out[16]: (9.0, 87.0, 42.299999)
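
The suggestion to keep casting separate from loading, and to pass the dataset in as a parameter, might be sketched roughly as below. The helper names read_slice_as, slice_stats and example_with_h5py are hypothetical and not part of h5py; the sketch assumes only that the dataset supports numpy-style slicing, which h5py datasets do. Only the requested slice is read and cast, so the full dataset never has to be converted.

```python
import numpy as np


def read_slice_as(dset, key, dtype="float32"):
    """Read only dset[key] and cast that piece to dtype (hypothetical helper)."""
    # Slicing the dataset reads just the selection; the cast happens in memory.
    return np.asarray(dset[key]).astype(dtype)


def slice_stats(dset, key, dtype="float32"):
    """Example consumer: works on a cast slice, knows nothing about files."""
    chunk = read_slice_as(dset, key, dtype)
    return chunk.min(), chunk.max(), chunk.mean()


def example_with_h5py(path="data.h5"):
    """Create a small uint16 dataset and analyse one slice of it as float32."""
    import h5py  # imported here so the helpers above only depend on numpy

    intdata = (100 * np.random.random(10)).astype("uint16")
    # The with-block guarantees the file is closed when the function returns.
    with h5py.File(path, "w") as hf:
        d = hf.create_dataset("data", data=intdata)
        return slice_stats(d, slice(2, 6))
```

Calling example_with_h5py() would write data.h5 and return the minimum, maximum and mean of elements 2 to 5 as float32 values, while the stored data stays uint16.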

d.astype() returns an AstypeContext object. If you look at the source code for AstypeContext, you will get a better idea of what is going on:

class AstypeContext(object):

    def __init__(self, dset, dtype):
        self._dset = dset
        self._dtype = numpy.dtype(dtype)

    def __enter__(self):
        self._dset._local.astype = self._dtype

    def __exit__(self, *args):
        self._dset._local.astype = None

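When you enter the AstypeContext, the dataset's ._local.astype attribute is updated to the new desired type, and when you exit the context it is reset (to None, in the code above).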
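
To see why returning d from inside the with block loses the cast, here is a toy model of the same set-on-enter, reset-on-exit pattern. ToyDataset and ToyAstypeContext are illustrative stand-ins written for this example, not h5py classes, and they use plain Python lists instead of HDF5 data.

```python
import threading


class ToyAstypeContext:
    """Mimics the enter/exit behaviour of the AstypeContext shown above."""

    def __init__(self, dset, cast):
        self._dset = dset
        self._cast = cast

    def __enter__(self):
        self._dset._local.astype = self._cast

    def __exit__(self, *args):
        self._dset._local.astype = None


class ToyDataset:
    """Stand-in dataset: applies the active cast only at read time."""

    def __init__(self, values):
        self._values = list(values)
        self._local = threading.local()
        self._local.astype = None

    def astype(self, cast):
        return ToyAstypeContext(self, cast)

    def __getitem__(self, key):
        picked = self._values[key]
        cast = getattr(self._local, "astype", None)
        if cast is None:
            return picked
        if isinstance(key, slice):
            return [cast(v) for v in picked]
        return cast(picked)


def get_inside_with(d):
    with d.astype(float):
        # __exit__ still runs when we return, so the cast is reset.
        return d


d = ToyDataset([1, 2, 3])
with d.astype(float):
    inside = d[:2]               # read while the context is active
outside = get_inside_with(d)[:2]  # read after the context has exited

print(inside)   # [1.0, 2.0]
print(outside)  # [1, 2]
```

The cast lives on the dataset object only for the duration of the with block, which is why it has to be active at the moment the data is actually read.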

You can therefore get more or less the behaviour you want like this:

def get_dataset_as_type(d, dtype='float32'):

    # creates a new Dataset instance that points to the same HDF5 identifier
    d_new = HDF.Dataset(d.id)

    # set the ._local.astype attribute to the desired output type
    d_new._local.astype = np.dtype(dtype)

    return d_new

When you now read from d_new, you get float32 numpy arrays back rather than uint16:

d = hf.create_dataset('data', data=intdata)
d_new = get_dataset_as_type(d, dtype='float32')

print(d[:])
# array([81, 65, 33, 22, 67, 57, 94, 63, 89, 68], dtype=uint16)
print(d_new[:])
# array([ 81.,  65.,  33.,  22.,  67.,  57.,  94.,  63.,  89.,  68.], dtype=float32)

print(d.dtype, d_new.dtype)
# uint16, uint16

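Note that this does not update the .dtype attribute of d_new (which appears to be immutable). If you also want to change the dtype attribute, you would probably need to subclass h5py.Dataset to do so.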
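
As an alternative to setting the private ._local attribute or subclassing h5py.Dataset, a thin wrapper can hold the dataset reference, report the target dtype, and cast each selection as it is read. This is only a minimal sketch: the class name LazyCastView is hypothetical and not part of h5py, and it assumes nothing about the wrapped object beyond numpy-style slicing, shape and len().

```python
import numpy as np


class LazyCastView:
    """Hypothetical helper: a reference to a dataset that casts slices on read."""

    def __init__(self, dset, dtype):
        self._dset = dset
        # Unlike the approach above, the wrapper can report the cast dtype.
        self.dtype = np.dtype(dtype)

    @property
    def shape(self):
        return self._dset.shape

    def __len__(self):
        return len(self._dset)

    def __getitem__(self, key):
        # Only the requested selection is read from the underlying dataset;
        # the conversion to the target dtype then happens in memory.
        return np.asarray(self._dset[key]).astype(self.dtype)
```

With an open h5py dataset d, LazyCastView(d, 'float32')[2:6] would read four elements and return them as float32, and the view can be passed to other functions like any sliceable object. If I recall correctly, h5py 3.0 and later also allow slicing the object returned by astype directly, e.g. d.astype('float32')[2:6], which would make a wrapper like this unnecessary there.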

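I don't think np.max(d) is doing anything especially clever here. Since the dataset has no max() method of its own, np.max() reads the array into memory and calls np.core.umath.maximum.reduce(), using d.dtype to set the output type. The timings for np.max(d) and np.max(d[:]) are almost identical.

@ali_m You may well be right. I only picked np.max as a way of checking whether an operation on the array returns the cast data type. It is not important for my calculations; I will mostly be extracting slices to work with. I did look at AstypeContext, but was not sure whether setting the dtype myself would have any undesirable consequences. I will do some testing and come back to this answer. Thanks.

@arjmage I think it should be fine. d._local is a thread-local object, so your changes should be thread-safe. You can see that d._local.astype is only used to set the dtype of the output numpy array that the data is read into, and does not affect d.id, which is the identifier of the actual HDF5 object.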