Python: speeding up writing hundreds of 3D numpy arrays to an HDF5 file


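The application I am developing converts a proprietary TIFF-based file format (Nikon nd2 files) containing multiple images (fields of view), planes (z) and fluorescence channels into numpy arrays, which are then saved in an HDF5 file. A typical dataset has 50 fields of view (fov), each with 5 channels, and each channel has 40 z planes. The whole file is about 6 GB.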

This is the code I wrote:

Steps: 0) Import all the required libraries

import nd2reader as nd2
from matplotlib import pyplot as plt
import numpy as np
import h5py as h5
import itertools
import ast
import glob as glob
from joblib import Parallel, delayed
import time

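1) Function that runs the conversion of the nd2 file. The conversion to numpy arrays is done with the Python package nd2reader and is fast. To reduce the number of loops and to use a list comprehension, I build a list of tuples, each containing a channel and a fov. Example: [('DAPI', 0), ('DAPI', 1)], where DAPI is the channel and the number is the fov.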
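One way to build such a list of (channel, fov) tuples is itertools.product, which the script already imports. The following is only a sketch: the channel names, the number of fields of view and the variable name channels_fields are made up for illustration and are not taken from the original code.

```python
import itertools

# Hypothetical channel names and field-of-view count, for illustration only.
channels = ['DAPI', 'FITC']
fields_of_view = range(3)

# One (channel, fov) tuple per 3D stack that has to be built and saved.
channels_fields = list(itertools.product(channels, fields_of_view))

print(channels_fields[:2])  # [('DAPI', 0), ('DAPI', 1)]
```

Each tuple can then be unpacked into the channel and fov arguments of the function that saves one stack.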

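Note: the experiment channel list (ExperimentChannelList in the code) comes from a file containing a dictionary that matches each channel (key) with the gene of interest (value).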
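The code that reads this file is not shown. Since the script imports ast, one plausible way to load it is ast.literal_eval; this is an assumption, and the file content and gene names below are invented for illustration.

```python
import ast

# Hypothetical file content: a Python dict literal mapping channel -> gene.
text = "{'DAPI': 'Nuclei', 'FITC': 'GeneA'}"

# literal_eval only parses Python literals, so it is safer than eval()
# for reading a dictionary stored as text.
experiment_channel_list = ast.literal_eval(text)

gene = experiment_channel_list['DAPI']
print(gene)  # Nuclei
```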

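2) Function that combines the images into a 3D array and then writes it to the HDF5 file. I use h5py. I write each 3D numpy array to disk immediately after generating it.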

def SaveImg(DataFile,ImgRef,ExperimentChannelList,ImgStack,*args):
        channel=args[0]
        fov=args[1]
        for idx,image in enumerate(ImgRef.select(channels=channel,z_levels=ImgRef.z_levels,fields_of_view=fov)):
            ImgStack[idx,:,:]=image
        gene=ExperimentChannelList[channel]
        ChannelGroup=DataFile.require_group(gene)
        FovDataSet=ChannelGroup.create_dataset(str(fov), data=ImgStack,dtype=np.float64,compression="gzip")

3) Body of the script, with the joblib call that processes all the files in the directory in parallel

if __name__=='__main__':

    # Run the
    # Directory where ND2 file is stored (Ex. User/Data/)
    WorkingDirectory=input('Enter the directory with the files to process (ex. /User/):  ')
    #WorkingDirectory='/Users/simone/Box Sync/test/ND2conversion/'
    NumberOfProcesses=int(input('Enter the number of processes to use:  '))
    #NumberOfProcesses=2
    FileExt='nd2'
    # Iterator with the name of the files to process
    FilesIter=glob.iglob(WorkingDirectory+'*.'+FileExt)  

    now = time.time()
    Parallel(n_jobs=NumberOfProcesses,verbose=5)(delayed(ConvertND2File)(ND2file) for ND2file in FilesIter)
    print("Finished in", time.time()-now , "sec")

Runtime: total time to convert two 5.9 GB files
[Parallel(n_jobs=2)]: Done 1 out of 2 | elapsed: 7.4min remaining: 7.4min
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 7.4min finished
Finished in 444.8717038631439 sec

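Question: Is there a better way to handle the IO to the HDF5 file in order to speed up the conversion? If I scale the process up, I will not be able to keep all the 3D numpy arrays (fov) in memory and write them only after each channel has been processed.

Thanks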
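One option that keeps memory use low is to preallocate the dataset in the HDF5 file and assign one z plane at a time, so that no full 3D stack has to be held in memory. This is a minimal sketch on synthetic data, not the original conversion code: the function name write_stack_incrementally, the dataset name and the array sizes are hypothetical, and the one-plane-per-chunk layout is an assumption that may need tuning for real images.

```python
import os
import tempfile

import h5py
import numpy as np


def write_stack_incrementally(path, name, planes, shape, dtype=np.float64):
    """Write 2D planes one at a time into a preallocated 3D dataset."""
    with h5py.File(path, 'a') as f:
        # One chunk per z plane, so each assignment touches exactly one chunk.
        ds = f.create_dataset(name, shape=shape, dtype=dtype,
                              chunks=(1,) + tuple(shape[1:]),
                              compression='gzip')
        for idx, plane in enumerate(planes):
            ds[idx, :, :] = plane


# Synthetic stand-in for the planes of one field of view (4 planes of 8x8).
planes = (np.full((8, 8), z, dtype=np.float64) for z in range(4))
path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
write_stack_incrementally(path, 'DAPI/0', planes, shape=(4, 8, 8))

# Read the dataset back to check what was stored.
with h5py.File(path, 'r') as f:
    stored = f['DAPI/0'][...]
```

Because the planes are consumed from a generator, only one 2D image is in memory at a time; h5py creates the intermediate group (here DAPI) automatically.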

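Comments:

A minor point: using a list comprehension (for x in ...) over the channel/fov tuples works, but it is less readable than a regular for loop. Since the SaveImg step is (probably) relatively expensive, the list comprehension does not save any time.

Why do you initialize ImgStack outside SaveImg? There is no carryover from one call of SaveImg to the next, is there? If my questions misread the code, it is partly because I have not studied it carefully and partly because the code is a bit obscure.

You could take a look at PyTables. It is designed to handle datasets larger than the available RAM.

Answer:

Thanks for your help! I modified the code as suggested by @hpaulj. After reading the PyTables documentation, I changed the compression algorithm from gzip to lzf. The conversion time dropped from 7.4 minutes to 1.8 minutes, which is a very big improvement.

[Parallel(n_jobs=2)]: Done 1 out of 2 | elapsed: 1.8min remaining: 1.8min
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 1.8min finished
Finished in 109.01764011383057 sec

The body of the script is unchanged: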

if __name__=='__main__':

    # Run the
    # Directory where ND2 file is stored (Ex. User/Data/)
    WorkingDirectory=input('Enter the directory with the files to process (ex. /User/):  ')
    #WorkingDirectory='/Users/simone/Box Sync/test/ND2conversion/'
    NumberOfProcesses=int(input('Enter the number of processes to use:  '))
    #NumberOfProcesses=2
    FileExt='nd2'
    # Iterator with the name of the files to process
    FilesIter=glob.iglob(WorkingDirectory+'*.'+FileExt)  

    now = time.time()
    Parallel(n_jobs=NumberOfProcesses,verbose=5)(delayed(ConvertND2File)(ND2file) for ND2file in FilesIter)
    print("Finished in", time.time()-now , "sec")
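The switch from gzip to lzf reported above can be tried on synthetic data before touching a real conversion. The sketch below times both filters on a random stack; the function name timed_write, the dataset name and the array size are hypothetical, and real microscopy images will compress differently from random numbers, so the timings are only indicative.

```python
import os
import tempfile
import time

import h5py
import numpy as np


def timed_write(path, data, compression):
    """Write data to a new HDF5 file and return the elapsed seconds."""
    start = time.perf_counter()
    with h5py.File(path, 'w') as f:
        f.create_dataset('stack', data=data, compression=compression)
    return time.perf_counter() - start


# Synthetic stack: 40 planes of 256x256, stored as float64 like in SaveImg.
rng = np.random.default_rng(0)
stack = rng.integers(0, 4096, size=(40, 256, 256)).astype(np.float64)

tmp = tempfile.mkdtemp()
results = {}
for comp in ('gzip', 'lzf'):
    path = os.path.join(tmp, comp + '.h5')
    seconds = timed_write(path, stack, comp)
    results[comp] = (seconds, os.path.getsize(path))
    print(comp, 'seconds: %.2f' % seconds, 'bytes:', results[comp][1])
```

lzf ships with every h5py installation and is generally much faster than gzip at the cost of a lower compression ratio; files written with it need the LZF filter to be read by other HDF5 tools.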