Python: a fair and realistic comparison using dask
To better understand the dask library in Python, I am trying to make a fair comparison between running with and without dask. I created a large dataset using h5py, which I later use to measure the mean along one of its axes as a numpy-style operation.

I would like to know whether what I did is a fair comparison to check if dask can run the code in parallel. I have been reading the documentation of both h5py and dask, and this is the little experiment I came up with.

What I have done so far is:

Create (write) the dataset with h5py. This is done with the maxshape and resize approach, to avoid loading the whole data into memory at once and thus avoid memory problems.

# Write h5 dataset
import time
import h5py
import numpy as np

chunks = (100, 500, 2)
tp = time.time()
with h5py.File('path/3D_matrix_1.hdf5', 'w') as f:
    # create a 3D dataset inside one h5py file
    dset = f.create_dataset('3D_matrix', (10000, 5000, 2), chunks=chunks,
                            maxshape=(None, 5000, 2), compression='gzip')  # to append data on axis 0
    print(dset.shape)
    while dset.shape[0] < 4*10**7:  # append data until axis 0 = 4*10**7
        dset.resize(dset.shape[0] + 10**4, axis=0)  # resize data
        print(dset.shape)  # check new shape for each append
        dset[-10**4:] = np.random.randint(2, size=(10**4, 5000, 2))
tmp = time.time() - tp
print('Writing time: {}'.format(tmp))
I think I am missing something, since the dask execution seems to take more time than the classical approach. Also, I suspect the chunks option could be the cause, because I used the same chunk size for both the h5 dataset and dask.

Any suggestion about this procedure is welcome.
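To illustrate the chunk-size concern, here is a minimal sketch (the shape and chunk values are only assumptions mirroring the code in this question) showing how picking dask chunks that are a multiple of the on-disk h5py chunks produces fewer, larger tasks:

```python
import dask.array as da

# Hypothetical illustration: dask chunks that are a multiple of the
# on-disk h5py chunks (100, 500, 2) mean fewer, larger tasks.
h5_chunks = (100, 500, 2)
dask_chunks = (1000, 500, 2)  # each dask task covers 10 h5py chunks along axis 0
x = da.random.random((10000, 5000, 2), chunks=dask_chunks)
print(x.npartitions)  # number of lazy blocks; nothing is computed yet
```

With `dask_chunks = h5_chunks` the same array would split into ten times as many blocks along axis 0, and the scheduler overhead per block could dominate a cheap reduction like `mean`.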
# Classical read
tp = time.time()
filename = 'path/3D_matrix_1.hdf5'
with h5py.File(filename, mode='r') as f:
    # List all groups (actually there is only one)
    a_group_key = list(f.keys())[0]  # the only dataset in the h5 file
    # Get the data
    result = f.get(a_group_key)
    print(result.shape)
    #print(type(result))
    # read in slices of 1000 elements
    start_ = 0  # initialize a start counter
    means = []
    while start_ < result.shape[0]:
        arr = np.array(result[start_:start_ + 1000])
        m = arr.mean()
        #print(m)
        means.append(m)
        start_ += 1000
    final_mean = np.array(means).mean()
    print(final_mean, len(means))  # final_mean is a scalar, so len() must go on the list
tmp = time.time() - tp
print('Total reading and measuring time without dask: {:.2f}'.format(tmp))
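As a side note on the chunked averaging above: the mean of per-chunk means equals the global mean only when every chunk has the same length, which holds here because the axis-0 length divides evenly by 1000. A tiny self-contained check (the array and chunk size are arbitrary, chosen only so the chunks divide evenly):

```python
import numpy as np

# Equal-sized chunks: the mean of chunk means equals the global mean.
data = np.arange(12.0)
chunk_means = [data[i:i + 4].mean() for i in range(0, 12, 4)]
print(np.mean(chunk_means), data.mean())  # both 5.5
assert np.isclose(np.mean(chunk_means), data.mean())
```

If the last chunk were shorter than the others, the per-chunk means would need to be weighted by chunk length to stay exact.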
# Dask way
import dask.array as da
from dask import delayed

tp = time.time()
filename = 'path/3D_matrix_1.hdf5'
dset = h5py.File(filename, 'r')
dataset_name = list(dset.keys())[0]  # to obtain the dataset name
result = dset.get(dataset_name)
array = da.from_array(result, chunks=chunks)  # should this be parallelized with delayed?
print('Gigabytes of input: {}'.format(array.nbytes / 1e9))  # gigabytes of the input, processed lazily
x = delayed(array.mean(axis=0))  # use delayed to parallelize (kind of...)
print('Mean array: {}'.format(x.compute()))
tmp = time.time() - tp
print('Total reading and measuring time with dask: {:.2f}'.format(tmp))
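For what it's worth, a dask array's mean already builds a lazy task graph on its own, so wrapping it in delayed should not be necessary; a minimal sketch on a small in-memory array (the shape and chunks here are arbitrary, for illustration only):

```python
import numpy as np
import dask.array as da

data = np.random.randint(2, size=(1000, 50, 2))
arr = da.from_array(data, chunks=(100, 50, 2))
mean = arr.mean(axis=0).compute()  # .compute() triggers the parallel reduction
print(mean.shape)  # (50, 2), matching numpy's data.mean(axis=0)
assert np.allclose(mean, data.mean(axis=0))
```

Whether this is faster than the classical loop for the full file is exactly the question above; the sketch only shows that `delayed` is not required to get a parallel `mean`.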