Python: a fair and realistic comparison using dask
To better understand the dask library in Python, I am trying to make a fair comparison between running with and without dask. I created a large dataset using h5py, which I later use to measure the mean along one of its axes as a numpy-style operation.

I would like to know whether what I did is a fair comparison to check if dask can run the code in parallel. I have been reading the documentation of both h5py and dask, and this is the little experiment I came up with.

What I have done so far is:

Create (write) the dataset with h5py. This is done with the maxshape and resize approach, to avoid loading the whole data into memory at once and thus avoid memory problems.

# Write h5 dataset
import time
import h5py
import numpy as np

chunks = (100, 500, 2)
tp = time.time()
with h5py.File('path/3D_matrix_1.hdf5', 'w') as f:
    # create a 3D dataset inside one h5py file
    dset = f.create_dataset('3D_matrix', (10000, 5000, 2), chunks=chunks,
                            maxshape=(None, 5000, 2), compression='gzip')  # to append data on axis 0
    print(dset.shape)
    while dset.shape[0] < 4*10**7:  # append data until axis 0 = 4*10**7
        dset.resize(dset.shape[0] + 10**4, axis=0)  # resize data
        print(dset.shape)  # check new shape for each append
        dset[-10**4:] = np.random.randint(2, size=(10**4, 5000, 2))
tmp = time.time() - tp
print('Writing time: {}'.format(tmp))
I think I am missing something, since the dask execution seems to take more time than the classical approach. Also, I suspect the chunks option could be the cause, because I used the same chunk size for both the h5 dataset and dask.

Any suggestion about this procedure is welcome.
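To illustrate the chunk-size concern, here is a minimal sketch (the shape and chunk values are only assumptions mirroring the code in this question) showing how picking dask chunks that are a multiple of the on-disk h5py chunks produces fewer, larger tasks:

```python
import dask.array as da

# Hypothetical illustration: dask chunks that are a multiple of the
# on-disk h5py chunks (100, 500, 2) mean fewer, larger tasks.
h5_chunks = (100, 500, 2)
dask_chunks = (1000, 500, 2)  # each dask task covers 10 h5py chunks along axis 0
x = da.random.random((10000, 5000, 2), chunks=dask_chunks)
print(x.npartitions)  # number of lazy blocks; nothing is computed yet
```

With `dask_chunks = h5_chunks` the same array would split into ten times as many blocks along axis 0, and the scheduler overhead per block could dominate a cheap reduction like `mean`.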
# Classical read
tp = time.time()
filename = 'path/3D_matrix_1.hdf5'
with h5py.File(filename, mode='r') as f:
    # List all groups (actually there is only one)
    a_group_key = list(f.keys())[0]  # the only dataset in the h5 file
    # Get the data
    result = f.get(a_group_key)
    print(result.shape)
    #print(type(result))
    # read in slices of 1000 elements
    start_ = 0  # initialize a start counter
    means = []
    while start_ < result.shape[0]:
        arr = np.array(result[start_:start_ + 1000])
        m = arr.mean()
        #print(m)
        means.append(m)
        start_ += 1000
    final_mean = np.array(means).mean()
    print(final_mean, len(means))  # final_mean is a scalar, so len() must go on the list
tmp = time.time() - tp
print('Total reading and measuring time without dask: {:.2f}'.format(tmp))
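As a side note on the chunked averaging above: the mean of per-chunk means equals the global mean only when every chunk has the same length, which holds here because the axis-0 length divides evenly by 1000. A tiny self-contained check (the array and chunk size are arbitrary, chosen only so the chunks divide evenly):

```python
import numpy as np

# Equal-sized chunks: the mean of chunk means equals the global mean.
data = np.arange(12.0)
chunk_means = [data[i:i + 4].mean() for i in range(0, 12, 4)]
print(np.mean(chunk_means), data.mean())  # both 5.5
assert np.isclose(np.mean(chunk_means), data.mean())
```

If the last chunk were shorter than the others, the per-chunk means would need to be weighted by chunk length to stay exact.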
# Dask way
import dask.array as da
from dask import delayed

tp = time.time()
filename = 'path/3D_matrix_1.hdf5'
dset = h5py.File(filename, 'r')
dataset_name = list(dset.keys())[0]  # to obtain the dataset name
result = dset.get(dataset_name)
array = da.from_array(result, chunks=chunks)  # should this be parallelized with delayed?
print('Gigabytes of input: {}'.format(array.nbytes / 1e9))  # gigabytes of the input, processed lazily
x = delayed(array.mean(axis=0))  # use delayed to parallelize (kind of...)
print('Mean array: {}'.format(x.compute()))
tmp = time.time() - tp
print('Total reading and measuring time with dask: {:.2f}'.format(tmp))
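For what it's worth, a dask array's mean already builds a lazy task graph on its own, so wrapping it in delayed should not be necessary; a minimal sketch on a small in-memory array (the shape and chunks here are arbitrary, for illustration only):

```python
import numpy as np
import dask.array as da

data = np.random.randint(2, size=(1000, 50, 2))
arr = da.from_array(data, chunks=(100, 50, 2))
mean = arr.mean(axis=0).compute()  # .compute() triggers the parallel reduction
print(mean.shape)  # (50, 2), matching numpy's data.mean(axis=0)
assert np.allclose(mean, data.mean(axis=0))
```

Whether this is faster than the classical loop for the full file is exactly the question above; the sketch only shows that `delayed` is not required to get a parallel `mean`.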