
Python: a fair and realistic comparison using dask


To get a better understanding of the dask library in Python, I am trying to make a fair comparison between using dask and not using it. I created a large dataset with h5py that is later used to measure the mean along one of its axes, as a numpy-style operation.

I would like to know whether what I have done is a fair comparison to check if dask can run the code in parallel. I have been reading the documentation of both h5py and dask, and this is the small experiment I came up with.

What I have done so far is:

  • Create (write) the dataset with h5py. This is done with the maxshape and resize approach so that the whole data is never loaded into memory at once, avoiding memory problems.

  • Estimate a simple operation along one of the axes (measuring the mean) with "classical" code, meaning the mean is estimated every 1000 rows.

  • Repeat the previous step, but this time with dask.

  • This is what I have so far for the first step:

    # Write the h5 dataset
    import time
    import h5py
    import numpy as np

    chunks = (100, 500, 2)
    tp = time.time()
    with h5py.File('path/3D_matrix_1.hdf5', 'w') as f:
        # create a 3D dataset inside one h5py file
        dset = f.create_dataset('3D_matrix', (10000, 5000, 2), chunks=chunks,
                                maxshape=(None, 5000, 2), compression='gzip')  # to append data on axis 0
        print(dset.shape)
        while dset.shape[0] < 4*10**7:  # append data until axis 0 reaches 4*10**7
            dset.resize(dset.shape[0] + 10**4, axis=0)  # grow the dataset along axis 0
            print(dset.shape)  # check the new shape after each append
            dset[-10**4:] = np.random.randint(2, size=(10**4, 5000, 2))
        tmp = time.time() - tp
        print('Writing time: {}'.format(tmp))
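
    As a quick sanity check (a sketch, not part of the original timing code), the file can be reopened read-only to confirm the final shape and the on-disk chunk layout after all the appends:

    import h5py

    with h5py.File('path/3D_matrix_1.hdf5', 'r') as f:
        print(f['3D_matrix'].shape)   # the first axis should have grown to 4*10**7
        print(f['3D_matrix'].chunks)  # on-disk chunk layout: (100, 500, 2)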
    
    I think I am missing some part of the procedure, since the dask execution seems to take more time than the classical approach. Also, I think the chunk option could be the reason, because I used the same chunk size for both the h5 dataset and dask.

    Any suggestion about this procedure is welcome.

    # Classical read
    import time
    import h5py
    import numpy as np

    tp = time.time()
    filename = 'path/3D_matrix_1.hdf5'
    with h5py.File(filename, mode='r') as f:
        # List all groups (actually there is only one)
        a_group_key = list(f.keys())[0]  # name of the only dataset in the h5 file
        # Get the data
        result = f.get(a_group_key)
        print(result.shape)
        # read the dataset 1000 rows at a time
        start_ = 0  # initialize a start counter
        means = []
        while start_ < result.shape[0]:
            arr = np.array(result[start_:start_ + 1000])
            m = arr.mean()
            means.append(m)
            start_ += 1000
        final_mean = np.array(means).mean()
        print(final_mean, len(means))
        tmp = time.time() - tp
        print('Total reading and measuring time without dask: {:.2f}'.format(tmp))
    
    # Dask way
    import time
    import h5py
    import dask.array as da
    from dask import delayed

    tp = time.time()
    filename = 'path/3D_matrix_1.hdf5'
    chunks = (100, 500, 2)  # same chunks used when writing the dataset
    dset = h5py.File(filename, 'r')
    dataset_name = list(dset.keys())[0]  # to obtain the dataset name
    result = dset.get(dataset_name)
    array = da.from_array(result, chunks=chunks)  # should this be parallelized with delayed?
    print('Gigabytes of the input: {}'.format(array.nbytes / 1e9))  # gigabytes of the input, processed lazily
    x = delayed(array.mean(axis=0))  # use delayed to parallelize (kind of...)
    print('Mean array: {}'.format(x.compute()))
    tmp = time.time() - tp
    print('Total reading and measuring time with dask: {:.2f}'.format(tmp))
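
    For reference, below is a minimal sketch (not part of the original code) of how the dask step could be written without delayed, since dask arrays are already lazy and .compute() is what triggers the parallel execution. The chunk size (1000, 5000, 2) is an assumption chosen to mirror the 1000-row blocks of the classical loop, and arr.mean() (without axis=0) is used so that the result is a scalar directly comparable with the classical version:

    import time
    import h5py
    import dask.array as da

    tp = time.time()
    with h5py.File('path/3D_matrix_1.hdf5', 'r') as f:
        dset = f['3D_matrix']
        # wrap the on-disk dataset in a lazy dask array; the chunk size here is an assumption
        arr = da.from_array(dset, chunks=(1000, 5000, 2))
        # build the task graph for the overall mean and run it in parallel
        overall_mean = arr.mean().compute()
        print('Mean: {}'.format(overall_mean))
    print('Total reading and measuring time with dask: {:.2f}'.format(time.time() - tp))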