Accessing a single element of a large published array with Dask

dask, dask-delayed, dask-distributed

Is there a faster way to retrieve just a single element of a large published array with Dask, without retrieving the entire array?

In the example below, client.get_dataset('array1')[0] takes roughly the same time as client.get_dataset('array1').

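import distributed

client = distributed.Client()
data = [1] * 10000000  # plain Python list with 1e7 elements
payload = {'array1': data}
client.publish_dataset(**payload)  # the list itself is sent to the scheduler

# get_dataset returns the whole list; the indexing only happens afterwards, on the client
one_element = client.get_dataset('array1')[0]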
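The timing claim in the question can be checked with a small sketch like the one below. It is only an illustration under stated assumptions: the helper name `time_retrievals` is hypothetical, the cluster is an in-process one started with `processes=False` (so transfer costs will look different from a real multi-machine deployment), and the array is smaller than in the question to keep the run short. It assumes the list is published with `Client.publish_dataset`.

```python
import time

from distributed import Client


def time_retrievals(n=1_000_000):
    """Time fetching a whole published list versus fetching it and indexing [0].

    Hypothetical helper for illustration; not part of the original question.
    """
    # Assumption: a small in-process cluster is enough to show the pattern.
    with Client(processes=False, n_workers=1, threads_per_worker=1,
                dashboard_address=None) as client:
        client.publish_dataset(array1=[1] * n)

        start = time.perf_counter()
        whole = client.get_dataset("array1")     # transfers the full list
        t_whole = time.perf_counter() - start

        start = time.perf_counter()
        first = client.get_dataset("array1")[0]  # also transfers the full list, then indexes
        t_first = time.perf_counter() - start

        client.unpublish_dataset("array1")
    return len(whole), first, t_whole, t_first


if __name__ == "__main__":
    length, first, t_whole, t_first = time_retrievals()
    print(f"whole list: {t_whole:.4f}s, first element: {t_first:.4f}s")
```

Both calls go through the same `get_dataset` round trip, which is why indexing afterwards cannot save any time.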

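Note that anything you publish is sent to the scheduler, not to the workers, so this is somewhat inefficient. Publishing was designed to be used with Dask collections such as dask.array.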

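Client 1

import dask.array as da
x = da.ones(10000000, chunks=(100000,))  # 1e7 size array cut into 1e5 size chunks
x = x.persist()  # persist array on the workers of the cluster

client.publish_dataset(x=x)  # store the metadata of x on the scheduler

Client 2

x = client.get_dataset('x')  # get the lazy collection x
x[0].compute()  # this selection happens on the worker, only the result comes down

I wanted to mark this question as resolved, but when I run this snippet using the suggested approach, it hangs on [0].compute(). For the actual snippet I'm using, see the link below:

I was able to get the snippet you posted to work. To make it work, I had to make sure that workers were also created for the scheduler in my version. For anyone else, see my link above for more information.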