How does Pandas (Dask) distribute expensive resources needed for a computation?
Tags: pandas, dask, python-3.7, dask-distributed

What is the best way to distribute tasks across a dataset when the computation requires a resource or object that is relatively expensive to create?
# in pandas
df = pd.read_csv(...)
foo = Foo()  # expensive initialization
result = df.apply(lambda x: foo.do(x))
# in dask?
# is it possible to scatter the foo to the workers?
client.scatter(...
I plan to use this with dask_jobqueue's SGECluster. Is there a way to use it within the client.map(do, df) framework? Or is that effectively the same thing?
import dask

foo = dask.delayed(Foo)()  # create your expensive thing on the workers instead of locally

def do(row, foo):
    return foo.do(row)

df.apply(do, foo=foo)  # include it as an explicit argument, not a closure within a lambda