Python: creating Dask delayed with a lock fails with "AttributeError: '_thread._local' object has no attribute 'execution_state'"
I want to create a Dask array made of several chunks, where each chunk comes from a function that reads a file. To avoid reading several files from the hard disk at the same time, I followed an earlier answer and used a lock, but creating the delayed objects raises the following error:
AttributeError: '_thread._local' object has no attribute 'execution_state'
Full error message from my test script `test_lock.py`:
Traceback (most recent call last):
  File "<...>/site-packages/distributed/worker.py", line 2536, in get_worker
    return thread_state.execution_state['worker']
AttributeError: '_thread._local' object has no attribute 'execution_state'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_lock.py", line 32, in <module>
    main()
  File "test_lock.py", line 30, in main
    ds = make_delayed()
  File "test_lock.py", line 25, in make_delayed
    read_lock = distributed.Lock('numpy-read')
  File "<...>/site-packages/distributed/lock.py", line 92, in __init__
    self.client = client or _get_global_client() or get_worker().client
  File "<...>/site-packages/distributed/worker.py", line 2542, in get_worker
    raise ValueError("No workers found")
ValueError: No workers found
Try this:
@dask.delayed
def load_numpy(fn):
    # Create the lock inside the task, where a worker (and its client)
    # is available.
    lock = distributed.Lock('numpy-read')
    lock.acquire()
    out = np.load(fn)
    lock.release()
    return out

def make_delayed():
    # np.load reads a '.npy' file and returns a numpy array.
    # Note: the lock is no longer created here on the client side,
    # which is what triggered the error above.
    return [load_numpy('%d.npy' % i) for i in range(2)]
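As the comments below point out, `distributed.Lock` needs a distributed scheduler to coordinate through, so a `Client` must exist before the graph is computed. A minimal sketch of the whole flow; the file name `0.npy` and the in-process `processes=False` cluster are illustrative choices, not part of the original post:

```python
import numpy as np
import dask
import distributed

@dask.delayed
def load_numpy(fn):
    # Created inside the task: here get_worker() succeeds, so the
    # Lock can reach the client that talks to the scheduler.
    lock = distributed.Lock('numpy-read')
    with lock:  # Lock also works as a context manager
        return np.load(fn)

# Without a Client there is no scheduler for the Lock to use,
# which is what raised "ValueError: No workers found" above.
client = distributed.Client(processes=False)
np.save('0.npy', np.arange(10))
(arr,) = dask.compute(load_numpy('0.npy'))
client.close()
```

Using `with lock:` instead of explicit `acquire()`/`release()` also guarantees the lock is released if `np.load` raises.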
I followed this answer; here is a benchmark:
import numpy as np
import dask
from dask.distributed import Client, Lock
import time

@dask.delayed
def locked_load(fn):
    lock = Lock('numpy-read')
    lock.acquire()
    out = np.load(fn)
    lock.release()
    return out

@dask.delayed
def unlocked_load(fn):
    return np.load(fn)

def work(arr_size, n_parts, use_lock=True):
    if use_lock:
        f = locked_load
    else:
        f = unlocked_load
    # np.int is removed in recent NumPy; use a concrete dtype instead.
    x = np.arange(arr_size, dtype=np.int64)
    for i in range(n_parts):
        np.save('%d.npy' % i, x)
    d = [f('%d.npy' % i) for i in range(n_parts)]
    return dask.compute(*d)

def main():
    client = Client()
    with open("lock_time.txt", "a") as fh:
        n_parts_list = [20, 100]
        arr_size_list = [1_000_000, 5_000_000, 10_000_000]
        for n_part in n_parts_list:
            for arr_size in arr_size_list:
                for use_lock in [True, False]:
                    st = time.time()
                    work(arr_size, n_part, use_lock)
                    en = time.time()
                    fh.write("%d %d %s %s\n" % (
                        n_part, arr_size, use_lock, str(en - st))
                    )
                    fh.flush()
    client.close()

if __name__ == '__main__':
    main()
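The script appends raw space-separated `n_part arr_size use_lock time` rows to `lock_time.txt`. A small sketch of turning such rows into aligned columns; the sample rows and the `tabulate` helper are made up for illustration:

```python
def tabulate(rows):
    # Format "n_part arr_size use_lock time" rows as aligned columns.
    out = ["%6s %9s %8s %7s" % ("n_part", "arr_size", "use_lock", "time")]
    for line in rows:
        n_part, arr_size, use_lock, t = line.split()
        out.append("%6s %9s %8s %7.2f" % (n_part, arr_size, use_lock, float(t)))
    return out

# Hypothetical sample rows in the format the benchmark writes.
for row in tabulate(["20 1000000 True 0.97", "20 1000000 False 0.89"]):
    print(row)
```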
Results (on a machine with 16 GB of RAM) are tabulated below.
You are right that this lock is meant for use with the distributed scheduler, so this is not a bug. You can also run the distributed scheduler locally if you like.

After creating a `dask.distributed.Client`, computing a delayed gives me another error, "can't pickle _thread.lock objects"; I will open another question about it... but the lock did not speed up I/O significantly.

You now have access to the distributed scheduler's dashboard, which has a Profile tab, so you can find out where the time is going. Also, your original question was about locking, not performance.

The benchmark times also include the time spent creating the test data; I may fix that later.
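The "can't pickle _thread.lock objects" error mentioned above is what happens when a plain `threading.Lock` ends up in a task graph: ordinary thread locks cannot be serialized, which is why the scheduler-coordinated `distributed.Lock` is needed instead. A quick stdlib-only illustration:

```python
import pickle
import threading

# A local thread lock only makes sense within one process, so
# pickle refuses to serialize it for transport to a worker.
lock = threading.Lock()
try:
    pickle.dumps(lock)
except TypeError as e:
    print("pickling failed:", e)
```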
+--------+----------+----------+----------+
| n_part | arr_size | use_lock | time |
+--------+----------+----------+----------+
| 20 | 1000000 | True | 0.97 |
| 20 | 1000000 | False | 0.89 |
| 20 | 5000000 | True | 7.52 |
| 20 | 5000000 | False | 6.80 |
| 20 | 10000000 | True | 16.70 |
| 20 | 10000000 | False | 15.78 |
| 100 | 1000000 | True | 3.76 |
| 100 | 1000000 | False | 6.88 |
| 100 | 5000000 | True | 43.22 |
| 100 | 5000000 | False | 38.96 |
| 100 | 10000000 | True | 291.34 |
| 100 | 10000000 | False | 389.34 |
+--------+----------+----------+----------+