Python: KilledWorker exception when using Dask with a large dataset on a single machine?
My problem is this: I am trying to read a large file (almost 12 GB, on my PC with 16 GB of RAM), but every time I try to perform some operation on the Dask dataframe, something goes wrong and I get an exception such as KilledWorker. I am not using any cluster. I also tried pandas, but RAM usage climbs to 100%, so I think I currently have no option other than Dask. Please take a look at my code snippet:
import dask.dataframe as dd
from dask.distributed import Client, progress
client = Client()
client
ddf = dd.read_csv('C:\\Users\\user\\Desktop\\bigfile1.csv', encoding="latin-1", dtype="str")
mylist_a =['a', 'b', 'c', 'd', 'e','f','g','h','i']
daskfile1 = ddf.loc[:, ~ddf.columns.isin(mylist_a)].compute()
Here are the errors:
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process 10292 was killed by signal 15
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:6585'], ('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 76)
distributed.client - WARNING - Couldn't gather 200 keys, rescheduling {"('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 76)": ('tcp://127.0.0.1:6585',), "('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 161)": ('tcp://127.0.0.1:6628',), "('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 51)": ('tcp://127.0.0.1:6568',), "('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 156)": ('tcp://127.0.0.1:6568',), "('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 58)": ('tcp://127.0.0.1:6585',), "('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 176)": ('tcp://127.0.0.1:6628',),
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:6715'], ('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 66)
And here is the exception:
KilledWorker Traceback (most recent call last)
<ipython-input-7-38194b0211e6> in <module>
3 # df = ddf.compute()
4
----> 5 daskfile1 = ddf.loc[:, ~ddf.columns.isin(mylist_a)].compute()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\base.py in compute(self, **kwargs)
154 dask.base.compute
155 """
--> 156 (result,) = compute(self, traverse=False, **kwargs)
157 return result
158
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\base.py in compute(*args, **kwargs)
396 keys = [x.__dask_keys__() for x in collections]
397 postcomputes = [x.__dask_postcompute__() for x in collections]
--> 398 results = schedule(dsk, keys, **kwargs)
399 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
400
~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2330 try:
2331 results = self.gather(packed, asynchronous=asynchronous,
-> 2332 direct=direct)
2333 finally:
2334 for f in futures.values():
~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
1654 return self.sync(self._gather, futures, errors=errors,
1655 direct=direct, local_worker=local_worker,
-> 1656 asynchronous=asynchronous)
1657
1658 @gen.coroutine
~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\client.py in sync(self, func, *args, **kwargs)
674 return future
675 else:
--> 676 return sync(self.loop, func, *args, **kwargs)
677
678 def __repr__(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\utils.py in sync(loop, func, *args, **kwargs)
275 e.wait(10)
276 if error[0]:
--> 277 six.reraise(*error[0])
278 else:
279 return result[0]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\utils.py in f()
260 if timeout is not None:
261 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262 result[0] = yield future
263 except Exception as exc:
264 error[0] = sys.exc_info()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\tornado\gen.py in run(self)
727
728 try:
--> 729 value = future.result()
730 except Exception:
731 exc_info = sys.exc_info()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\tornado\gen.py in run(self)
734 if exc_info is not None:
735 try:
--> 736 yielded = self.gen.throw(*exc_info) # type: ignore
737 finally:
738 # Break up a reference to itself
~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\client.py in _gather(self, futures, errors, direct, local_worker)
1495 six.reraise(type(exception),
1496 exception,
-> 1497 traceback)
1498 if errors == 'skip':
1499 bad_keys.add(key)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
KilledWorker: ("('from-delayed-pandas_read_text-read-block-try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 36)", 'tcp://127.0.0.1:6717')
I just want to perform this relatively simple operation on a large volume of data.
(Note that I have shown only a small portion of the errors, since there are many.)
Any kind of help would be greatly appreciated.

If you look in Task Manager, what is the maximum memory consumption? It is almost 15.5 GB, and then it drops (I guess that happens when a worker dies). Honestly, I came across that web page, but I don't know what I should change in the settings. Does the problem also occur if you load fewer columns? Try it without parallelism; the memory load should be much lower:
…compute(scheduler='single-threaded')
How many partitions does ddf have?