Python: Dask on a single machine with a large dataset causes a KilledWorker exception? - Fatal编程技术网

Tags: python, dask, dask-distributed

My problem is this: I am trying to read a large file (almost 12 GB, on my PC with 16 GB of RAM), but every time I try to run an operation on the Dask dataframe I get a series of errors ending in a KilledWorker exception.

I am not using any cluster. I also tried pandas, but RAM usage climbs to 100%, so at the moment I see no alternative to Dask.

Here is my code snippet:

from dask.distributed import Client, progress
import dask.dataframe as dd  # needed for dd.read_csv below

client = Client()
client  # display the cluster summary in Jupyter

ddf = dd.read_csv('C:\\Users\\user\\Desktop\\bigfile1.csv',
                  encoding="latin-1", dtype="str")

# Drop the unwanted columns, then pull the result into memory
mylist_a = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
daskfile1 = ddf.loc[:, ~ddf.columns.isin(mylist_a)].compute()
Here are the errors:

distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process 10292 was killed by signal 15
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:6585'], ('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 76)
distributed.client - WARNING - Couldn't gather 200 keys, rescheduling {"('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 76)": ('tcp://127.0.0.1:6585',), "('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 161)": ('tcp://127.0.0.1:6628',), "('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 51)": ('tcp://127.0.0.1:6568',), "('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 156)": ('tcp://127.0.0.1:6568',), "('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 58)": ('tcp://127.0.0.1:6585',), "('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 176)": ('tcp://127.0.0.1:6628',),

distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:6715'], ('try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 66)
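The first warning is the key line: by default each worker process gets an equal share of the machine's RAM, and the nanny restarts any worker that exceeds 95% of that budget. On a single 16 GB machine it can help to start fewer workers with an explicit, smaller limit. A sketch with assumed numbers (illustrative, not tuned values):

```python
from dask.distributed import Client

# Two single-threaded workers, each capped at 1 GB, so the cluster
# as a whole cannot exhaust the machine's RAM. The numbers here are
# assumptions purely to illustrate the knobs.
client = Client(n_workers=2, threads_per_worker=1, memory_limit="1GB")
n_live_workers = len(client.scheduler_info()["workers"])
client.close()
```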
Here is the exception:

KilledWorker                              Traceback (most recent call last)
<ipython-input-7-38194b0211e6> in <module>
      3 # df = ddf.compute()
      4 
----> 5 daskfile1 = ddf.loc[:, ~ddf.columns.isin(mylist_a)].compute()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\base.py in compute(self, **kwargs)
    154         dask.base.compute
    155         """
--> 156         (result,) = compute(self, traverse=False, **kwargs)
    157         return result
    158 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\base.py in compute(*args, **kwargs)
    396     keys = [x.__dask_keys__() for x in collections]
    397     postcomputes = [x.__dask_postcompute__() for x in collections]
--> 398     results = schedule(dsk, keys, **kwargs)
    399     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    400 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2330             try:
   2331                 results = self.gather(packed, asynchronous=asynchronous,
-> 2332                                       direct=direct)
   2333             finally:
   2334                 for f in futures.values():

~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
   1654             return self.sync(self._gather, futures, errors=errors,
   1655                              direct=direct, local_worker=local_worker,
-> 1656                              asynchronous=asynchronous)
   1657 
   1658     @gen.coroutine

~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\client.py in sync(self, func, *args, **kwargs)
    674             return future
    675         else:
--> 676             return sync(self.loop, func, *args, **kwargs)
    677 
    678     def __repr__(self):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\utils.py in sync(loop, func, *args, **kwargs)
    275             e.wait(10)
    276     if error[0]:
--> 277         six.reraise(*error[0])
    278     else:
    279         return result[0]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\utils.py in f()
    260             if timeout is not None:
    261                 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262             result[0] = yield future
    263         except Exception as exc:
    264             error[0] = sys.exc_info()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\tornado\gen.py in run(self)
    727 
    728                     try:
--> 729                         value = future.result()
    730                     except Exception:
    731                         exc_info = sys.exc_info()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\tornado\gen.py in run(self)
    734                     if exc_info is not None:
    735                         try:
--> 736                             yielded = self.gen.throw(*exc_info)  # type: ignore
    737                         finally:
    738                             # Break up a reference to itself

~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\client.py in _gather(self, futures, errors, direct, local_worker)
   1495                             six.reraise(type(exception),
   1496                                         exception,
-> 1497                                         traceback)
   1498                     if errors == 'skip':
   1499                         bad_keys.add(key)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

KilledWorker: ("('from-delayed-pandas_read_text-read-block-try_loc-ebb1f07a0b9e9ef0362fea6ff6407461', 36)", 'tcp://127.0.0.1:6717')
All I want is to run this relatively simple operation on a large amount of data. (Note that I have shown only a small part of the errors, because there are a lot of them.)


Any kind of help would be greatly appreciated.

Comments:

- If you look in Task Manager, what is the peak memory consumption? Also, check the Dask dashboard.
- It peaks at almost 15.5 GB and then drops (I guess that happens when the workers get killed). Honestly, I came across that web page, but I don't know what I should change in the settings.
- Do you also get the problem if you load fewer columns?
- Try it without parallelism; the memory load should be much lower: compute(scheduler='single-threaded')
- How many partitions does ddf have?