Dask workers crash during simple aggregations

I am trying to aggregate various columns of a 450-million-row dataset. When I use Dask's built-in aggregations such as 'min', 'max', 'std', and 'mean', they keep crashing one of the workers partway through.

The file I am using can be found here: look for test_set.csv

I have a Google Kubernetes cluster made up of 3 machines with 8 cores each and 22 GB of total memory.
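For reference, this is roughly how the cluster can be inspected from the client side; a minimal sketch, assuming the same placeholder connection string as in the code below:

from dask.distributed import Client

client = Client('google kubernetes cluster address')  # placeholder address

# One entry per worker: worker address -> number of threads/cores
print(client.ncores())

# The scheduler's view of the cluster, including per-worker metadata
info = client.scheduler_info()
print(len(info['workers']), 'workers registered')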

Since these are just the built-in aggregation functions, I haven't tried much else.

It isn't using much RAM either; total usage stays around 6 GB, and I don't see any errors indicating that the workers are out of memory.
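A minimal sketch of how per-worker memory can be checked directly, assuming psutil is available in the worker environment (it normally ships as a dependency of distributed):

import psutil
from dask.distributed import Client

client = Client('google kubernetes cluster address')  # placeholder address

def worker_rss_mb():
    # Resident memory of the current worker process, in MB
    return psutil.Process().memory_info().rss / 1e6

# Runs the function on every worker and returns {worker address: RSS in MB}
print(client.run(worker_rss_mb))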

Here is my basic code and the error log from the evicted worker:

import dask.dataframe as dd
from dask.distributed import Client, progress
from timeit import default_timer as timer  # assumed import for timer() below

client = Client('google kubernetes cluster address')

# Read the CSV from GCS in ~10 MB partitions
test_df = dd.read_csv('gs://filepath/test_set.csv', blocksize=10000000)

def process_flux(df):
    # Derive two extra columns from flux and flux_err
    flux_ratio_sq = df.flux / df.flux_err
    flux_by_flux_ratio_sq = df.flux * flux_ratio_sq
    df_flux = dd.concat([df, flux_ratio_sq, flux_by_flux_ratio_sq], axis=1)
    df_flux.columns = ['object_id', 'mjd', 'passband', 'flux', 'flux_err', 'detected',
                       'flux_ratio_sq', 'flux_by_flux_ratio_sq']
    return df_flux

aggs = {
    'flux': ['min', 'max', 'mean', 'std'],
    'detected': ['mean'],
    'flux_ratio_sq': ['sum'],
    'flux_by_flux_ratio_sq': ['sum'],
    'mjd': ['max', 'min'],
}

def featurize(df):
    start_df = process_flux(df)
    agg_df = start_df.groupby(['object_id']).agg(aggs)
    return agg_df

overall_start = timer()
final_df = featurize(test_df).compute()
overall_end = timer()
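For completeness, a variation of the same pipeline that keeps the aggregated result on the workers with persist() and watches it with progress() before pulling it back; a minimal sketch only, and the larger blocksize is an illustrative assumption rather than what I actually ran:

import dask.dataframe as dd
from dask.distributed import Client, progress

client = Client('google kubernetes cluster address')  # placeholder address

# Illustrative larger blocksize (fewer, bigger partitions); the run above used ~10 MB
test_df = dd.read_csv('gs://filepath/test_set.csv', blocksize=64000000)

agg_df = featurize(test_df)   # featurize() as defined above
agg_df = agg_df.persist()     # keep the aggregated result distributed on the workers
progress(agg_df)              # monitor the computation while it runs

final_df = agg_df.compute()   # pull the (much smaller) aggregated result to the client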
Error log:

 distributed.core - INFO - Event loop was unresponsive in Worker for 74.42s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
 distributed.core - INFO - Event loop was unresponsive in Worker for 3.30s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
 distributed.core - INFO - Event loop was unresponsive in Worker for 3.75s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
There are many more lines like this, and then:

 distributed.core - INFO - Event loop was unresponsive in Worker for 65.16s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
 distributed.worker - ERROR - Worker stream died during communication: tcp://hidden address
 Traceback (most recent call last):
 File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 180, in read
n_frames = yield stream.read_bytes(8)
 File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 441, in read_bytes
self._try_inline_read()
 File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 911, in _try_inline_read
self._check_closed()
 File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 1112, in _check_closed
raise StreamClosedError(real_error=self.error)
tornado.iostream.StreamClosedError: Stream is closed

response = yield comm.read(deserializers=deserializers)
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
yielded = next(result)
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 201, in read
convert_stream_closed_error(self, e)
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 127, in     convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: TimeoutError: [Errno 110] Connection timed out
It runs fairly quickly; I just want consistent performance without crashing the workers.


Thanks!

Did you ever figure anything out?