Dask DataFrame drop_duplicates performance


I am trying to get df.drop_duplicates() to work as expected, but I am running into performance and scalability problems.

My infrastructure consists of a client (let's say a Jupyter notebook for now) and a Dask cluster with 15 workers, each with 1 CPU and 9 GB of RAM.

The data is a set of Parquet files (say 2 GB) containing 2kk (about two million) rows of text data, and I need to deduplicate all posts by their source URL. After calling:

    import dask.dataframe as dd

    some_file = 's3://my_bucket/uploader/8953053034-tf4csv/parquet'
    df = dd.read_parquet(some_file, storage_options={'key': s3.key, 'secret': s3.secret})
    dedup_df = df.drop_duplicates(subset=['Url'], split_out=df.npartitions)

At first the cluster works fine and is reasonably fast, but as it approaches the end it starts to slow down until nothing happens at all. First, because of the load on the workers, I had to increase the timeout for connecting to the scheduler to 60 seconds:

dask-worker 
dask-worker Traceback (most recent call last):
dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 2122, in handle_missing_dep
dask-worker     who_has = await retry_operation(self.scheduler.who_has, keys=list(deps))
dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/utils_comm.py", line 390, in retry_operation
dask-worker     operation=operation,
dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/utils_comm.py", line 370, in retry
dask-worker     return await coro()
dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/core.py", line 858, in send_recv_from_rpc
dask-worker     comm = await self.pool.connect(self.addr)
dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/core.py", line 1013, in connect
dask-worker     **self.connection_args,
dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 245, in connect
dask-worker     _raise(error)
dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 215, in _raise
dask-worker     raise IOError(msg)
dask-worker OSError: Timed out trying to connect to 'tcp://dask-scheduler:8786' after 10 s: Timed out trying to connect to 'tcp://dask-scheduler:8786' after 10 s: connect() didn't finish in time
dask-worker 2020-07-30 10:59:24,987 - distributed.worker - ERROR - Handle missing dep failed, retrying
dask-worker Traceback (most recent call last):
dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 234, in connect
dask-worker     _raise(error)
dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 215, in _raise
dask-worker     raise IOError(msg)
dask-worker OSError: Timed out trying to connect to 'tcp://dask-scheduler:8786' after 10 s: connect() didn't finish in time
However, even after increasing the timeout, processing freezes and nothing happens on the cluster after the initial processing stage... The Dask UI shows (static for the last 20 minutes, no additional logs from workers or the scheduler):

I have already tried many different numbers of partitions, but there is no noticeable difference.

How can I successfully deduplicate 60-80 GB of data in Dask?

Update: I tried a larger file (15 GB) and got nanny actions that killed the worker processes:

dask-worker-5d64694d56-bh6mv dask-worker 2020-07-30 12:49:12,403 - distributed.core - INFO - Event loop was unresponsive in Worker for 15.88s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
dask-worker-5d64694d56-xdmvf dask-worker 2020-07-30 12:49:12,426 - distributed.nanny - INFO - Worker process 14 was killed by signal 9
dask-worker-5d64694d56-58vxw dask-worker 2020-07-30 12:49:12,455 - distributed.worker - ERROR - Worker stream died during communication: tcp://192.168.117.111:42609
dask-worker-5d64694d56-58vxw dask-worker Traceback (most recent call last):
dask-worker-5d64694d56-58vxw dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 184, in read
dask-worker-5d64694d56-58vxw dask-worker     n_frames = await stream.read_bytes(8)
dask-worker-5d64694d56-58vxw dask-worker tornado.iostream.StreamClosedError: Stream is closed
dask-worker-5d64694d56-58vxw dask-worker 
dask-worker-5d64694d56-58vxw dask-worker During handling of the above exception, another exception occurred:
dask-worker-5d64694d56-58vxw dask-worker 
dask-worker-5d64694d56-58vxw dask-worker Traceback (most recent call last):
dask-worker-5d64694d56-58vxw dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 1983, in gather_dep
dask-worker-5d64694d56-58vxw dask-worker     self.rpc, deps, worker, who=self.address
dask-worker-5d64694d56-58vxw dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 3258, in get_data_from_worker
dask-worker-5d64694d56-58vxw dask-worker     return await retry_operation(_get_data, operation="get_data_from_worker")
dask-worker-5d64694d56-58vxw dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/utils_comm.py", line 390, in retry_operation
dask-worker-5d64694d56-58vxw dask-worker     operation=operation,
dask-worker-5d64694d56-58vxw dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/utils_comm.py", line 370, in retry
dask-worker-5d64694d56-58vxw dask-worker     return await coro()
dask-worker-5d64694d56-58vxw dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 3245, in _get_data
dask-worker-5d64694d56-58vxw dask-worker     max_connections=max_connections,
dask-worker-5d64694d56-58vxw dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/core.py", line 644, in send_recv
dask-worker-5d64694d56-58vxw dask-worker     response = await comm.read(deserializers=deserializers)
dask-worker-5d64694d56-58vxw dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 199, in read
dask-worker-5d64694d56-58vxw dask-worker     convert_stream_closed_error(self, e)
dask-worker-5d64694d56-58vxw dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 123, in convert_stream_closed_error
dask-worker-5d64694d56-58vxw dask-worker     raise CommClosedError("in %s: %s" % (obj, exc))
dask-worker-5d64694d56-58vxw dask-worker distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
dask-worker-5d64694d56-htf48 dask-worker 2020-07-30 12:49:12,479 - distributed.worker - ERROR - Worker stream died during communication: tcp://192.168.117.111:42609
dask-worker-5d64694d56-htf48 dask-worker Traceback (most recent call last):
dask-worker-5d64694d56-htf48 dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 193, in read
dask-worker-5d64694d56-htf48 dask-worker     n = await stream.read_into(frame)
dask-worker-5d64694d56-htf48 dask-worker tornado.iostream.StreamClosedError: Stream is closed
dask-worker-5d64694d56-htf48 dask-worker 
dask-worker-5d64694d56-htf48 dask-worker During handling of the above exception, another exception occurred:
dask-worker-5d64694d56-htf48 dask-worker 
dask-worker-5d64694d56-htf48 dask-worker Traceback (most recent call last):
dask-worker-5d64694d56-htf48 dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 1983, in gather_dep
dask-worker-5d64694d56-htf48 dask-worker     self.rpc, deps, worker, who=self.address
dask-worker-5d64694d56-htf48 dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 3258, in get_data_from_worker
dask-worker-5d64694d56-htf48 dask-worker     return await retry_operation(_get_data, operation="get_data_from_worker")
dask-worker-5d64694d56-htf48 dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/utils_comm.py", line 390, in retry_operation
dask-worker-5d64694d56-htf48 dask-worker     operation=operation,
dask-worker-5d64694d56-htf48 dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/utils_comm.py", line 370, in retry
dask-worker-5d64694d56-htf48 dask-worker     return await coro()
dask-worker-5d64694d56-htf48 dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 3245, in _get_data
dask-worker-5d64694d56-htf48 dask-worker     max_connections=max_connections,
dask-worker-5d64694d56-htf48 dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/core.py", line 644, in send_recv
dask-worker-5d64694d56-htf48 dask-worker     response = await comm.read(deserializers=deserializers)
dask-worker-5d64694d56-htf48 dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 199, in read
dask-worker-5d64694d56-htf48 dask-worker     convert_stream_closed_error(self, e)
dask-worker-5d64694d56-htf48 dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 123, in convert_stream_closed_error
dask-worker-5d64694d56-htf48 dask-worker     raise CommClosedError("in %s: %s" % (obj, exc))
dask-worker-5d64694d56-htf48 dask-worker distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
dask-worker-5d64694d56-4ml4f dask-worker 2020-07-30 12:49:12,494 - distributed.worker - ERROR - Worker stream died during communication: tcp://192.168.117.111:42609
dask-worker-5d64694d56-4ml4f dask-worker Traceback (most recent call last):
dask-worker-5d64694d56-4ml4f dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 184, in read
dask-worker-5d64694d56-4ml4f dask-worker     n_frames = await stream.read_bytes(8)
dask-worker-5d64694d56-4ml4f dask-worker tornado.iostream.StreamClosedError: Stream is closed
dask-worker-5d64694d56-4ml4f dask-worker 
dask-worker-5d64694d56-4ml4f dask-worker During handling of the above exception, another exception occurred:
dask-worker-5d64694d56-4ml4f dask-worker 
dask-worker-5d64694d56-4ml4f dask-worker Traceback (most recent call last):
dask-worker-5d64694d56-4ml4f dask-worker   File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 1983, in gather_dep
dask-worker-5d64694d56-4ml4f dask-worker     self.rpc, deps, worker, who=self.address
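
The "killed by signal 9" nanny message usually means a worker process outgrew its memory budget and was OOM-killed. A sketch of how the workers could be launched so the nanny restarts them before the kernel does, and with the connect timeout raised via Dask's environment-variable configuration (the scheduler address and limits below are placeholders that must match your deployment):

```shell
# Raise the scheduler connect timeout from the default 10 s to 60 s.
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s

# Cap memory per worker so the nanny can act before the kernel OOM-kills it.
dask-worker tcp://dask-scheduler:8786 \
    --nthreads 1 \
    --memory-limit 9GB \
    --nanny
```

This does not make the shuffle fit in memory by itself, but it turns hard signal-9 kills into spill-to-disk and nanny restarts, which keeps the cluster diagnosable.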