My dask code seems to work in multithreaded mode but fails in multiprocessing mode

When I try to perform a groupby with dask, the code fails:
other_df = ddf.groupby(
    by=[self.phone_field, self.state_field]
).apply(
    lambda x: self.obtain_cluster_nos_weighted_levenshtein(x.copy()),
    meta={self.address_id_field: "f8",
          self.add_clust_field: "i8"}
).compute(scheduler='processes')
Here is the traceback:
File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/dask/local.py", line 461, in fire_task
dumps((dsk[key], data)),
File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps
cp.dump(obj)
File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 774, in save_tuple
save(element)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function
*self._dynamic_function_reduce(obj), obj=obj
File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 687, in _save_reduce_pickle5
save(state)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 774, in save_tuple
save(element)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 774, in save_tuple
save(element)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 549, in save
self.save_reduce(obj=obj, *rv)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 638, in save_reduce
save(args)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 774, in save_tuple
save(element)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 549, in save
self.save_reduce(obj=obj, *rv)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 662, in save_reduce
save(state)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/home/ec2-user/anaconda3/lib/python3.7/pickle.py", line 524, in save
rv = reduce(self.proto)
NotImplementedError: object proxy must define __reduce_ex__()
I'm guessing this is related to pickling before the tasks are handed off to the workers. The problem only appears with scheduler='processes'; with the multithreaded scheduler it runs fine. How can I fix this?

Calling a class method under multiprocessing is generally a bad idea. I declared self.obtain_cluster_nos_weighted_levenshtein as a standalone function, and my problem was solved. When you pass a bound method, you are actually serializing the instance, and possibly a closure as well; either of these may contain any number of things that cannot be pickled.