Scikit-learn: how to share a variable between threads in joblib when using external modules
I am trying to modify the sklearn source code. Specifically, I am modifying the GridSearch source so that the independent processes/threads that evaluate different model configurations share a variable. I need each thread/process to read and update that variable at run time, so that it can adapt its execution based on what the other threads have found. More specifically, the parameter I want to share is best, in the snippet below:
out = parallel(
    delayed(_fit_and_score)(
        clone(base_estimator), X, y, best, self.early,
        train=train, test=test, parameters=parameters,
        **fit_and_score_kwargs)
    for parameters, (train, test) in product(
        candidate_params, cv.split(X, y, groups)))
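Note that the _fit_and_score function lives in a separate module.
Sklearn uses joblib for parallelization, but I cannot work out how to do this properly when the function comes from an external module. The joblib documentation provides the following code: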
>>> shared_set = set()
>>> def collect(x):
...     shared_set.add(x)
...
>>> Parallel(n_jobs=2, require='sharedmem')(
...     delayed(collect)(i) for i in range(5))
[None, None, None, None, None]
>>> sorted(shared_set)
[0, 1, 2, 3, 4]
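One way to adapt the documentation example to a function that lives in another module is to pass the shared object in as an argument instead of relying on a module-level global. The sketch below assumes the thread-based execution that require='sharedmem' selects; SharedBest and fit_and_score are hypothetical stand-ins, not sklearn code.

```python
import threading

from joblib import Parallel, delayed


class SharedBest:
    """Hypothetical thread-safe holder for the best score seen so far."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = float("-inf")

    def update(self, score):
        # Read-modify-write under a lock so concurrent threads cannot
        # overwrite a better score with a worse one.
        with self._lock:
            if score > self.value:
                self.value = score
            return self.value


def fit_and_score(score, best):
    # Stand-in for a function imported from a separate module: it only
    # sees the shared state through its `best` argument.
    return best.update(score)


best = SharedBest()
scores = [0.2, 0.9, 0.5, 0.7]

# require='sharedmem' makes joblib use threads, so every worker receives
# the same SharedBest instance rather than a pickled copy.
seen = Parallel(n_jobs=2, require="sharedmem")(
    delayed(fit_and_score)(s, best) for s in scores
)
print(best.value)  # 0.9
```

Because threads share memory, no pickling of the holder takes place; the trade-off is that CPU-bound Python code is limited by the GIL.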
But I cannot work out how to make it run in my setting. You can find the source code here:
- Grid search:
- _fit_and_score:
from joblib import Parallel, delayed
from multiprocessing import Manager

# A Manager runs a server process; the Namespace proxy can be passed to workers.
manager = Manager()
q = manager.Namespace()
q.flag = False

def test(i, q):
    # update the shared variable from task 0
    if i == 0:
        q.flag = True
    # poll the shared flag for a while
    for n in range(100000000):
        if q.flag == True:
            return f'process {i} was updated'
    return f'process {i} was not updated'

out = Parallel(n_jobs=4)(delayed(test)(i, q) for i in range(4))
Output:
['process 0 was updated',
'process 1 was updated',
'process 2 was updated',
'process 3 was updated']
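Setting a flag is a single write, but a shared best score needs a read-compare-write, which can race between processes. A minimal sketch of the same Manager approach with a lock follows; report_score, run, and the "best" key are hypothetical names used for illustration only.

```python
from multiprocessing import Manager

from joblib import Parallel, delayed


def report_score(score, shared, lock):
    # The lock makes the read-compare-write sequence atomic across processes.
    with lock:
        if score > shared["best"]:
            shared["best"] = score
        return shared["best"]


def run(scores, n_jobs=2):
    # The manager server process owns the dict and the lock; workers only
    # receive picklable proxies to them.
    with Manager() as manager:
        shared = manager.dict(best=float("-inf"))
        lock = manager.Lock()
        Parallel(n_jobs=n_jobs)(
            delayed(report_score)(s, shared, lock) for s in scores
        )
        return shared["best"]


if __name__ == "__main__":
    print(run([0.2, 0.9, 0.5, 0.7]))
```

Every access goes through the manager process, so this is slower than thread-based sharing. On platforms that spawn processes rather than fork, the call to run should stay under the `if __name__ == "__main__":` guard.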