Using multiprocessing in Python to run a function for multiple iterations with multiple arguments and return multiple values

python multithreading machine-learning

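I run 100 iterations of the function model, so I tried to use multiprocessing to distribute the work. To collect the final output I tried using queues, but it takes far too long, which defeats the purpose of multiprocessing. How can I fix this?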

def model(X,Y,q,q1,q2):
  ada_clf={}
  pred1={}
  auc_final=[]
  for iteration in range(100):
    ada_clf[iteration] = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=1000,learning_rate=0.001)
    ada_clf[iteration].fit(X,Y)
    pred1[iteration]=ada_clf[iteration].predict(test1)
   
  individuallabelsfromada1=[]
  for i in range(len(test1)):
    individuallabelsfromada1.append([])
    for j in range(100):
      individuallabelsfromada1[i].append(pred1[j][i])
  
  final_labels_ada1=[]
  for each in individuallabelsfromada1:
    final_labels_ada1.append(find_majority(each))
 
  final=pd.Series(final_labels_ada1)
  temp_arr=np.array(final)
  total_labels2=pd.Series(temp_arr)

  fpr, tpr, thresholds = roc_curve(y_test, total_labels2, pos_label=1)
  auc_final.append(auc(fpr,tpr))
  q.put(total_labels2)
  q1.put(auc_final)
  q2.put(ada_clf)
  
  print('done')

  
overall_labels={}
final_auc={}
final_ada_clf={}

processes=[]
q=Queue()
q1=Queue()
q2=Queue()
for iteration in range(100):
  if __name__=='__main__':
    p=multiprocessing.Process(target=model,args=(x_train,y_labels,q,q1,q2,))
    overall_labels[iteration]=q.get()
    final_auc[iteration]=q1.get()
    final_ada_clf[iteration]=q2.get()
    p.start()
    processes.append(p)
for each in processes:
  each.join()

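Below is my edited version, but it returns only a single output. I tried to return multiple outputs but could not get that to work, so it uses a single output, total_labels2: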

    ##code before this is same as before, only thing changed is arguments of model from def model(X,Y) to def model(repeat,X,Y)
    total_labels2 = pd.Series(temp_arr)

    return (repeat,total_labels2)


def get_result(total_labels2):
    global testover_forall
    testover_forall.append(total_labels2)


if __name__ == '__main__':
    import multiprocessing as mp

    testover_forall = []

    pool = mp.Pool(40)
    for repeat in range(100):
        pool.apply_async(bound_model, args=(repeat, x_train, y_train), callback=get_result)
    pool.close()
    pool.join()


repetations_index=[]
for i in range(100):
  repetations_index.append(testover_forall[i][0])

final_last_labels = {}
for i in range(100):
    temp = str(i)
    final_last_labels[temp] = testover_forall[repetations_index[i]][1]

totally_last_labels=[]
for each in final_last_labels:
  temp=np.array(final_last_labels[each])
  totally_last_labels.append(temp)
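
If the intent of the re-indexing above is to put the collected results back into repeat order, sorting the (repeat, labels) tuples does that directly, since callbacks fire in completion order rather than submission order. A minimal sketch with made-up data; order_by_repeat is a hypothetical helper, not part of the original code.

```python
def order_by_repeat(collected):
    """Return the labels sorted by repeat index.

    `collected` is a list of (repeat, labels) tuples in arbitrary
    (completion) order, as appended by a callback like get_result.
    """
    return [labels for repeat, labels in sorted(collected, key=lambda item: item[0])]


if __name__ == "__main__":
    # Made-up results that arrived out of order.
    collected = [(2, ["c"]), (0, ["a"]), (1, ["b"])]
    print(order_by_repeat(collected))
```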

See my comment (actually a question) on your post.

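You should use a multiprocessing pool to limit the number of processes you create to the number of CPU cores you have. This also makes it easier to get the return values back from the model function, instead of writing the results to 3 different queues (you could have written a tuple of 3 values to a single queue). You will, of course, need other import statements and code. If you are using numpy and other libraries (which may be implemented in C), you could also try running this with threads to see whether that helps or hurts performance. To do so, replace ProcessPoolExecutor with ThreadPoolExecutor in the two places where it is referenced.

Note that any changes model makes to the passed arguments X and Y will not be reflected back to the main process. So if model is called repeatedly with the same arguments, it is unclear whether each call would return a different value, especially when the calls run in parallel.
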
from concurrent.futures import ProcessPoolExecutor

def model(X,Y):
  ada_clf={}
  pred1={}
  auc_final=[]
  for iteration in range(100):
    ada_clf[iteration] = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=1000,learning_rate=0.001)
    ada_clf[iteration].fit(X,Y)
    pred1[iteration]=ada_clf[iteration].predict(test1)

  individuallabelsfromada1=[]
  for i in range(len(test1)):
    individuallabelsfromada1.append([])
    for j in range(100):
      individuallabelsfromada1[i].append(pred1[j][i])

  final_labels_ada1=[]
  for each in individuallabelsfromada1:
    final_labels_ada1.append(find_majority(each))

  final=pd.Series(final_labels_ada1)
  temp_arr=np.array(final)
  total_labels2=pd.Series(temp_arr)

  fpr, tpr, thresholds = roc_curve(y_test, total_labels2, pos_label=1)
  auc_final.append(auc(fpr,tpr))
  #q.put(total_labels2)
  #q1.put(auc_final)
  #q2.put(ada_clf)
  return total_labels2, auc_final, ada_clf
  #print('done')

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(model, x_train, y_labels) for iteration in range(100)]
        # simple lists will suffice:
        overall_labels = []
        final_auc = []
        final_ada_clf = []
        for future in futures:
            # get the return values and store them
            total_labels2, auc_final, ada_clf = future.result()
            overall_labels.append(total_labels2)
            final_auc.append(auc_final)
            final_ada_clf.append(ada_clf)
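
To see the futures pattern in isolation, here is a minimal sketch that swaps the real model for a cheap stand-in. The names fake_model and run_repeats are hypothetical and not from the original post; the stand-in returns three values only so that the tuple unpacking mirrors the code above. The executor class is a parameter, so comparing ThreadPoolExecutor against ProcessPoolExecutor is a one-argument change.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_model(x, y):
    """Hypothetical stand-in for model(): returns three values as one tuple."""
    labels = [xi + yi for xi, yi in zip(x, y)]
    score = sum(labels) / len(labels)
    fitted = {"n_inputs": len(x)}  # placeholder for the fitted classifiers
    return labels, score, fitted


def run_repeats(executor_cls, x, y, n_repeats):
    """Submit n_repeats calls and gather the three return values in submission order."""
    all_labels, all_scores, all_fitted = [], [], []
    with executor_cls() as executor:
        futures = [executor.submit(fake_model, x, y) for _ in range(n_repeats)]
        for future in futures:
            labels, score, fitted = future.result()  # blocks until this call is done
            all_labels.append(labels)
            all_scores.append(score)
            all_fitted.append(fitted)
    return all_labels, all_scores, all_fitted


if __name__ == "__main__":
    x, y = [1, 2, 3], [10, 20, 30]
    # ThreadPoolExecutor is used here; pass ProcessPoolExecutor instead to
    # compare processes against threads (fake_model must then be picklable,
    # i.e. defined at module top level as it is here).
    labels, scores, fitted = run_repeats(ThreadPoolExecutor, x, y, n_repeats=4)
    print(len(labels), scores)
```

Iterating over the futures list, rather than over completions, keeps the results in submission order, which is why three plain lists are enough.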

Update
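
It is not clear from the problem description whether the returned results depend on a random number generator. If they do, and successive calls to the worker function, model, do not draw from a single random number generator shared by all processes in the multiprocessing pool, then the multiprocessing implementation will clearly return different results from the non-multiprocessing version. It is also not clear from the code provided where the random number generator is used; it may be in library code you do not have access to. If that is the case, you have two choices: (1) use multithreading instead by changing the import statement, as I have indicated in the code below; you may still get the performance benefits I already mentioned, or (2) update the signature of model as shown below. You will be passed a new argument, random_generator, which currently supports two methods, randint (like random.randint) and random (like random.random), although it should be easy to modify the code if you need other methods from module random. You would then use this random number generator in place of module random, if you are able to. Note, however, that this random generator will run much more slowly than the standard one; that is the price you pay.

Since we have also added a repeat argument to model (it must now be the final argument; note the updated signature below), we can now use method map (no callback needed):
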
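def model(X, Y, random_generator, repeat):
    ...
    etc.

from multiprocessing import Pool
# or use the following import instead to use multithreading (but then use the standard random generator):
# from multiprocessing.dummy import Pool
import random
from functools import partial
from multiprocessing.managers import BaseManager

class RandomGeneratorManager(BaseManager):
    pass

class RandomGenerator:
    def __init__(self):
        random.seed(0)

    def randint(self, a, b):
        return random.randint(a, b)

    def random(self):
        return random.random()

    # add other functions if needed

if __name__ == '__main__':
    RandomGeneratorManager.register('RandomGenerator', RandomGenerator)
    with RandomGeneratorManager() as manager:
        random_generator = manager.RandomGenerator()
        # why 40? why not use the default, which is the number of cpu cores you have:
        pool = Pool(40)
        worker = partial(model, x_train, y_labels, random_generator)
        results = pool.map(worker, range(100))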
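
The reason repeat has to be the last parameter is that functools.partial binds the leading positional arguments and map supplies the one that is left. A minimal sketch, assuming a hypothetical stand-in toy_model (with a plain number where random_generator would go) and using the thread-backed multiprocessing.dummy.Pool from the import comment above so that it runs without spawning processes:

```python
from functools import partial
from multiprocessing.dummy import Pool  # thread-backed; same API as multiprocessing.Pool


def toy_model(x, y, offset, repeat):
    """Hypothetical stand-in for model(X, Y, random_generator, repeat)."""
    return repeat, sum(x) + sum(y) + offset + repeat


def run_with_map(n_repeats, n_workers=4):
    # Bind the first three arguments; map() fills in `repeat`.
    worker = partial(toy_model, [1, 2], [3, 4], 100)
    with Pool(n_workers) as pool:
        return pool.map(worker, range(n_repeats))


if __name__ == "__main__":
    for repeat, value in run_with_map(5):
        print(repeat, value)
```

map returns its results in the order of its inputs, so the results need no callback and no re-sorting by repeat afterwards.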

You are running 100 processes that each call model