How to multithread an operation within a loop in Python

python multithreading

Say I have a very large list and I'm performing an operation like this:

for item in items:
    try:
        api.my_operation(item)
    except:
        print 'error with item'

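My problem is twofold:

There are a lot of items.
api.my_operation takes forever to return.

I'd like to use multithreading to spin up a bunch of api.my_operation calls at once, so I can process maybe 5 or 10 or even 100 items at a time.

If my_operation() raises an exception (because maybe I already processed that item), that's OK. It won't break anything. The loop can continue to the next item.

Note: this is for Python 2.7.3.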

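First, in Python, if your code is CPU-bound, multithreading won't help, because only one thread at a time can hold the Global Interpreter Lock, and therefore run Python code. So you need to use processes, not threads.

This is not true if your operation "takes forever to return" because it's IO-bound, that is, waiting on the network or disk copies or the like. I'll come back to that later.

Next, the way to process 5 or 10 or 100 items at once is to create a pool of 5 or 10 or 100 workers, and put the items into a queue that the workers service. Fortunately, the stdlib multiprocessing and concurrent.futures libraries both wrap up most of the details for you.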
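To make the "pool of workers servicing a queue" idea concrete, here is a minimal hand-rolled sketch of what those libraries take care of for you. It assumes Python 3 (the module is named Queue on 2.x), and my_operation and run_pool are hypothetical names invented for this illustration; my_operation stands in for api.my_operation.

```python
import queue
import threading


def my_operation(item):
    """Hypothetical stand-in for api.my_operation; fails on negative items."""
    if item < 0:
        raise ValueError('bad item')
    return item * 2


def run_pool(items, num_workers=5):
    """Process items with a fixed number of worker threads fed by a queue."""
    work = queue.Queue()
    for item in items:
        work.put(item)           # fill the queue before any worker starts

    results, errors = [], []
    lock = threading.Lock()      # protects the two shared lists

    def worker():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return           # queue drained, so this worker exits
            try:
                value = my_operation(item)
            except Exception:
                with lock:
                    errors.append(item)
            else:
                with lock:
                    results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                 # wait for every worker to finish
    return results, errors
```

Results arrive in whatever order the workers finish, so sort them (or carry the item along with the value) if order matters.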

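multiprocessing is more powerful and flexible for traditional programming; concurrent.futures is simpler if you need to compose future-waiting; for trivial cases, it really doesn't matter which you choose. (In this case, the most obvious implementation takes 3 lines with futures and 4 lines with multiprocessing.)

If you're using 2.6-2.7 or 3.0-3.1, futures isn't built in, but you can install it from PyPI (pip install futures).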

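Finally, it's usually a lot simpler to parallelize things if you can turn the entire loop iteration into a function call (something you could, e.g., pass to map), so let's do that first: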
def try_my_operation(item):
    try:
        api.my_operation(item)
    except:
        print('error with item')


Putting it all together:
import concurrent.futures

executor = concurrent.futures.ProcessPoolExecutor(10)
futures = [executor.submit(try_my_operation, item) for item in items]
concurrent.futures.wait(futures)


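If you have lots of relatively small jobs, the overhead of multiprocessing might swamp the gains. The way to solve that is to batch up the work into larger jobs. For example, you can use grouper from the itertools recipes (which you can copy and paste into your code, or get from the more-itertools project on PyPI) to split the items into groups and submit each group as a single job.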
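A rough sketch of that batching idea follows. To stay self-contained it uses a small hypothetical helper, chunked, instead of the grouper recipe, and my_operation is again a stand-in for api.my_operation. It also uses ThreadPoolExecutor so that it runs anywhere as written; the batching logic is the same with ProcessPoolExecutor, where it matters most, though process pools additionally need module-level functions and, on some platforms, an if __name__ == '__main__': guard.

```python
import concurrent.futures


def my_operation(item):
    """Hypothetical stand-in for api.my_operation; fails on negative items."""
    if item < 0:
        raise ValueError('bad item')


def chunked(items, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def try_multiple_operations(group):
    """Run the operation on every item of one batch; return the failed items."""
    failed = []
    for item in group:
        try:
            my_operation(item)
        except Exception:
            failed.append(item)
    return failed


def run_batched(items, batch_size=5, max_workers=10):
    """Submit one job per batch instead of one job per item."""
    with concurrent.futures.ThreadPoolExecutor(max_workers) as executor:
        futures = [executor.submit(try_multiple_operations, group)
                   for group in chunked(items, batch_size)]
        failed = []
        for future in futures:   # iterate in submission order
            failed.extend(future.result())
    return failed
```

Larger batches mean fewer jobs and less per-job overhead, but also coarser load balancing, so the batch size is worth tuning against your real workload.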

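Finally, what if your code is IO-bound? Then threads are just as good as processes, and with less overhead (and fewer limitations, but those limitations usually won't affect you in cases like this). Sometimes that "less overhead" is enough to mean you don't need batching with threads but you do with processes, which is a nice win.
So, how do you use threads instead of processes? Just change ProcessPoolExecutor to ThreadPoolExecutor.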
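For illustration, the thread-based variant might look like the sketch below. It assumes Python 3, and my_operation is a hypothetical IO-bound stand-in for api.my_operation that sleeps briefly and rejects every fourth item; run_threaded is likewise a name made up for this example.

```python
import concurrent.futures
import time


def my_operation(item):
    """Hypothetical stand-in for an IO-bound api.my_operation."""
    time.sleep(0.05)             # simulate waiting on the network
    if item % 4 == 0:
        raise ValueError('already processed')


def try_my_operation(item):
    """Wrap one loop iteration; report success instead of raising."""
    try:
        my_operation(item)
        return True
    except Exception:
        print('error with item')
        return False


def run_threaded(items, max_workers=10):
    """Same submit/wait pattern as before, with threads instead of processes."""
    with concurrent.futures.ThreadPoolExecutor(max_workers) as executor:
        futures = [executor.submit(try_my_operation, item) for item in items]
        concurrent.futures.wait(futures)
    return [f.result() for f in futures]
```

Because the workers spend their time sleeping (or, in real code, waiting on IO) rather than holding the GIL, the calls overlap even though they all run in one process.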

If you're not sure whether your code is CPU-bound or IO-bound, just try it both ways.

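Can I do this for multiple functions in my Python script? For example, if I had another for loop elsewhere in the code that I wanted to parallelize. Is it possible to do two multithreaded functions in the same script?
Yes. In fact, there are two different ways to do it.
First, you can share the same (thread or process) executor and use it from multiple places with no problem. The whole point of tasks and futures is that they're self-contained; you don't care where they run, just that you queue them up and eventually get the answer back.
Alternatively, you can have two executors in the same program with no problem. This has a performance cost: if you're using both executors at the same time, you'll end up trying to run (for example) 16 busy threads on 8 cores, which means there's going to be some context switching. But sometimes it's worth doing because, say, the two executors are rarely busy at the same time, and it makes your code a lot simpler. Or maybe one executor is running very large tasks that can take a while to complete, and the other is running very small tasks that need to complete as quickly as possible, because responsiveness is more important than throughput for part of your program.
If you don't know which is appropriate for your program, usually it's the first.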
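As a small sketch of the first option, the example below submits two unrelated kinds of task to one shared executor. The functions fetch, store, and run_both are hypothetical names chosen for the illustration, and Python 3 is assumed.

```python
import concurrent.futures


def fetch(item):
    """Hypothetical first task (for example, a download)."""
    return item * 2


def store(value):
    """Hypothetical second, unrelated task (for example, an upload)."""
    return 'stored %d' % value


def run_both(first_items, second_items, max_workers=8):
    """Feed two independent loops into a single shared executor."""
    with concurrent.futures.ThreadPoolExecutor(max_workers) as executor:
        fetch_futures = [executor.submit(fetch, i) for i in first_items]
        store_futures = [executor.submit(store, v) for v in second_items]
        # Each future is self-contained, so results can be collected per loop.
        fetched = [f.result() for f in fetch_futures]
        stored = [f.result() for f in store_futures]
    return fetched, stored
```

Note that the two loops here are independent of each other; if one task needed the result of another, that ordering would have to be made explicit.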

You can split the processing across a specified number of threads using an approach like this:
import threading                                                                

def process(items, start, end):                                                 
    for item in items[start:end]:                                               
        try:                                                                    
            api.my_operation(item)                                              
        except Exception:                                                       
            print('error with item')                                            


def split_processing(items, num_splits=4):                                      
    split_size = len(items) // num_splits                                       
    threads = []                                                                
    for i in range(num_splits):                                                 
        # determine the indices of the list this thread will handle             
        start = i * split_size                                                  
        # special case on the last chunk to account for uneven splits           
        end = None if i+1 == num_splits else (i+1) * split_size                 
        # create the thread                                                     
        threads.append(                                                         
            threading.Thread(target=process, args=(items, start, end)))         
        threads[-1].start() # start the thread we just created                  

    # wait for all threads to finish                                            
    for t in threads:                                                           
        t.join()                                                                



split_processing(items)

There's also multiprocessing.Pool, and the following sample illustrates how to use one of them:
from multiprocessing.pool import ThreadPool as Pool
# from multiprocessing import Pool

pool_size = 5  # your "parallelness"

# define worker function before a Pool is instantiated
def worker(item):
    try:
        api.my_operation(item)
    except:
        print('error with item')

pool = Pool(pool_size)

for item in items:
    pool.apply_async(worker, (item,))

pool.close()
pool.join()

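Now, if you have indeed determined that your work is CPU-bound, as @abarnert mentioned, change ThreadPool to the process pool implementation (commented out under the ThreadPool import). You can find more details here:

Do I need to install concurrent? How do I use it? I'm on Python 2.7.3 and it can't find the concurrent module. Edit: it seems to be available only in 3.2. Bummer.

@doremi: I assumed you were using 3.x because you're calling print as a function rather than as a statement. But if you're using 2.x, you can install futures from PyPI (e.g., pip install futures), and just import futures instead of import concurrent.futures. (Or you can use multiprocessing, which isn't much more complicated; it just means 4 lines of code instead of 3.)

This is a very in-depth answer @abarnert, as usual :) See my answer for a highly portable version (it uses everything 2.7.x has to offer).

@jeffrey: Yes. Let me edit the answer, because the response is too long to fit in a comment.

@jeffrey: Well, even with just one function, you can only safely use it with an executor if it doesn't depend on the futures from another iteration. Using two functions does make it easier to inadvertently depend on a sequencing that isn't there, but that was already possible. Anyway, the usual way to avoid that is to put the sequencing in explicitly, e.g., by scheduling the second function in a callback on the first. In fact...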

import numpy as np
import threading


def threaded_process(items_chunk):
    """Process one chunk of items; runs in its own thread."""
    for item in items_chunk:
        try:
            api.my_operation(item)
        except Exception:
            print('error with item')


n_threads = 20
# Split the items into as many chunks as there are threads
array_chunk = np.array_split(items, n_threads)
thread_list = []
for thr in range(n_threads):
    # args must be a tuple, hence the trailing comma inside the parentheses
    thread = threading.Thread(target=threaded_process, args=(array_chunk[thr],))
    thread_list.append(thread)
    thread_list[thr].start()

# Wait for all threads to finish
for thread in thread_list:
    thread.join()