How to use a Python multiprocessing pool in a continuous loop

python selenium-webdriver

I am using the Python multiprocessing library to run a Selenium script. My code is below:

from multiprocessing import Process

#-- start and join multiple processes ---
thread_list = []
total_threads = 10  #-- no of parallel processes
for i in range(total_threads):
    t = Process(target=get_browser_and_start, args=[url, nlp, pixel])
    thread_list.append(t)
    print("starting thread...")
    t.start()

for t in thread_list:
    print("joining existing thread...")
    t.join()

As I understand it, the join() function waits for each process to finish. But I want a process, as soon as it is freed, to be assigned another task to run.

It can be understood like this:

Say 8 processes are started in the first instance:

no_of_tasks_to_perform = 100

for i in range(no_of_tasks_to_perform):
    start 8 processes
    if process no. 2 has finished executing, start a new process
    maintain 8 processes at any point in time while
    "i" is <= no_of_tasks_to_perform

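Instead of starting new processes every now and then, try putting all your tasks into a multiprocessing.Queue() and starting 8 long-running processes. Each process keeps pulling new tasks from the task queue and doing the job until there are no more tasks.

In your case, it's more like this:
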
from multiprocessing import Queue, Process

def worker(queue):
    # Caveat: with several workers, another worker can take the last task
    # between the empty() check and get(), which would leave get() blocked.
    while not queue.empty():
        task = queue.get()

        # now start to work on your task
        url, nlp, pixel = task[:3]  # unpack the arguments from the task
        get_browser_and_start(url, nlp, pixel)

def main():
    queue = Queue()

    # Now put tasks into queue
    no_of_tasks_to_perform = 100

    for i in range(no_of_tasks_to_perform):
        queue.put([url, nlp, pixel, ...])

    # Now start all processes
    process = Process(target=worker, args=(queue, ))
    process.start()
    ...
    process.join()
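
For illustration only, here is a fuller sketch of the same idea. It assumes Python 3, starts 8 workers in a loop, and stops each worker with a `None` sentinel instead of checking `queue.empty()`. The body of `get_browser_and_start` is a placeholder, and `run_all`, `NUM_WORKERS`, `STOP`, the result queue, and the example URLs are made-up names that are not part of the original answer.

```python
from multiprocessing import Process, Queue

NUM_WORKERS = 8   # assumed number of long-running worker processes
STOP = None       # sentinel that tells a worker to exit


def get_browser_and_start(url, nlp, pixel):
    """Placeholder for the asker's Selenium function (hypothetical body)."""
    return "done: %s" % url


def worker(task_queue, result_queue):
    # iter(callable, sentinel) keeps calling task_queue.get() until it
    # returns STOP, so a worker never blocks on an already drained queue.
    for url, nlp, pixel in iter(task_queue.get, STOP):
        result_queue.put(get_browser_and_start(url, nlp, pixel))


def run_all(tasks, num_workers=NUM_WORKERS):
    task_queue, result_queue = Queue(), Queue()
    for task in tasks:
        task_queue.put(task)
    for _ in range(num_workers):
        task_queue.put(STOP)  # one sentinel per worker

    processes = [
        Process(target=worker, args=(task_queue, result_queue))
        for _ in range(num_workers)
    ]
    for p in processes:
        p.start()

    # Collect one result per task before join(), so no worker is left
    # waiting to flush its results into a full pipe.
    results = [result_queue.get() for _ in tasks]
    for p in processes:
        p.join()
    return results


if __name__ == "__main__":
    # Hypothetical task list: 100 (url, nlp, pixel) tuples.
    tasks = [("http://example.com/%d" % i, None, None) for i in range(100)]
    print(len(run_all(tasks)))
```

Each of the 8 processes takes the next task as soon as it finishes the previous one, so 8 tasks are in flight until the queue runs out. A real `get_browser_and_start` would presumably create its own WebDriver inside the worker process; that part is left out here.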

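@shane, where are the 8 processes in this setup? Should it simply be process.start(8)? I have a custom Python module in which I can initialize a class to set up the WebDriver instance and then call my scraping function with the arguments from the queue. But don't I need to instantiate 8 different WebDrivers into a pool? I'm wondering how one X-window frame buffer (Xvfb) and a headless chromedriver instance could act as 8 different processes working through a queue of tasks (in the thousands).

In this setup you actually start the 8 processes manually (or however many you want) and make each one a long-running process that keeps fetching new tasks (in your case, instantiating a browser and doing the task), e.g. process1 = Process(target=worker, args=(queue,)) ... process8. If you want to use multiprocessing.Pool, you need to use map to pass your function. That is a different setup, and in your case it's actually not that convenient, especially when multiple arguments are involved.

@shane, hmm... I missed the bold sentence at the start of your answer; sorry, and thanks. Another option is to control the number of launched processes by timing the execution of the threads, letting processes die off as new ones start. This actually gives me better control over the number of requests per hour going out through the proxy switch, while the number of processes may vary because of I/O and waiting.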
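
On the multiprocessing.Pool point, a minimal sketch of what that setup might look like, assuming Python 3. The multiple arguments are handled by passing each task as one tuple and unpacking it in a small wrapper. `run_task`, the placeholder body of `get_browser_and_start`, and the example URLs are assumptions made for illustration, not something taken from the thread.

```python
from multiprocessing import Pool


def get_browser_and_start(url, nlp, pixel):
    """Placeholder for the asker's Selenium function (hypothetical body)."""
    return "done: %s" % url


def run_task(task):
    # map-style Pool methods pass a single argument to the function,
    # so the (url, nlp, pixel) tuple is unpacked here.
    url, nlp, pixel = task
    return get_browser_and_start(url, nlp, pixel)


if __name__ == "__main__":
    # Hypothetical task list: 100 (url, nlp, pixel) tuples.
    tasks = [("http://example.com/%d" % i, None, None) for i in range(100)]

    # 8 worker processes; a free worker is handed the next task.
    with Pool(processes=8) as pool:
        # imap_unordered yields results as they complete, in any order.
        for result in pool.imap_unordered(run_task, tasks):
            print(result)
```

The pool keeps the same 8 processes alive for the whole run, which is close to the "maintain 8 processes at any point in time" behaviour asked for. On Python 3.3 and later, `Pool.starmap` is another way to pass several arguments without a wrapper.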