Rate-limiting downloads across multiple Python processes


I want to download and process a lot of files from a website. The site's terms of service limit the number of files I'm allowed to download per second.

The time needed to process the files is the real bottleneck, so I'd like to process multiple files in parallel. But I don't want the separate processes to combine to violate the download limit, so I need something that caps the overall request rate. My idea is below, but I'm by no means an expert on the multiprocessing module.

import multiprocessing
from multiprocessing.managers import BaseManager
import time

class DownloadLimiter(object):

    def __init__(self, time):
        self.time = time
        self.lock = multiprocessing.Lock()

    def get(self, url):
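        # Holding the shared lock while sleeping spaces downloads across all
        # worker processes at least `self.time` seconds apart.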
        self.lock.acquire()
        time.sleep(self.time)
        self.lock.release()
        return url


class DownloadManager(BaseManager):
    pass

DownloadManager.register('downloader', DownloadLimiter)


class Worker(multiprocessing.Process):

    def __init__(self, downloader, queue, file_name):
        super().__init__()
        self.downloader = downloader
        self.file_name = file_name
        self.queue = queue

    def run(self):
        while not self.queue.empty():
            url = self.queue.get()
            content = self.downloader.get(url)
            with open(self.file_name, "a+") as fh:
                fh.write(str(content) + "\n")
Then, elsewhere, the downloads are run with:

manager = DownloadManager()
manager.start()
downloader = manager.downloader(0.5)
queue = multiprocessing.Queue()

urls = range(50)
for url in urls:
    queue.put(url)

job1 = Worker(downloader, queue, r"foo.txt")
job2 = Worker(downloader, queue, r"bar.txt")
jobs = [job1, job2]

for job in jobs:
    job.start()

for job in jobs:
    job.join()
This seems to do the job at a small scale, but I'm a little worried about whether the locking is really done correctly.


Also, if there is a better pattern for achieving the same goal, I'd love to hear it.

This can be done cleanly with Ray, a library for parallel and distributed Python.

Resources in Ray

When you start Ray, you can tell it what resources are available on that machine. Ray will automatically try to determine the number of CPU cores and the number of GPUs, but these can be specified explicitly, and in fact so can arbitrary user-defined resources, e.g., by calling

ray.init(num_cpus=4, resources={'Network': 2})
This tells Ray that the machine has 4 CPU cores and 2 units of a user-defined resource called Network.
Each Ray "task", which is a schedulable unit of work, has some resource requirements. By default, a task requires 1 CPU core and nothing else. However, arbitrary resource requirements can be specified by declaring the corresponding function with

@ray.remote(resources={'Network': 1})
def f():
    pass
This tells Ray that in order for f to execute on a "worker" process, there must be 1 CPU core (the default) and 1 Network resource available.

Since the machine has 2 Network resources and 4 CPU cores, at most 2 copies of f can execute concurrently. On the other hand, if there is another function g declared as
@ray.remote
def g():
    pass
then four copies of g can execute concurrently, or two copies of f and two copies of g can execute concurrently.
Example

Here is an example with placeholders for the actual functions that download and process the content.

import ray
import time

max_concurrent_downloads = 2

ray.init(num_cpus=4, resources={'Network': max_concurrent_downloads})

@ray.remote(resources={'Network': 1})
def download_content(url):
    # Download the file.
    time.sleep(1)
    return 'result from ' + url

@ray.remote
def process_result(result):
    # Process the result.
    time.sleep(1)
    return 'processed ' + result

urls = ['url1', 'url2', 'url3', 'url4']

result_ids = [download_content.remote(url) for url in urls]

processed_ids = [process_result.remote(result_id) for result_id in result_ids]

# Wait until the tasks have finished and retrieve the results.
processed_results = ray.get(processed_ids)
Here is a timeline depiction (which you can produce by running ray timeline from the command line and opening the resulting JSON file in chrome://tracing in the Chrome web browser):
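
The same trace can also be dumped from Python via ray.timeline, assuming a reasonably recent Ray version:

ray.timeline(filename='timeline.json')  # open the resulting JSON file in chrome://tracing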

In the above script, we submit 4 download_content tasks. These are the ones we rate-limit by specifying that they require the Network resource (in addition to the default 1 CPU resource). Then we submit 4 process_result tasks, which each require the default 1 CPU resource. The tasks execute in three stages (just look at the blue boxes):

  • We start by executing 2 download_content tasks, which is as many as can
    run at a time (because of the rate limiting). We can't execute any
    process_result tasks yet because they depend on the output of the
    download_content tasks.
  • Those finish, so we start executing the remaining two download_content
    tasks as well as two process_result tasks, since we are not rate-limiting
    the process_result tasks.
  • We execute the remaining process_result tasks.
  • Each "row" is one worker process. Time goes from left to right.


    You can see more about how to do this in the Ray documentation.

    There is a library that does exactly what you need, called ratelimit.

    An example from their home page:

    This function will not be able to make more than 15 API calls within a 15-minute period:

    from ratelimit import limits
    
    import requests
    
    FIFTEEN_MINUTES = 900
    
    @limits(calls=15, period=FIFTEEN_MINUTES)
    def call_api(url):
        response = requests.get(url)
    
        if response.status_code != 200:
            raise Exception('API response: {}'.format(response.status_code))
        return response
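
    By default, exceeding the limit makes the decorated call raise a RateLimitException; if you would rather block until the period resets, the same package also provides a sleep_and_retry decorator (a sketch following its documented usage):

    from ratelimit import limits, sleep_and_retry
    import requests

    FIFTEEN_MINUTES = 900

    @sleep_and_retry
    @limits(calls=15, period=FIFTEEN_MINUTES)
    def call_api(url):
        response = requests.get(url)
        if response.status_code != 200:
            raise Exception('API response: {}'.format(response.status_code))
        return response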
    
    By the way, for I/O-heavy tasks such as web crawling you can use multithreading instead of multiprocessing. With multiprocessing you have to create another process for control and orchestrate everything you do, whereas with multithreading all threads inherently have access to the main process's memory, so signaling becomes much easier, since state can be shared directly between threads:
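
    For example, here is a minimal sketch of a limiter shared between threads (the RateLimiter helper and the 0.5-second spacing are illustrative assumptions):

    import threading
    import time

    class RateLimiter:
        """Space out acquisitions by at least `interval` seconds across all threads."""

        def __init__(self, interval):
            self.interval = interval
            self.lock = threading.Lock()
            self.last = 0.0

        def wait(self):
            # Holding the lock while sleeping serializes callers and enforces
            # the minimum spacing between consecutive downloads.
            with self.lock:
                delay = self.last + self.interval - time.monotonic()
                if delay > 0:
                    time.sleep(delay)
                self.last = time.monotonic()

    limiter = RateLimiter(0.5)  # at most ~2 downloads per second, globally

    def crawl(urls):
        for url in urls:
            limiter.wait()
            # download url here, e.g. with requests.get(url)

    threads = [threading.Thread(target=crawl, args=(chunk,))
               for chunk in (['url1', 'url2'], ['url3', 'url4'])]
    for t in threads:
        t.start()
    for t in threads:
        t.join()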


    The simplest way is to download the documents on the main thread and feed them to a pool of workers.

    In my own implementations I use Celery for processing documents and gevent for downloads. It does the same thing, just with more complexity.

    Here is a simple example:

    import multiprocessing
    from multiprocessing import Pool
    import time
    import typing
    
    def work(doc: str) -> str:
        # do some processing here....
        return doc + " processed"
    
    def download(url: str) -> str:
        return url  # a hack for demo, use e.g. `requests.get()`
    
    def run_pipeline(
        urls: typing.List[str],
        session_request_limit: int = 10,
        session_length: int = 60,
    ) -> None:
        """
        Download and process each url in `urls` at a max. rate limit
        given by `session_request_limit / session_length`
        """
        workers = Pool(multiprocessing.cpu_count())
        results = []
    
        n_requests = 0
        session_start = time.time()
    
        for url in urls:
            doc = download(url)
            results.append(
                workers.apply_async(work, (doc,))
            )
            n_requests += 1
    
            if n_requests >= session_request_limit:
            time_to_next_session = session_length - (time.time() - session_start)
            time.sleep(max(0.0, time_to_next_session))
    
            if time.time() - session_start >= session_length:
                session_start = time.time()
                n_requests = 0
    
        # Collect results
        for result in results:
            print(result.get())
    
    if __name__ == "__main__":
        urls = ["www.google.com", "www.stackoverflow.com"]
        run_pipeline(urls)
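
    Note that in this design download() runs serially in the parent process, so the session-based sleep caps the global request rate, while the worker pool keeps the (CPU-bound) processing parallel.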
    

    It's not quite clear what you mean by "download rate limit". If it means the number of concurrent downloads, which is a common use case, then I think a simple solution is to use a semaphore with a process pool:

    #!/usr/bin/env python3
    import os
    import time
    import random
    from functools import partial
    from multiprocessing import Pool, Manager
    
    
    CPU_NUM = 4
    CONCURRENT_DOWNLOADS = 2
    
    
    def download(url, semaphore):
        pid = os.getpid()
    
        with semaphore:
            print('Process {p} is downloading from {u}'.format(p=pid, u=url))
            time.sleep(random.randint(1, 5))
    
        # Process the obtained resource:
        time.sleep(random.randint(1, 5))
    
        return 'Successfully processed {}'.format(url)
    
    
    def main():
        manager = Manager()
    
        semaphore = manager.Semaphore(CONCURRENT_DOWNLOADS)
        target = partial(download, semaphore=semaphore)
    
        urls = ['https://link/to/resource/{i}'.format(i=i) for i in range(10)]
    
        with Pool(processes=CPU_NUM) as pool:
            results = pool.map(target, urls)
    
        print(results)
    
    
    if __name__ == '__main__':
        main()
    

    As you can see, only CONCURRENT_DOWNLOADS processes are downloading at any one time, while the rest are busy processing the obtained resources.

    OK, after the following clarification from the OP:

    By "downloads per second" I mean that, globally, no more than 10 downloads are started per second.

    I decided to post another answer, since my first one might still be of interest to someone looking to limit the number of concurrently running processes.

    I don't think an extra framework is needed to solve this problem. The idea is to spawn a downloading thread for each resource link, put the results on a resource queue, and have a fixed number of processing workers, which are processes rather than threads:

    #!/usr/bin/env python3
    import os
    import time
    import random
    from threading import Thread
    from multiprocessing import Process, JoinableQueue
    
    
    WORKERS = 4
    DOWNLOADS_PER_SECOND = 2
    
    
    def download_resource(url, resource_queue):
        pid = os.getpid()
    
        t = time.strftime('%H:%M:%S')
        print('Thread {p} is downloading from {u} ({t})'.format(p=pid, u=url, t=t),
              flush=True)
        time.sleep(random.randint(1, 10))
    
        results = '[resource {}]'.format(url)
        resource_queue.put(results)
    
    
    def process_resource(resource_queue):
        pid = os.getpid()
    
        while True:
            res = resource_queue.get()
    
            print('Process {p} is processing {r}'.format(p=pid, r=res),
                  flush=True)
            time.sleep(random.randint(1, 10))
    
            resource_queue.task_done()
    
    
    def main():
        resource_queue = JoinableQueue()
    
        # Start process workers:
        for _ in range(WORKERS):
            worker = Process(target=process_resource,
                             args=(resource_queue,),
                             daemon=True)
            worker.start()
    
        urls = ['https://link/to/resource/{i}'.format(i=i) for i in range(10)]
    
        while urls:
            target_urls = urls[:DOWNLOADS_PER_SECOND]
            urls = urls[DOWNLOADS_PER_SECOND:]
    
            # Start downloader threads:
            for url in target_urls:
                downloader = Thread(target=download_resource,
                                    args=(url, resource_queue),
                                    daemon=True)
                downloader.start()
    
            time.sleep(1)
    
        resource_queue.join()
    
    
    if __name__ == '__main__':
        main()
    
    The result looks something like this:

    $ ./limit_download_rate.py
    Thread 32482 is downloading from https://link/to/resource/0 (10:14:08)
    Thread 32482 is downloading from https://link/to/resource/1 (10:14:08)
    Thread 32482 is downloading from https://link/to/resource/2 (10:14:09)
    Thread 32482 is downloading from https://link/to/resource/3 (10:14:09)
    Thread 32482 is downloading from https://link/to/resource/4 (10:14:10)
    Thread 32482 is downloading from https://link/to/resource/5 (10:14:10)
    Process 32483 is processing [resource https://link/to/resource/2]
    Process 32484 is processing [resource https://link/to/resource/0]
    Thread 32482 is downloading from https://link/to/resource/6 (10:14:11)
    Thread 32482 is downloading from https://link/to/resource/7 (10:14:11)
    Process 32485 is processing [resource https://link/to/resource/1]
    Process 32486 is processing [resource https://link/to/resource/3]
    Thread 32482 is downloading from https://link/to/resource/8 (10:14:12)
    Thread 32482 is downloading from https://link/to/resource/9 (10:14:12)
    Process 32484 is processing [resource https://link/to/resource/6]
    Process 32485 is processing [resource https://link/to/resource/9]
    Process 32483 is processing [resource https://link/to/resource/8]
    Process 32486 is processing [resource https://link/to/resource/4]
    Process 32485 is processing [resource https://link/to/resource/7]
    Process 32483 is processing [resource https://link/to/resource/5]
    
    Here, DOWNLOADS_PER_SECOND threads are started every second, two in this example, which download and put the resources into the queue. WORKERS is the number of processes that get resources from the queue for further processing.
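
    Note that because the workers are daemon processes, the program exits as soon as resource_queue.join() returns, i.e., once task_done() has been called for every downloaded resource.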