Is there a better way to avoid memory leaks when combining multiprocessing and requests in Python?

python web-scraping memory-leaks

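I am currently using the requests module together with multiprocessing to scrape a single target.

I am using Pool and the asynchronous multiprocessing methods.

Each process sends a series of consecutive requests, and on each request I randomly switch the headers (User-Agent) and the proxy.

After a while I noticed that the computer had slowed down and that every request was failing in every script.

After some digging I realized that the problem was not the proxies but a memory leak from the requests.

I have read other posts about memory leaks with multiprocessing.

My question is: is there a better way to avoid this than using if __name__ == '__main__':

(Maybe dumping the memory every so many iterations, or something similar?)

Here is my code: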

a = [[('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3), ('acz.txt', 'acz', 'ac o', 5), ('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo', 4)],[('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3), ('acz.txt', 'acz', 'ac o', 5), ('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo', 4)],[('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3), ('acz.txt', 'acz', 'ac o', 5), ('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo', 4)],[('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3), ('acz.txt', 'acz', 'ac o', 5), ('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo', 4)]]

def hydra_gecko(file_name, initial_letter, final_letter, process_number):
    # url and proxy details here
    response = requests.get(url, headers=header_switcher(), proxies={'http': proxy, 'https': proxy}, timeout=(1, 3))
    # parse html and gather data


# Note: this creates (and tears down) a new pool for every batch in `a`.
for multi_arguments in a:
    if __name__ == '__main__':
        with Pool(5) as p:
            print(p.starmap_async(hydra_gecko, multi_arguments))
            p.close()
            p.join()

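Is there a better way? Is there code that could dump the memory every so many iterations, or something similar, that would work better than the code above?

Thanks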

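You are creating a new pool for each multi_arguments. That is a waste of resources. If the total number of worker processes exceeds the number of CPU cores, the workers will compete for CPU time and even memory, which slows the whole job down.

The whole purpose of a pool is to process more items than there are workers.

Try the following instead (using a single pool):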
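To illustrate that point, here is a minimal sketch, not the asker's real scraper. It assumes a hypothetical stand-in task, `fake_fetch`, that sleeps instead of calling `requests.get`, and a shortened copy of the nested list from the question. The batches are flattened with `itertools.chain.from_iterable` and handed to one pool; the worker count is fixed at 2 here only so that it is easy to see a few workers handling many items.

```python
import os
import time
from itertools import chain
from multiprocessing import Pool

# Nested batches shaped like the `a` list in the question (shortened).
batches = [
    [('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3)],
    [('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo', 4)],
    [('ab.txt', 'ab', 'abo', 1), ('acz.txt', 'acz', 'ac o', 5)],
]


def fake_fetch(item):
    """Hypothetical stand-in for the real request: report which worker ran it."""
    file_name, initial_letter, final_letter, process_number = item
    time.sleep(0.01)  # pretend to wait on the network
    return file_name, os.getpid()


def run_all(nested, workers=2):
    """Flatten the batches and process every item with a single pool."""
    items = list(chain.from_iterable(nested))
    with Pool(workers) as pool:
        # map() returns results in the same order as the input items.
        return pool.map(fake_fetch, items)


if __name__ == '__main__':
    results = run_all(batches)
    pids = {pid for _, pid in results}
    print(len(results), 'items handled by', len(pids), 'worker process(es)')
```

The pool hands items to whichever worker is free, so there is no need to group the work into per-pool batches beforehand.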

import requests
from multiprocessing import Pool

# One flat list of work items instead of a list of batches.
a = [
  ('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3),
  ('acz.txt', 'acz', 'ac o', 5), ('ad.txt', 'ad', 'ado', 2),
  ('ae.txt', 'ae', 'aeo', 4), ('ab.txt', 'ab', 'abo', 1),
  ('ac.txt', 'ac', 'aco', 3), ('acz.txt', 'acz', 'ac o', 5),
  ('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo', 4),
  ('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3),
  ('acz.txt', 'acz', 'ac o', 5), ('ad.txt', 'ad', 'ado', 2),
  ('ae.txt', 'ae', 'aeo', 4), ('ab.txt', 'ab', 'abo', 1),
  ('ac.txt', 'ac', 'aco', 3), ('acz.txt', 'acz', 'ac o', 5),
  ('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo', 4)
]

def hydra_gecko(item):
    file_name, initial_letter, final_letter, process_number = item
    # url and proxy details here
    response = requests.get(
      url, headers=header_switcher(),
      proxies={'http': proxy, 'https': proxy},
      timeout=(1, 3)
    )
    # parse html and gather data, return result.
    return response.status_code

if __name__ == '__main__':
    # Do **not** choose a number of workers. The default usually works fine.
    # If you are worried about memory leaks, set maxtasksperchild
    # to refresh the worker process after a certain number of tasks.
    with Pool(maxtasksperchild=4) as p:
        for result in p.imap_unordered(hydra_gecko, a):
            print(result)
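
The effect of maxtasksperchild can be observed with a small sketch. It assumes a hypothetical task, `report_pid`, that only returns the ID of the process that ran it, and a helper, `pids_used`, introduced here for illustration. Without recycling, a pool of two workers should show at most two distinct process IDs; with maxtasksperchild=4 and twelve tasks, each worker is replaced after four tasks, so more process IDs should appear.

```python
import os
from multiprocessing import Pool


def report_pid(_):
    """Hypothetical task: return the ID of the worker process that ran it."""
    return os.getpid()


def pids_used(tasks, workers, maxtasksperchild):
    """Run `tasks` trivial jobs and return the set of worker PIDs seen."""
    with Pool(workers, maxtasksperchild=maxtasksperchild) as pool:
        # chunksize=1 so that every item counts as one task for
        # maxtasksperchild (a chunk is dispatched as a single task).
        return set(pool.imap_unordered(report_pid, range(tasks), chunksize=1))


if __name__ == '__main__':
    print('no recycling:   ', len(pids_used(12, 2, None)), 'distinct PIDs')
    print('recycle every 4:', len(pids_used(12, 2, 4)), 'distinct PIDs')
```

When a worker exits, the operating system reclaims all of its memory, including anything leaked inside it, which is why recycling workers limits how far a leak can grow. The cost is the overhead of starting replacement processes, so a very small maxtasksperchild value may slow the job down.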