Python: what is causing this concurrent.futures deadlock? (code included)


I have a concurrent.futures scraping script that I use for low-priority stuff. Lately, however, it has started having problems: it gets stuck and never finishes.

I was able to narrow the problem down to 17 URLs (out of a batch of 18k, so you can imagine how much fun that was). Something about one or more of these 17 URLs must be causing the stall (a deadlock?), even though I use timeouts on both the requests and the futures. Strangely, it does not seem to be any single URL: when I run the code I get log messages about URLs completing, and the set of URLs that actually completes changes on every run, so there is no one URL I can point to as the culprit.

Any help is welcome.

(Run the function as is. Do not use runBad=False, since that expects a list of tuples.)

EDIT1: This also happens with ProcessPoolExecutor.

EDIT2: The problem seems to be related to retries. When I comment out these three lines and use a plain requests.get, it finishes without a hitch. But why? Could this be a compatibility problem between how retries are implemented and concurrent.futures?

#    s = requests.Session()
#    retries = Retry(total=1, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504], raise_on_status=False) # raise_on_status=False = returns a response instead of raising RetryError
#    s.mount("https://", HTTPAdapter(max_retries=retries))
EDIT3: Even this simple setup does not work, so it really is the HTTPAdapter/max_retries mounting that triggers it. I even tried it without a urllib3 Retry(), with just max_retries=2. Still no luck. I opened an issue to see whether we are missing something.

Here is the original concurrent.futures code:

import requests
import concurrent.futures
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from requests.exceptions import HTTPError
from requests.exceptions import SSLError
from requests.exceptions import ConnectionError
from requests.exceptions import Timeout
from requests.exceptions import TooManyRedirects
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) # disabled SSL warnings

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
TIMEOUT = 5

def getMultiRequest(url, runBad, bad_request, tTimeout):
    #print("url = ", url)
    s = requests.Session()
    retries = Retry(total=3, backoff_factor=5, status_forcelist=[429, 500, 502, 503, 504], raise_on_status=False) # raise_on_status=False = instead of RetryError returns a response
    s.mount("https://", HTTPAdapter(max_retries=retries))
    if runBad == False:
        try:
            response = s.get(url, headers=HEADERS, timeout=tTimeout, verify=False)
           
                                            # Processing stuff // some can be pretty long (Levenstein etc)
               
            ret = (url, response.url, response.status_code, "", len(response.content), "", "", "")
        except HTTPError as e:
            ret = (url, "", e.response.status_code, "", 0, "", "", False)
        except SSLError:
            ret = (url, "", 0, "SSL certificate verification failed", 0, "", "", False)
        except ConnectionError:
            ret = (url, "", 0, "Cannot establish connection", 0, "", "", False)
        except Timeout:
            ret = (url, "", 0, "Request timed out", 0, "", "", False)
        except TooManyRedirects:
            ret = (url, "", 0, "Too many redirects", 0, "", "", False)
        except Exception:
            ret = (url, "", 0, "Undefined exception", 0, "", "", False)
        return ret
    else:
        try:
            response = s.get(url, headers=HEADERS, timeout=tTimeout, verify=False)
           
                                            # Processing stuff // some can be pretty long (Levenstein etc)
               
            ret = (url, response.url, response.status_code, "", "")
        except Exception:
            ret = (url, "", 0, "", "")
        return ret

def getMultiRequestThreaded(urlList, runBad, logURLs, tOut):
    responseList = []
    if runBad == True:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future_to_url = {executor.submit(getMultiRequest, url, runBad, "", tOut): url for url in urlList}
            for future in concurrent.futures.as_completed(future_to_url):
                url = future_to_url[future]
                try:
                    data = future.result(timeout=30)
                except Exception as exc:
                    data = (url, 0, str(type(exc)))
                finally:
                    if logURLs == True:
                        print("BAD URL done: '" + url + "'.")
                    responseList.append(data)
    else:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future_to_url = {executor.submit(getMultiRequest, url[0], runBad, url[1], tOut): url for url in urlList}
            for future in concurrent.futures.as_completed(future_to_url):
                url = future_to_url[future][0]
                try:
                    data = future.result(timeout=30)
                except Exception as exc:
                    data = (url, 0, str(type(exc)))
                finally:
                    if logURLs == True:
                        print("LEGIT URL done: '" + url + "'.")
                    responseList.append(data)
    return responseList

URLs = [
    'https://www.appyhere.com/en-us',
    'https://jobilant.work/da',
    'https://www.iworkjobsite.com.au/jobseeker-home.htm',
    'https://youtrust.jp/lp',
    'https://passioneurs.net/ar',
    'https://employdentllc.com',
    'https://www.ivvajob.com/default/index',
    'https://praceapp.com/en',
    'https://www.safecook.be/en/home-en',
    'https://www.ns3a.com/en',
    'https://www.andjaro.com/en/home',
    'https://sweatcoin.club/',
    'https://www.pursuitae.com',
    'https://www.jobpal.ai/en',
    'https://www.clinicoin.io/en',
    'https://www.tamrecruiting.com/applicant-tracking-system-software-recruitment-management-system-talent-management-software-from-the-applicant-manager',
    'https://dott.one/index.html'
]

output = getMultiRequestThreaded(URLs, True, True, TIMEOUT)

I was able to reproduce the deadlock. I am not sure why it happens, but with multiprocessing.pool.ThreadPool() it does not occur.
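The ThreadPool version of the fetch loop can be sketched roughly as below; the function and helper names (fetch, fetch_all) are illustrative, not taken from the original code.

```python
from multiprocessing.pool import ThreadPool

import requests

TIMEOUT = 5

def fetch(url):
    try:
        r = requests.get(url, timeout=TIMEOUT, verify=False)
        return (url, r.status_code)
    except Exception as exc:
        return (url, str(type(exc)))

def fetch_all(urls, workers=8):
    # map() blocks until every worker has returned, with no per-future
    # bookkeeping; the pool is torn down by the context manager.
    with ThreadPool(processes=workers) as pool:
        return pool.map(fetch, urls)
```

Call fetch_all(urlList) in place of the ThreadPoolExecutor block to compare the two pool implementations on the same batch.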


I modified the program to add all the URLs to a set and, as each URL's fetch completed (for better or worse) inside the for future in concurrent.futures.as_completed(future_to_url): loop, I removed that URL from the set and printed the current contents of the set. That way, when the program finally hung, I knew exactly which URLs were still outstanding: it was always the same URLs.
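The set-tracking debugging technique described above can be sketched as follows; fetch() is a stand-in for the real request function.

```python
import concurrent.futures

def fetch(url):
    return (url, len(url))  # placeholder for the actual HTTP call

urls = ["https://a.example", "https://b.example", "https://c.example"]
remaining = set(urls)

with concurrent.futures.ThreadPoolExecutor() as executor:
    future_to_url = {executor.submit(fetch, u): u for u in urls}
    for future in concurrent.futures.as_completed(future_to_url):
        # discard the finished URL and show what is still pending, so a hang
        # immediately reveals the stragglers
        remaining.discard(future_to_url[future])
        print("still pending:", sorted(remaining))
```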

When I tried fetching those URLs myself, they all returned 503 Service Unavailable errors. And when I commented out the following two lines, the program ran to completion:

retries = Retry(total=3, backoff_factor=5, status_forcelist=[429, 500, 502, 503, 504], raise_on_status=False) # raise_on_status=False = instead of RetryError returns a response
s.mount("https://", HTTPAdapter(max_retries=retries))
Just removing code 503 from the status_forcelist did not help. Either there is some other error in this specification (although it looks correct, apart from a rather large backoff_factor, which I reduced while making sure I still waited long enough), or there is a bug in requests or urllib3.

Here is a printout of each result in the output variable:

('https://www.appyhere.com/en-us', 'https://www.appyhere.com/en-us', 200, '', '')
('https://www.iworkjobsite.com.au/jobseeker-home.htm', 'https://www.iworkjobsite.com.au/jobseeker-home.htm', 200, '', '')
('https://passioneurs.net/ar', 'https://passioneurs.net/ar', 404, '', '')
('https://youtrust.jp/lp', 'https://youtrust.jp/lp', 200, '', '')
('https://jobilant.work/da', 'https://jobilant.work/da/', 200, '', '')
('https://employdentllc.com', 'https://employdentllc.com/', 503, '', '')
('https://www.ivvajob.com/default/index', 'https://www.ivvajob.com/default/index', 200, '', '')
('https://www.ns3a.com/en', 'https://www.ns3a.com/en', 200, '', '')
('https://www.safecook.be/en/home-en', 'https://www.safecook.be/en/home-en/', 200, '', '')
('https://sweatcoin.club/', 'https://sweatcoin.club/', 200, '', '')
('https://www.andjaro.com/en/home', 'https://www.andjaro.com/en/home/', 200, '', '')
('https://praceapp.com/en', 'https://praceapp.com/en/', 200, '', '')
('https://www.clinicoin.io/en', 'https://www.clinicoin.io/en', 200, '', '')
('https://www.jobpal.ai/en', 'https://www.jobpal.ai/en/', 200, '', '')
('https://dott.one/index.html', 'https://dott.one/index.html', 200, '', '')
('https://www.tamrecruiting.com/applicant-tracking-system-software-recruitment-management-system-talent-management-software-from-the-applicant-manager', 'https://www.tamrecruiting.com/applicant-tracking-system-software-recruitment-management-system-talent-management-software-from-the-applicant-manager', 404, '', '')
('https://www.pursuitae.com', 'https://www.pursuitae.com/', 503, '', '')
UPDATE

I found the problem. You need the respect_retry_after_header=False parameter on Retry:

retries = Retry(total=3, backoff_factor=5, status_forcelist=[429, 500, 502, 503, 504], raise_on_status=False, respect_retry_after_header=False) # raise_on_status=False = returns a response instead of raising RetryError

You may also want to reduce the backoff_factor to 1.
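Putting the fix together, the full Retry configuration looks like the sketch below (with backoff_factor already reduced to 1). respect_retry_after_header=False stops urllib3 from sleeping for however long a server's Retry-After header demands, which is what made the workers look deadlocked.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
    raise_on_status=False,             # return the response instead of raising RetryError
    respect_retry_after_header=False,  # ignore server-supplied Retry-After delays
)

s = requests.Session()
s.mount("https://", HTTPAdapter(max_retries=retries))
```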


This now appears to be a duplicate of another question.


Does the deadlock also happen if you use ProcessPoolExecutor? – AKX

@AKX Not with this batch, no. I was under the impression, though, that using ProcessPoolExecutor with requests is undesirable. But I will run some tests with it.

I don't see why using a process pool rather than a thread pool would make any significant difference here, since you are not explicitly sharing a request session. (Sharing one would help with connection pooling.) – AKX

@AKX Spoke too soon: I was able to get the above running with an if __name__ == '__main__' guard, but the deadlock still happens even with ProcessPoolExecutor.

You have actually changed two things, the pooling method and the error recovery, which is now gone. There is nothing wrong with concurrent.futures itself; something simply goes wrong when retrying a couple of problematic URLs. See my answer.
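For completeness, a hedged sketch of the __main__ guard mentioned in the comments: on platforms that start worker processes by importing the main module (e.g. Windows with the spawn start method), the submitting code must sit under this guard, or every child process re-runs the submissions. The work() function here is purely illustrative.

```python
import concurrent.futures

def work(x):
    return x * x

if __name__ == "__main__":
    # Only the parent process executes this block; spawned children import
    # the module, define work(), and skip the guard.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(work, range(4)))
    print(results)
```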