Python multiprocessing ThreadPool hangs at the end
I wrote a script that "parses" all the domains from a file. After launch, everything works fine, but when only a few domains are left, it gets stuck. Sometimes parsing the last two domains takes a very long time. I can't figure out what the problem is. Has anyone run into a situation like this? Tell me how to fix it.

After launch everything finishes quickly (as it should) until the very end. At the end, when a few domains remain, it stops. It makes no difference whether there are 1,000 domains or 10,000.

Full code:
import re
import sys
import json
import requests
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
from requests.packages.urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

pool = 100

# Rules.json maps company -> rule type -> list of regex patterns
with open("Rules.json") as file:
    REGEX = json.loads(file.read())

ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0'}


def Domain_checker(domain):
    try:
        # Note: no timeout is set here
        r = requests.get("http://" + domain, verify=False, headers=ua)
        r.encoding = "utf-8"

        for company in REGEX.keys():
            for type in REGEX[company]:
                check_entry = 0
                for ph_regex in REGEX[company][type]:
                    if bool(re.search(ph_regex, r.text)) is True:
                        check_entry += 1

                # Record the domain only if every pattern of the rule matched
                if check_entry == len(REGEX[company][type]):
                    title = BeautifulSoup(r.text, "lxml")
                    Found_domain = "\nCompany: {0}\nRule: {1}\nURL: {2}\nTitle: {3}\n".format(
                        company, type, r.url, title.title.text)
                    print(Found_domain)
                    with open("/tmp/__FOUND_DOMAINS__.txt", "a", encoding='utf-8', errors='ignore') as file:
                        file.write(Found_domain)

    except requests.exceptions.ConnectionError:
        pass
    except requests.exceptions.TooManyRedirects:
        pass
    except requests.exceptions.InvalidSchema:
        pass
    except requests.exceptions.InvalidURL:
        pass
    except UnicodeError:
        pass
    except requests.exceptions.ChunkedEncodingError:
        pass
    except requests.exceptions.ContentDecodingError:
        pass
    except AttributeError:
        pass
    except ValueError:
        pass

    return domain


if __name__ == '__main__':
    with open(sys.argv[1], "r", encoding='utf-8', errors='ignore') as file:
        Domains = file.read().split()

    pool = 100
    print("Pool = ", pool)

    results = ThreadPool(pool).imap_unordered(Domain_checker, Domains)

    string_num = 0
    for result in results:
        print("{0} => {1}".format(string_num, result))
        string_num += 1

    with open("/tmp/__FOUND_DOMAINS__.txt", encoding='utf-8', errors='ignore') as found_domains:
        found_domains = found_domains.read()

    print("{0}\n{1}".format("#" * 40, found_domains))
The problem was solved after setting a timeout.
Thanks to the user nicknamed "eri" :)

Comments on the question:

One of those except blocks may be suppressing the exception that is raised. At least print the exception.

There are no errors caused by the thread pool among the exceptions; they are only exceptions related to unreachable domains, encoding, and so on.

@LIGAT0R: since you are suppressing all of those exceptions, you don't know what is causing them. By the way, you can handle multiple exceptions in the same way in a single except clause by making a tuple of them, i.e. except (requests.exceptions.ConnectionError, requests.exceptions.TooManyRedirects, etc.):.
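To illustrate that comment, here is a minimal sketch of how the repeated except blocks could be collapsed into one clause that also prints what was caught. The exception tuple simply mirrors the ones the original script already suppresses; the print format is an illustrative choice, not part of the original code:

    # One except clause can handle several exception types via a tuple;
    # printing the exception shows why a given domain was skipped.
    try:
        r = requests.get("http://" + domain, verify=False, headers=ua)
        ...
    except (requests.exceptions.ConnectionError,
            requests.exceptions.TooManyRedirects,
            requests.exceptions.InvalidSchema,
            requests.exceptions.InvalidURL,
            requests.exceptions.ChunkedEncodingError,
            requests.exceptions.ContentDecodingError,
            UnicodeError, AttributeError, ValueError) as exc:
        print("{0}: {1!r}".format(domain, exc))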
The fixed request:

    requests.get("http://" + domain, headers=ua, verify=False, timeout=10)
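For reference, requests applies no timeout by default, so a server that accepts the connection but never sends a response can block a worker thread indefinitely, which matches the symptom of the pool stalling on the last few domains. The timeout can also be given as separate connect and read limits; the values below are illustrative, not from the original script:

    # timeout may be a single number or a (connect, read) tuple of seconds;
    # without any timeout, a silent server can hang a worker thread forever.
    try:
        r = requests.get("http://" + domain, headers=ua, verify=False,
                         timeout=(5, 10))  # 5 s to connect, 10 s between bytes read
    except requests.exceptions.Timeout:
        pass  # unresponsive host: skip it instead of hanging the pool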