Python 3.x InvalidSchema: No connection adapters were found (Python 3.5.2)
Tags: python-3.x, httprequest

I am trying to extract email addresses from web pages. Here is my email-grabbing function:
import re

import bs4 as bs  # assumed alias, inferred from the bs.BeautifulSoup calls
import requests


def emlgrb(x):
    email_set = set()
    for url in x:
        try:
            response = requests.get(url)
            soup = bs.BeautifulSoup(response.text, "lxml")
            # Collect everything in the page text that looks like an email address
            emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", soup.text, re.I))
            email_set.update(emails)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            continue
    return email_set
This function is supposed to be fed by another function, which creates the list of URLs. The feeder function:
def handle_local_links(url, link):
    # Turn a site-relative link such as "/about" into an absolute one
    if link.startswith("/"):
        return "".join([url, link])
    return link


def get_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = bs.BeautifulSoup(response.text, "lxml")
        body = soup.body
        links = [link.get("href") for link in body.find_all("a")]
        links = [handle_local_links(url, link) for link in links]
        links = [str(link.encode("ascii")) for link in links]
        return links
    # ... (the except clauses are omitted in the original post)
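It goes on with a number of except clauses which, if triggered, return an empty list (not important). However, the return value of get_links() looks like this: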
["b'https://pythonprogramming.net/parsememcparseface//'"]
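Of course, the list contains many more links (I can't post them because of my reputation). The emlgrb() function cannot process this list (InvalidSchema: No connection adapters were found), but if I manually remove the b and the redundant quotes, so that the list looks like this: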
['https://pythonprogramming.net/parsememcparseface//']
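emlgrb() works. Any suggestions about where the problem lies, or about how to write a "cleaning function" that gets the second list from the first, are welcome.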
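Thanks.

The solution is to drop
.encode('ascii')
You can pass an encoding to str() instead, for example: str(object=b'', encoding='utf-8', errors='strict')
This is because str() called without an encoding falls back to the object's .__repr__(), so for a bytes object the result is the text b'...', with the prefix and quotes included. In fact, that is exactly what gets printed when you do print(bytes_obj). Calling .encode() on a str object creates a bytes object.

What would the output look like if you remove .encode('ascii')?
Actually, it works fine, thanks. I think the encoding can also be specified in str()?
If you need it ;) I added some explanation to the answer.
Did it work? :)
Sorry for the late reply. It works exactly as expected. Thanks.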
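
To make the answer concrete, here is a minimal sketch of what happens to a link inside get_links(), and of the "cleaning function" the question asked about. The helper name clean_link is hypothetical (it is not part of the original post), and it assumes the mangled entries are exactly what str() produced from ASCII-encoded bytes. The mangled string starts with b' rather than http:// or https://, which is presumably why requests finds no connection adapter for it.

```python
import ast

url = "https://pythonprogramming.net/parsememcparseface//"

# str() on a bytes object, without an encoding, returns its repr:
# the b prefix and the quotes become part of the text.
broken = str(url.encode("ascii"))
print(broken)  # b'https://pythonprogramming.net/parsememcparseface//'

# Passing an encoding to str(), or calling .decode(), recovers the real text.
assert str(url.encode("ascii"), encoding="ascii") == url
assert url.encode("ascii").decode("ascii") == url


def clean_link(text):
    """Hypothetical helper: undo str(bytes) on one already-mangled link."""
    if text.startswith(("b'", 'b"')):
        try:
            value = ast.literal_eval(text)  # safely parses the bytes literal
        except (ValueError, SyntaxError):
            return text  # not a valid literal after all; leave it untouched
        if isinstance(value, bytes):
            return value.decode("ascii")
    return text


print([clean_link(link) for link in [broken]])
# ['https://pythonprogramming.net/parsememcparseface//']
```

Cleaning after the fact is only a workaround. Removing .encode("ascii") from get_links(), as the answer suggests, avoids creating the mangled strings in the first place.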