Python try and except URL correction (Python 3)
I am trying to get the HTML from a list of web pages. However, not all of the URLs are written correctly. Most of the invalid URLs in the list use http, while the sites now use https. Some include a "www." that has to be dropped, and others are missing a "www." that has to be added.
import requests

def repl_www_http(url):
    x = url.replace("www.", "")
    y = x.replace("http", "https")
    return y

def repl_www(url):
    y = url.replace("www.", "")
    return y

def repl_http(url):
    y = url.replace("http", "https")
    return y

def repl_no_www(url):
    y = url.replace("//", "//www.")
    return y

def get_html(urllist):
    for i in urllist:
        html = ""
        try:
            html = requests.get(i)
            html = html.text
            return html
        except requests.exceptions.ConnectionError:
            try:
                html = requests.get(repl_http(i))
                html = html.text
                print("replaced // with //www.")
            except requests.exceptions.ConnectionError:
                try:
                    html = requests.get(repl_http(i))
                    html = html.text
                    print("replaced http with https")
                    return html
                except requests.exceptions.ConnectionError:
                    try:
                        html = requests.get(repl_www(i))
                        html = html.text
                        print("replaced www. with .")
                        return html
                    except requests.exceptions.ConnectionError:
                        try:
                            html = requests.get(repl_www_http(i))
                            html = html.text
                            print("replaced www with . and http with https")
                            return html
                        except requests.exceptions.ConnectionError:
                            return "no HTML found on this URL"
    print("gethtml finished", html)

This is the error I get:
Traceback (most recent call last):
  File "C:\replacer.py", line 76, in <module>
    html = get_html(i)
  File "C:\replacer.py", line 37, in get_html
    html = requests.get(repl_http(i))
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\sessions.py", line 498, in request
    prep = self.prepare_request(req)
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\sessions.py", line 441, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\models.py", line 309, in prepare
    self.prepare_url(url, params)
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\models.py", line 383, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

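How can I solve this so that the incorrect URLs get corrected?

The problem is that the URL passed to requests.get() raises a MissingSchema error. You should catch that error in the same place where you catch ConnectionError.

I also think you should use a generator to clean up the code, because try/except statements should not be nested like this.
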
def get_versions_url(my_url):
    # Yield the original URL first, then each corrected variant.
    yield my_url
    yield repl_www(my_url)
    yield repl_http(my_url)
    yield repl_www_http(my_url)

def get_html(urllist):
    # use i only for indexes
    for my_url in urllist:
        for url_fixed in get_versions_url(my_url):
            try:
                # I didn't figure out why you return here instead of finishing the outer loop
                return requests.get(url_fixed).text
            except requests.exceptions.ConnectionError:
                pass
            except requests.exceptions.MissingSchema:
                pass

You can then debug the generator on its own. Try:

for url in get_versions_url(<your url>):
    print(url)

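Printing the variants tends to expose a pitfall of plain str.replace: "https://example.com".replace("http", "https") gives "httpss://example.com", because "http" is also a prefix of "https". The following is only a sketch of one way around that, not the answerer's code. It assumes the variants are built with urllib.parse, and the names candidate_urls, first_html and the fetch parameter are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(url):
    """Yield the URL as given, then scheme/"www." variants, without duplicates."""
    # A URL without "//" has no scheme; prefix "//" so the host is parsed as netloc.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = parts.netloc
    bare = host[4:] if host.startswith("www.") else host

    # Try the scheme that was given first (https if none), then the other one.
    schemes = [parts.scheme or "https"]
    schemes += [s for s in ("https", "http") if s not in schemes]

    seen = set()
    for scheme in schemes:
        for netloc in (host, bare, "www." + bare):
            candidate = urlunsplit(
                (scheme, netloc, parts.path, parts.query, parts.fragment)
            )
            if candidate not in seen:
                seen.add(candidate)
                yield candidate


def first_html(url, fetch):
    """Return the result of `fetch` for the first candidate that loads, else None.

    `fetch` is a hypothetical callable that takes a URL and returns its text.
    """
    for candidate in candidate_urls(url):
        try:
            return fetch(candidate)
        except Exception:  # broad on purpose in this sketch; narrow it in real code
            continue
    return None
```

With requests, fetch could be something like lambda u: requests.get(u, timeout=10).text, and the except clause could be narrowed to requests.exceptions.RequestException, which is the base class of both ConnectionError and MissingSchema.
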
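I think some of your repl_ functions do not work the way you expect. What is repl_http?

def repl_www_http(url): x = url.replace("www.", "") y = x.replace("http", "https") return y def repl_www(url): y = url.replace("www.", "") return y def repl_http(url): y = url.replace("http", "https") return y def repl_no_www(url): y = url.replace("//", "//www.") return y

Can you put this in your question?

Yes! This is my first question on Stack Overflow ;-)

Did you try printing the URL that is being requested? Maybe your repl_http function is not working as expected and returns only h as the URL.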
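
One possible explanation for the 'h', offered here as an assumption because the calling code at line 76 is not shown: the traceback has get_html(i) being called at module level, which suggests get_html may receive a single URL string rather than a list. Iterating over a string yields its characters, so the first "URL" requested would be 'h'. A minimal sketch of a guard, with the hypothetical helper name as_url_list:

```python
def as_url_list(urls):
    """Return a list of URLs, wrapping a lone string instead of iterating over it."""
    if isinstance(urls, str):
        return [urls]
    return list(urls)


# Iterating over a string yields single characters, not URLs.
first_item = next(iter("http://example.com"))
print(repr(first_item))

# Wrapping the string first keeps the URL in one piece.
print(as_url_list("http://example.com"))
```

Calling as_url_list(urllist) at the top of get_html, or passing the whole list to get_html once instead of calling it per URL, would avoid that situation if it is indeed the cause.
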