Python try and except URL correction (Python 3)

python python-3.x url web-scraping

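I am trying to fetch the HTML of a list of web pages. However, not all of the URLs are written correctly. Most of the invalid URLs in the list use http, while the sites now use https. Some URLs have a "www." that should be removed, and others are missing a "www." that needs to be added.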

def repl_www_http(url):
    x = url.replace("www.", "")
    y = x.replace("http", "https")
    return y

def repl_www(url):
    y = url.replace("www.", "")
    return y

def repl_http(url):
    y = url.replace("http", "https")
    return y

def repl_no_www(url):
    y = url.replace("//", "//www.")
    return y

def get_html(urllist):
        for i in urllist:
            html = ""
            try:
                html = requests.get(i)
                html = html.text
                return html
            except requests.exceptions.ConnectionError:
                try:
                    html = requests.get(repl_http(i))
                    html = html.text
                    print("replaced // with //www.")
                except requests.exceptions.ConnectionError:
                    try:
                        html = requests.get(repl_http(i))
                        html = html.text
                        print("replaced http with https")
                        return html
                    except requests.exceptions.ConnectionError:
                        try:
                            html = requests.get(repl_www(i))
                            html = html.text
                            print("replaced www. with .")
                            return html
                        except requests.exceptions.ConnectionError:
                            try:
                                html = requests.get(repl_www_http(i))
                                html = html.text
                                print("replaced www with . and http with https")
                                return html
                            except requests.exceptions.ConnectionError:
                                return "no HTML found on this URL"
        print("gethtml finished", html)

This is the error I get:

Traceback (most recent call last):
  File "C:\replacer.py", line 76, in <module>
    html = get_html(i)
  File "C:\replacer.py", line 37, in get_html
    html = requests.get(repl_http(i))
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\sessions.py", line 498, in request
    prep = self.prepare_request(req)
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\sessions.py", line 441, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\models.py", line 309, in prepare
    self.prepare_url(url, params)
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\models.py", line 383, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

How can I fix this so that the malformed URLs get corrected?

The problem is that the URL passed to requests.get() raises a MissingSchema error. You should catch that error in the same place where you catch ConnectionError.
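A minimal sketch of that suggestion, assuming the goal is simply to skip any URL that is malformed or unreachable. The helper name `try_fetch` is made up for illustration and is not part of the original code.

```python
import requests


def try_fetch(url):
    """Return the response body as text, or None if the URL cannot be fetched."""
    try:
        return requests.get(url).text
    except (requests.exceptions.MissingSchema,
            requests.exceptions.ConnectionError) as exc:
        # MissingSchema is raised while the request is being prepared,
        # before any network traffic, e.g. for a string such as "h".
        print(f"skipping {url!r}: {type(exc).__name__}")
        return None
```

Both exceptions derive from `requests.exceptions.RequestException`, so catching that base class instead would also cover timeouts and other invalid-URL errors, at the cost of being less specific.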

I also think you should use a generator to clean up the code, because try/except statements should not be nested like this.

import requests

def get_versions_url(my_url):
    # Yield the original URL first, then each corrected variant in turn.
    yield my_url
    yield repl_www(my_url)
    yield repl_http(my_url)
    yield repl_www_http(my_url)

def get_html(urllist):
    # use i only for indexes
    for my_url in urllist:
        for url_fixed in get_versions_url(my_url):
            try:
                # I didn't figure out why you return here instead of finishing the outer loop
                return requests.get(url_fixed).text
            except requests.exceptions.ConnectionError:
                pass
            except requests.exceptions.MissingSchema:
                pass
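Because of the `return`, the function above stops at the first URL in the list that yields a page. If the intent is to get the HTML for every URL, one possible variation is sketched below. The names `get_all_html`, `versions` and `fetch` are hypothetical; `fetch` is a parameter only so the loop can be exercised without network access, and it defaults to `requests.get`.

```python
import requests


def get_all_html(urllist, versions, fetch=requests.get):
    """Map each original URL to the HTML of its first working variant (or None)."""
    results = {}
    for my_url in urllist:
        results[my_url] = None
        for url_fixed in versions(my_url):
            try:
                results[my_url] = fetch(url_fixed).text
                break  # stop at the first variant that works for this URL
            except (requests.exceptions.ConnectionError,
                    requests.exceptions.MissingSchema):
                continue  # try the next variant
    return results
```

It would be called as `get_all_html(urls, get_versions_url)`, where `urls` is a list of strings. URLs for which no variant works keep the value `None`.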

You can then debug the generator on its own. Try:

for url in get_versions_url(<your url>):
    print(url)

I think some of your repl_ functions do not work the way you expect.
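To make that concrete, the sketch below first runs two of the functions from the question on URLs that are already correct, then shows one possible alternative built on the standard library's `urllib.parse`. The generator `url_variants` is a hypothetical helper, not something from the original post, and it assumes that only the scheme and a leading "www." ever need to change.

```python
from urllib.parse import urlsplit, urlunsplit


def repl_http(url):
    return url.replace("http", "https")   # copied from the question


def repl_no_www(url):
    return url.replace("//", "//www.")    # copied from the question


# Plain substring replacement also rewrites URLs that were already fine.
print(repl_http("https://www.example.com"))    # httpss://www.example.com
print(repl_no_www("http://www.example.com"))   # http://www.www.example.com


def url_variants(url):
    """Yield the URL plus scheme/www variants, without duplicates."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return  # nothing sensible to try, e.g. "h" or "www.example.com"
    host = parts.netloc
    bare = host[4:] if host.startswith("www.") else host
    seen = set()
    for scheme in (parts.scheme, "https"):
        for netloc in (host, bare, "www." + bare):
            candidate = urlunsplit(
                (scheme, netloc, parts.path, parts.query, parts.fragment))
            if candidate not in seen:
                seen.add(candidate)
                yield candidate


for candidate in url_variants("http://example.com/page"):
    print(candidate)
```

Splitting the URL into components means the scheme and the host are edited separately, so an "http" or "//" that appears later in the path is left alone.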

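What is repl_http?

def repl_www_http(url): x = url.replace("www.", "") y = x.replace("http", "https") return y def repl_www(url): y = url.replace("www.", "") return y def repl_http(url): y = url.replace("http", "https") return y def repl_no_www(url): y = url.replace("//", "//www.") return y

Can you put this in your question?

Yes! This is my first question on Stackoverflow ;-)

Did you try printing the url that is being parsed? It may be that your repl_http function is not working as expected and is returning only 'h' as the url.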
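Another possible source of the 'h', offered only as a guess because the code around line 76 of replacer.py is not shown: the traceback calls `get_html(i)`, which suggests a single URL string is passed in rather than a list. A `for` loop over a string visits it one character at a time, so the first "URL" would be `h`. The helper `loop_items` below is hypothetical and only mimics what `for i in urllist` would see.

```python
def loop_items(urllist):
    """Return the values a `for i in urllist` loop would iterate over."""
    return [i for i in urllist]


# A bare string is iterated character by character.
print(loop_items("http://example.com")[:4])   # ['h', 't', 't', 'p']

# A list containing that string yields the whole URL.
print(loop_items(["http://example.com"]))     # ['http://example.com']
```

If that is what is happening, calling `get_html([i])` or passing the whole list once would avoid it.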