Python: I can't open a website that exists

python web-scraping

I'm getting an error that leads me to believe my program cannot find a website that I know exists. The website is

https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207

My code looks like this:

from urllib import request as u_r

def strip_webite():
    with u_r.urlopen("https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207") as f:
        pass

if __name__ == "__main__":
    strip_webite()

The error I get is:

  File "database_management.py", line 19, in <module>
    strip_webite()
  File "database_management.py", line 15, in strip_webite
    with u_r.urlopen("https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207") as f:
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

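It looks like Transfermarkt is blocking requests from bots based on the default User-Agent string sent by Python's urllib library, although its own statement does not mention anything about this.

That seems to imply that they don't mind being scraped, but would prefer that we announce who we are.

To do this with urllib: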

from urllib import request as u_r

def strip_webite():
    request = u_r.Request("https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207")
    # Identify the client explicitly instead of sending urllib's default User-Agent
    request.add_header('User-Agent', 'my-cool-app')
    with u_r.urlopen(request) as f:
        pass

if __name__ == "__main__":
    strip_webite()
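To see the effect without sending traffic to the real site, here is a minimal sketch that runs against a local stand-in server. The names PickyHandler and fetch_status are hypothetical, and the blocking rule (404 for any User-Agent starting with "Python-urllib") is an assumption modeled on the behavior described above, not Transfermarkt's actual configuration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request as u_r


class PickyHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: answers 404 to urllib's default User-Agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        status = 404 if agent.startswith("Python-urllib") else 200
        body = b"blocked" if status == 404 else b"hello"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_status(url, user_agent=None):
    """Return the HTTP status code, whether the request succeeds or raises HTTPError."""
    req = u_r.Request(url)
    if user_agent is not None:
        req.add_header("User-Agent", user_agent)
    # An empty ProxyHandler makes the opener ignore proxy environment variables
    opener = u_r.build_opener(u_r.ProxyHandler({}))
    try:
        with opener.open(req, timeout=5) as f:
            return f.status
    except error.HTTPError as e:
        return e.code


# Bind to port 0 so the operating system picks a free port
server = HTTPServer(("127.0.0.1", 0), PickyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

default_status = fetch_status(url)                 # sent with urllib's default User-Agent
custom_status = fetch_status(url, "my-cool-app")   # sent with an explicit User-Agent
print(default_status, custom_status)

server.shutdown()
server.server_close()
```

Catching urllib.error.HTTPError and reading its code attribute also gives a tidier way to report a rejected request than letting the full traceback propagate.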

The error comes from the call to urlopen; it has nothing to do with BeautifulSoup.

I thought that might be the case, but wasn't 100% sure. Any idea why I can't open it?

Sorry, that's a mystery to me too. A web server can return any response code it wants, in particular to deter web crawlers, which the site may explicitly say violate its terms of service. Try adding a header.

So that's the descriptive header you'd use? How about this: headers={'User-Agent': 'notfound404'}. It would get you the same response. When you deal with rain, you also have to deal with the mud.

I don't know what his application does, so I can't come up with a descriptive string. I put in an obvious placeholder that is unlikely to appear in the wild, so that if he really does use it, the site's administrators can block him if they need to.

urllib does send a User-Agent header (it is "Python-urllib/version"), so a website can block urllib's default user agent. Still, this answer solves the problem, so +1.
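The default User-Agent mentioned in the last comment can be inspected without making any network request. The following is a small sketch; the variable names are illustrative, and the exact version suffix depends on the interpreter in use.

```python
from urllib import request as u_r

# Headers that a freshly built opener attaches when the request sets none
default_headers = dict(u_r.build_opener().addheaders)
default_agent = default_headers["User-agent"]
print(default_agent)  # "Python-urllib/" followed by the interpreter's major.minor version

# A Request carrying an explicit header; constructing it sends nothing
req = u_r.Request(
    "https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207",
    headers={"User-Agent": "my-cool-app"},
)
# Request stores header names capitalized, so the lookup key is "User-agent"
custom_agent = req.get_header("User-agent")
print(custom_agent)
```

Passing headers to the Request constructor is equivalent to calling add_header afterwards; either way the explicit value replaces the default one when the request is sent.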