Web scraping: urllib.error.HTTPError: HTTP Error 404: Not Found


I am trying to scrape a table, and this is the code I am using. I have tried all sorts of approaches, but I am new to Python and none of them worked. Does anyone have an idea? In your answer, please indicate where this part of the code should be inserted.

import urllib
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup('https://www.transfermarkt.com/transfers/saisontransfers/statistik?land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&transferfenster=&saison-id=2020&plus=1')
I get this error:

Traceback (most recent call last):
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
  File "C:\Users\GBEM\PycharmProjects\tablepractice\tablescrape.py", line 11, in <module>
soup = make_soup('https://www.transfermarkt.com/transfers/saisontransfers/statistik?land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&transferfenster=&saison-id=2020&plus=1')
  File "C:\Users\GBEM\PycharmProjects\tablepractice\tablescrape.py", line 7, in make_soup
thepage = urllib.request.urlopen(url)
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 531, in open
response = meth(req, response)
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 640, in http_response
response = self.parent.error(
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
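For reference, the failing call can also be wrapped in a try/except so the script reports the HTTP status instead of dying with a traceback. A minimal stdlib-only sketch (the `Mozilla/5.0` User-Agent value is just a placeholder, not a requirement):

```python
import urllib.error
import urllib.request

HEADERS = {"User-Agent": "Mozilla/5.0"}  # placeholder browser-like UA

def fetch(url):
    """Return the response body, or None if the server sends an HTTP error."""
    req = urllib.request.Request(url, headers=HEADERS)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    except urllib.error.HTTPError as e:
        # e.code is the status (e.g. 404), e.reason the message ("Not Found")
        print(f"Request failed: {e.code} {e.reason}")
        return None
```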

To get the correct response from the server, specify a User-Agent HTTP header:

import urllib.request
from bs4 import BeautifulSoup


headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

def make_soup(url):
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)
    return BeautifulSoup(response.read(), 'html.parser')

soup = make_soup('https://www.transfermarkt.com/transfers/saisontransfers/statistik?land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&transferfenster=&saison-id=2020&plus=1')
print(soup)
This prints:

<!DOCTYPE html>

<!-- paulirish.com/2008/conditional-stylesheets-vs-css-hacks-answer-neither/ -->
<!--[if IE 7]>
<html class="ie7 oldie" lang="en"> <![endif]-->
<!--[if IE 8]>
<html class="no-js lt-ie9" lang="en"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en"> <!--<![endif]-->
<head>

..and so on.
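Once the page loads, the table rows can be pulled out of the soup. A minimal sketch on an inline HTML sample (the `items` class name is an assumption for illustration, not verified against Transfermarkt's actual markup):

```python
from bs4 import BeautifulSoup

# Small inline sample standing in for the real page; the "items"
# class is an assumed, illustrative name.
html = """
<table class="items">
  <tr><th>Player</th><th>Fee</th></tr>
  <tr><td>Player A</td><td>100m</td></tr>
  <tr><td>Player B</td><td>80m</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="items")

# One list per row, one stripped string per cell.
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]

print(rows)
```

The same `rows` comprehension can be pointed at the real soup once the right `find(...)` selector for the target table is known.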


Thank you very much! That solved the problem.