HTTP Error 404: Not Found when web scraping with Python


I'm new to Python and not very good at it yet. I'm trying to scrape data from a website called Transfermarkt (I'm a football fan), but when I try to extract the data it gives me HTTP Error 404. Here is my code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://www.transfermarkt.com/chelsea-fc/leihspielerhistorie/verein/631/plus/1?saison_id=2018&leihe=ist"

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

for che in chelsea:
    player = che.tbody.tr.td.table.tbody.tr.td["spielprofil_tooltip tooltipstered"]

print("player: " + player)
The error is:


Any help would be much appreciated.

As Rup mentioned above, your user agent has probably been rejected by the server.

Try augmenting your code with the following:

import urllib.request  # we are going to need to generate a Request object
from bs4 import BeautifulSoup as soup

my_url = "https://www.transfermarkt.com/chelsea-fc/leihspielerhistorie/verein/631/plus/1?saison_id=2018&leihe=ist"

# here we define the headers for the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) Gecko/20100101 Firefox/63.0'}

# this request object will integrate your URL and the headers defined above
req = urllib.request.Request(url=my_url, headers=headers)

# calling urlopen this way will automatically handle closing the request
with urllib.request.urlopen(req) as response:
    page_html = response.read()
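
If the request fails even with a custom header, catching urllib.error.HTTPError lets you inspect the status code instead of crashing. Here is a minimal sketch of that pattern; the fetch helper and the stub opener are only illustrative, not part of the answer's code:

```python
import urllib.error
import urllib.request

def fetch(url, headers=None, opener=urllib.request.urlopen):
    """Fetch url and return (status, body); on an HTTP error, return (code, None)."""
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with opener(req) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, None

# Stub opener simulating a server that rejects the request, so the
# error path can be exercised without touching the network.
def reject(req):
    raise urllib.error.HTTPError(req.full_url, 404, "Not Found", None, None)

status, body = fetch("https://www.transfermarkt.com/", opener=reject)
print(status, body)  # 404 None
```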
With the code above in place, you can carry on with your parsing. The Python documentation has some useful pages on this topic:
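
Once page_html has been read, the parsing step can pick up where your original attempt left off. Below is a sketch using find_all with a class filter; the HTML fragment and the spielprofil_tooltip class (taken from your original snippet) are stand-ins, since Transfermarkt's real markup may differ:

```python
from bs4 import BeautifulSoup

# Stand-in HTML fragment; the real page structure may differ.
page_html = """
<table><tbody><tr><td>
  <a class="spielprofil_tooltip" href="/player1">Eden Hazard</a>
  <a class="spielprofil_tooltip" href="/player2">N'Golo Kante</a>
</td></tr></tbody></table>
"""

page_soup = BeautifulSoup(page_html, "html.parser")

# find_all with class_= returns every <a> tag carrying that class.
players = [a.get_text(strip=True)
           for a in page_soup.find_all("a", class_="spielprofil_tooltip")]
print(players)  # ['Eden Hazard', "N'Golo Kante"]
```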

Mozilla's documentation has a long list of user agent strings to try:


At first glance, it looks like the server is rejecting urllib's default user agent. Try setting a browser-like User-Agent header, as in the snippet above.