HTTP Error 404: Not Found when web scraping with Python
I'm new to Python and not very good at it yet. I'm a football fan, and I'm trying to scrape data from a website called Transfermarkt, but when I try to extract the data it gives me HTTP Error 404. Here is my code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://www.transfermarkt.com/chelsea-fc/leihspielerhistorie/verein/631/plus/1?saison_id=2018&leihe=ist"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
for che in chelsea:
    player = che.tbody.tr.td.table.tbody.tr.td["spielprofil_tooltip tooltipstered"]
    print("player: " + player)
The error is: HTTP Error 404: Not Found. Any help is greatly appreciated.

As Rup mentioned above, your user agent has probably been rejected by the server. Try amending your code with the following:
import urllib.request # we are going to need to generate a Request object
from bs4 import BeautifulSoup as soup
my_url = "https://www.transfermarkt.com/chelsea-fc/leihspielerhistorie/verein/631/plus/1?saison_id=2018&leihe=ist"
# here we define the headers for the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) Gecko/20100101 Firefox/63.0'}
# this request object will integrate your URL and the headers defined above
req = urllib.request.Request(url=my_url, headers=headers)
# calling urlopen this way will automatically handle closing the request
with urllib.request.urlopen(req) as response:
page_html = response.read()
Once the code above is in place, you can carry on with your parsing. The Python documentation has some useful pages on this topic:
Mozilla's documentation has a large list of user-agent strings to try:
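Once the HTML is in hand, the player names can be pulled out with BeautifulSoup's find_all rather than the chained td["..."] indexing from the question, which looks up an attribute literally named "spielprofil_tooltip tooltipstered" and fails. A minimal sketch, using a small hypothetical snippet in Transfermarkt's style (the real page's class names and structure should be verified in the browser's inspector):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking Transfermarkt's player links;
# the real page's structure and class names may differ.
page_html = """
<table class="items">
  <tr><td><a class="spielprofil_tooltip" href="/eden-hazard/profil/spieler/50202">Eden Hazard</a></td></tr>
  <tr><td><a class="spielprofil_tooltip" href="/tammy-abraham/profil/spieler/331726">Tammy Abraham</a></td></tr>
</table>
"""

page_soup = BeautifulSoup(page_html, "html.parser")

# Match elements by CSS class with the class_ keyword, then read the link text.
for link in page_soup.find_all("a", class_="spielprofil_tooltip"):
    print("player:", link.get_text(strip=True))
```

In your scraper you would pass the page_html read from the response instead of the inline snippet; the find_all call stays the same.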
At first glance, it looks like the server is rejecting urllib's default user agent. Give it a try.