HTTP Error 404: Not Found when web scraping with Python
I'm new to Python and not very good at it yet. I'm a football fan, and I'm trying to scrape data from a website called Transfermarkt, but when I try to extract the data it gives me HTTP Error 404. Here is my code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://www.transfermarkt.com/chelsea-fc/leihspielerhistorie/verein/631/plus/1?saison_id=2018&leihe=ist"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
for che in chelsea:
    player = che.tbody.tr.td.table.tbody.tr.td["spielprofil_tooltip tooltipstered"]
    print("player: " + player)
The error is: HTTP Error 404: Not Found. Any help is greatly appreciated.

As Rup mentioned above, your user agent has probably been rejected by the server. Try amending your code with the following:
import urllib.request # we are going to need to generate a Request object
from bs4 import BeautifulSoup as soup
my_url = "https://www.transfermarkt.com/chelsea-fc/leihspielerhistorie/verein/631/plus/1?saison_id=2018&leihe=ist"
# here we define the headers for the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) Gecko/20100101 Firefox/63.0'}
# this request object will integrate your URL and the headers defined above
req = urllib.request.Request(url=my_url, headers=headers)
# calling urlopen this way will automatically handle closing the request
with urllib.request.urlopen(req) as response:
page_html = response.read()
Once the code above is in place, you can carry on with your parsing. The Python documentation has some useful pages on this topic:
Mozilla's documentation has a large list of user-agent strings to try:
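Once the HTML is in hand, the player names can be pulled out with BeautifulSoup's find_all rather than the chained td["..."] indexing from the question, which looks up an attribute literally named "spielprofil_tooltip tooltipstered" and fails. A minimal sketch, using a small hypothetical snippet in Transfermarkt's style (the real page's class names and structure should be verified in the browser's inspector):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking Transfermarkt's player links;
# the real page's structure and class names may differ.
page_html = """
<table class="items">
  <tr><td><a class="spielprofil_tooltip" href="/eden-hazard/profil/spieler/50202">Eden Hazard</a></td></tr>
  <tr><td><a class="spielprofil_tooltip" href="/tammy-abraham/profil/spieler/331726">Tammy Abraham</a></td></tr>
</table>
"""

page_soup = BeautifulSoup(page_html, "html.parser")

# Match elements by CSS class with the class_ keyword, then read the link text.
for link in page_soup.find_all("a", class_="spielprofil_tooltip"):
    print("player:", link.get_text(strip=True))
```

In your scraper you would pass the page_html read from the response instead of the inline snippet; the find_all call stays the same.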
At first glance, it looks like the server is rejecting urllib's default user agent. Give it a try.