Python: 403 error when scraping despite setting a User-Agent in the headers

Tags: python, python-requests

I want to scrape player statistics from a football match website, but I am getting a 403 error. This is my first attempt at scraping.

url=
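The snippet in the question was cut off after `url=`. A minimal reconstruction of such a first attempt might look like the following (the URL is taken from the answer further down; the request is prepared but not sent, so you can inspect exactly what goes on the wire):

```python
import requests

# Minimal reconstruction of the question's first attempt (hypothetical;
# the original snippet was truncated). URL taken from the answer below.
url = ("https://www.whoscored.com/Matches/1375928/LiveStatistics/"
       "England-Premier-League-2019-2020-West-Ham-Manchester-City")
headers = {
    "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; "
                   "rv:68.0) Gecko/20100101 Firefox/68.0"),
}

# Prepare without sending, to show the outgoing request. Actually sending
# it with requests.get(url, headers=headers) returned 403 for the asker.
req = requests.Request("GET", url, headers=headers).prepare()
print(req.method, req.url)
print(req.headers["User-Agent"])
```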

Edit: I can open the web page in a browser (Chrome).

Edit 2: If I run

print(result.status_code)
print(result.headers)
print(result.content)
then I get the following output:

403
{'Content-Type': 'text/html', 'Cache-Control': 'no-cache', 'Connection': 'close', 'Content-Length': '736', 'X-Iinfo': '9-168604272-0 0NNN RT(1566297863307 56) q(0 -1 -1 -1) r(0 -1) B15(4,200,0) U18', 'X-Iejgwucgyu': '1', 'Set-Cookie': 'visid_incap_774904=wSb3+5UxQeC+slK3rAhjswfPW10AAAAAQUIPAAAAAADmqJS6Gs0uzOV2Z5XomjoU; expires=Wed, 19 Aug 2020 06:56:00 GMT; path=/; Domain=.whoscored.com, incap_ses_198_774904=2GHrGcAd9C8niMLwwnK/AgfPW10AAAAAttp7+XadyowHY5iqiWs/Yg==; path=/; Domain=.whoscored.com'}
b'<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?CWUDNSAI=21&xinfo=9-168604272-0%200NNN%20RT%281566297863307%2056%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B15%284%2c200%2c0%29%20U18&incident_id=198003090216026722-548063901729035097&edet=15&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 198003090216026722-548063901729035097</iframe></body></html>'
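The body above is not the real page but an Incapsula (Imperva) bot-challenge page. A small heuristic can flag such responses before you try to parse them (my own helper, not something from the question):

```python
def looks_like_incapsula_block(html: str) -> bool:
    """Return True if the HTML looks like an Incapsula challenge page.

    Challenge pages embed an /_Incapsula_Resource iframe and an
    'Incapsula incident ID' message instead of the real content.
    """
    return "_Incapsula_Resource" in html or "Incapsula incident ID" in html


blocked = ('<iframe id="main-iframe" src="/_Incapsula_Resource?CWUDNSAI=21">'
           'Request unsuccessful. Incapsula incident ID: 123</iframe>')
print(looks_like_incapsula_block(blocked))                            # True
print(looks_like_incapsula_block("<html><body>stats</body></html>"))  # False
```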

You need to add the cookies to the session; then it works. I copied the cookies from my browser:

import requests

session = requests.Session()

# Pretend to be a regular desktop browser.
session.headers.update({'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:68.0) Gecko/20100101 Firefox/68.0"})

# Incapsula and consent cookies copied from the browser's dev tools.
# They are session-specific and expire, so substitute your own values.
session.cookies["visid_incap_774904"]="SRvZ2F36RzuA5U8jaUC8yq3fXF0AAAAAQUIPAAAAAAC/7mBuVWtbzccGROHlxPzv"
session.cookies["incap_ses_964_774904"]="hJHbakasVSAoo8+/rNFgDa7fXF0AAAAA0e9groglmml+odd4mLW2zg=="
session.cookies["_cmpQcif3pcsupported"]="0"
session.cookies["googlepersonalization"]="OloL0IOloL0IgA"
session.cookies["eupubconsent"]="BOloL0IOloL0IAKAYAENAAAA6AAAAA"
session.cookies["euconsent"]="BOloL0IOloL0IAKAYBENCh-AAAAp57v______9______9uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4u_1vf99yfm1-7etr3tp_87ues2_Xur__79__3z3_9phP78k89r7337Ew-v83oA"

resp = session.get("https://www.whoscored.com/Matches/1375928/LiveStatistics/England-Premier-League-2019-2020-West-Ham-Manchester-City")

print(resp.status_code)
print(resp.text)
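Setting each cookie by hand is tedious. As a convenience, the whole string shown by `document.cookie` in the browser console can be loaded into the session in one go (a hypothetical helper, not part of the original answer):

```python
from http.cookies import SimpleCookie

import requests


def load_browser_cookies(session: requests.Session, cookie_header: str) -> None:
    """Parse a 'name=value; name2=value2' string (e.g. copied from the
    browser's document.cookie) into the session's cookie jar."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    for name, morsel in jar.items():
        session.cookies[name] = morsel.value


session = requests.Session()
# Placeholder values; paste your own document.cookie string here.
load_browser_cookies(
    session,
    "visid_incap_774904=abc123; incap_ses_964_774904=def456",
)
print(sorted(session.cookies.keys()))
# → ['incap_ses_964_774904', 'visid_incap_774904']
```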


Your code gives me status 200, even without a User-Agent. Maybe the server had a temporary problem. Can you open the page in a web browser? Or perhaps you made too many requests and the server is blocking you.

@furas I can open it in my web browser. I have added more details to the question. result.content contains the word ROBOTS, which makes me think my request was treated as a bot.

Yes, they don't allow programmatic access by bots. There are ways other than requests.get, but they are somewhat more involved: you can simulate a real web browser, just not with requests. For details, see this question: