How do I scrape a website that uses content encoding with Python?


I'm trying to scrape an online news site:

import requests

st_url = "https://www.straitstimes.com/"
page = requests.get(st_url)

# Output:
ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
I'm still new to this, and I don't know whether this means the site is blocking me from scraping or whether I'm simply doing something wrong.

Besides trying requests, I also looked in Chrome DevTools for an XML API link, but couldn't find one.


Any help would be appreciated. Thanks.

If you turn on debug logging

import logging
logging.basicConfig(level='DEBUG')

…you will see that you are getting a 403 response from the site:

>>> import logging
>>> import requests
>>> logging.basicConfig(level='DEBUG')
>>> st_url = "https://www.straitstimes.com/"
>>> page = requests.get(st_url)
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.straitstimes.com:443
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET / HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET /global HTTP/1.1" 403 345
It looks like the site may be rejecting anything sent with the default user agent that requests uses. I tried making the same request from the command line with curl, and it worked fine.

If I grab a current Firefox user-agent string and make the request with it, it appears to work:

>>> page = requests.get(st_url, headers={'user-agent': 'Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0'})
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.straitstimes.com:443
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET / HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET /global HTTP/1.1" 200 51378
Here you can see the request succeeded with a 200 response.
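If you plan to make several requests, a tidier variant of the same fix is to set the header once on a requests.Session so every request carries it automatically. This is a minimal sketch; the Firefox string below is just an example, and any current browser user-agent string should work:

```python
import requests

# Create a session and give it a browser-like User-Agent once;
# every request made through this session will send that header.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"
})

# session.get("https://www.straitstimes.com/") would now send the
# custom User-Agent instead of requests' default "python-requests/x.y.z".
```

A session also reuses the underlying TCP connection across requests, which is handy when scraping many pages from the same site.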