How do I scrape a website that uses content encoding with Python?


I'm trying to scrape an online news site:

import requests

st_url = "https://www.straitstimes.com/"
page = requests.get(st_url)

# Output:
ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
I'm still new to this, and I don't know whether this means the site is blocking me from scraping or whether I'm simply doing something wrong.

Besides trying requests, I also looked in Chrome DevTools for an XML API link, but couldn't find one.


Any help would be appreciated. Thanks.

If you turn on debug logging

import logging
logging.basicConfig(level='DEBUG')

…you will see that you are getting a 403 response from the site:

>>> import logging
>>> import requests
>>> logging.basicConfig(level='DEBUG')
>>> st_url = "https://www.straitstimes.com/"
>>> page = requests.get(st_url)
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.straitstimes.com:443
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET / HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET /global HTTP/1.1" 403 345
It looks like the site may be rejecting anything sent with the default user agent that requests uses. I tried making the same request from the command line with curl, and it worked fine.

If I grab a current Firefox user-agent string and make the request with it, it appears to work:

>>> page = requests.get(st_url, headers={'user-agent': 'Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0'})
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.straitstimes.com:443
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET / HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET /global HTTP/1.1" 200 51378
Here you can see the request succeeded with a 200 response.
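If you plan to make several requests, a tidier variant of the same fix is to set the header once on a requests.Session so every request carries it automatically. This is a minimal sketch; the Firefox string below is just an example, and any current browser user-agent string should work:

```python
import requests

# Create a session and give it a browser-like User-Agent once;
# every request made through this session will send that header.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"
})

# session.get("https://www.straitstimes.com/") would now send the
# custom User-Agent instead of requests' default "python-requests/x.y.z".
```

A session also reuses the underlying TCP connection across requests, which is handy when scraping many pages from the same site.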