Python: How do I fix these errors caused by a server blocking web scraping?


python web-crawler


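I am trying to use the 'get_text' function to extract the text from a web page, as described earlier.

This works fine for that particular website, but when I try to scrape a different website I get a 403 error: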

import urllib.request
from inscriptis import get_text

url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)

This raises the 403 error at the line

html = urllib.request.urlopen(url).read().decode('utf-8')

I tried to fix it by specifying a user agent, as follows:

import urllib.request
from inscriptis import get_text

url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url, headers={'User-Agent': 'Mozilla/5.0'}).read().decode('utf-8')

text = get_text(html)

print(text)


But I got the following error:

TypeError: urlopen() got an unexpected keyword argument 'headers'


Since urlopen does not accept a headers argument, I tried using the requests module to specify the user agent, as follows:


from inscriptis import get_text
import requests
url = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', "lxml", headers={'User-Agent': 'Mozilla/5.0'})

print(get_text(url))

But this produces the following error:


AttributeError: 'Response' object has no attribute 'strip'

How do I get this damn server to stop blocking my web crawl?

You need to process the body of the response, not the response object itself:

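from inscriptis import get_text
import requests

# Pass only the URL and headers: the second positional argument of
# requests.get is params, so the stray "lxml" would be sent as a query string.
response = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', headers={'User-Agent': 'Mozilla/5.0'})

# get_text expects the HTML as a string, which is response.text
print(get_text(response.text))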
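
For the urllib attempt in the question, the TypeError comes from the fact that urlopen() has no headers keyword; headers are attached to a urllib.request.Request object, which is then passed to urlopen(). The following is a minimal sketch of that approach, not a tested fix for this particular site: the helper names build_request and fetch_html are made up for illustration, and it assumes that the same User-Agent value and UTF-8 decoding used in the question are adequate.

```python
import urllib.request

# Same value the question uses; assumed (not verified) to satisfy the server.
USER_AGENT = "Mozilla/5.0"


def build_request(url):
    """Return a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Fetch the page and decode the body as UTF-8 (encoding assumed, as in the question)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With this in place, the failing line in the question would become html = fetch_html(url), and get_text(html) would be called as before.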

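All of these errors seem unrelated. The first is about access to a resource on the website: the server is simply forbidding you access. The second is a wrong keyword argument for the function's API. The third seems to come from something buried deeper in your code, and you would need to provide more error context for it (the code you posted does not use an object named 'Response').

That gives the error AttributeError: 'Response' object has no attribute 'body'.

Try response.text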

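@MartinEvans OK, that works. However, it prints all of the text (including text from other links on the page, ads posted on the page, and so on). I only want the text of the main article, which is a small part of what is being printed.

OK, solved it. I used BeautifulSoup to find the div with class 'Normal' and extract the body text from it.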
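
The BeautifulSoup step described in the last comment could look roughly like the sketch below. It assumes, as the comment states, that the article body sits in div elements with class 'Normal'; the function name extract_article_text is hypothetical, and the site's markup may have changed since the thread was written.

```python
from bs4 import BeautifulSoup


def extract_article_text(html, css_class="Normal"):
    """Return the text of every <div> with the given class, ignoring the rest of the page."""
    # html.parser is the standard-library parser, so lxml is not required.
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.find_all("div", class_=css_class)
    # Join the text nodes inside each block with spaces; separate blocks with a blank line.
    return "\n\n".join(block.get_text(" ", strip=True) for block in blocks)
```

It would be called on the downloaded page, for example extract_article_text(response.text), in place of passing the whole page to get_text.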