Python: How do I fix these errors caused by the server blocking web scraping?

Tags: python, web-crawler

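I am trying to use the "get_text" function to extract the text from a web page, as described previously.

This works fine for that particular website, but when I try to scrape a different website, I get a 403 error: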

import urllib.request
from inscriptis import get_text

url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)
This fails with the 403 error on the following line:

html = urllib.request.urlopen(url).read().decode('utf-8')


I tried to fix it by specifying a user agent, as follows:

import urllib.request
from inscriptis import get_text

url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url, headers={'User-Agent': 'Mozilla/5.0'}).read().decode('utf-8')

text = get_text(html)

print(text)
But I got the following error:

TypeError: urlopen() got an unexpected keyword argument 'headers'

Since urlopen() does not accept a headers keyword argument, I tried using the requests module to specify the user agent instead:

from inscriptis import get_text
import requests
url = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', "lxml", headers={'User-Agent': 'Mozilla/5.0'})

print(get_text(url))
But this gives the following error:

AttributeError: 'Response' object has no attribute 'strip'

How do I get this damn server to stop blocking my web crawl?

You need to work with the body of the response, not the response object itself:

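from inscriptis import get_text
import requests

# The stray "lxml" argument is dropped: the second positional argument of requests.get() is params, not a parser name.
response = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', headers={'User-Agent': 'Mozilla/5.0'})

# Pass the decoded body (a str) to get_text, not the Response object.
print(get_text(response.text))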
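
The TypeError in the question comes from passing headers directly to urlopen(), which has no such keyword; with the standard library, headers are attached to a urllib.request.Request object instead. The following is only a minimal sketch of that approach: build_request and fetch_html are hypothetical helper names, the timeout value is arbitrary, and UTF-8 decoding is assumed as in the question. A browser-like User-Agent may still not be enough for every site.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a User-Agent header.

    urlopen() takes no headers keyword, so the header goes on the Request.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent="Mozilla/5.0", timeout=10):
    """Download the page and decode it as UTF-8 (assumed encoding)."""
    req = build_request(url, user_agent)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

The string returned by fetch_html(url) could then be passed to get_text() exactly as in the question's first snippet.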

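None of these errors seem to be related: the first is a resource-access error, where the website simply forbids you on the server side; the second is a wrong keyword passed to the function's API; the third seems tied to something buried deeper in the code, and is hard to diagnose unless you provide more error context (you do not use an object named "Response" in the code you posted).

AttributeError: 'Response' object has no attribute 'body'. Try response.text instead.

@MartinEvans OK, that works. However, it prints all the text on the page (including text from other links, ads posted on the page, and so on). I only want the text of the main article, which is just a small part of what gets printed.

OK, solved it. I used BeautifulSoup to find the div with class "Normal" and extract the body text from it.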
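
The last comment does not show its code, so the following is only a guess at what such an extraction might look like. It assumes the article body sits in the first div with class "Normal", as the comment states; extract_article_text is a hypothetical helper name, and the built-in "html.parser" backend is used so that lxml is not required.

```python
from bs4 import BeautifulSoup


def extract_article_text(html, css_class="Normal"):
    """Return the text of the first <div> with the given class, or "" if none exists."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("div", class_=css_class)
    if article is None:
        return ""
    # Join the text fragments with single spaces and trim surrounding whitespace.
    return article.get_text(separator=" ", strip=True)
```

With the answer's code, this would be called as extract_article_text(response.text) in place of get_text(response.text). The class name is specific to that site's markup at the time and may well have changed since.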