Python raise NotSupported("Response content isn't text") - scrapy.exceptions.NotSupported: Response content isn't text

After a few days I am getting the same error again. I can't solve it!! I really don't understand what is wrong in my code. I have previously fixed a similar error message by changing the "link" part, but now that no longer works. Can someone help me?
# -*- coding: utf-8 -*-
import scrapy
import re
import numbers
from amazon_test.items import AmazonTestItem
from urllib.parse import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class AmazonSellersSpider(CrawlSpider):  # scrapy.Spider
    name = 'AmazonFR'
    allowed_domains = ['amazon.fr']
    start_urls = ['https://www.amazon.fr']
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse'),
    )

    def parse(self, response):
        item = AmazonTestItem()
        link = response.xpath('//div[@class="a-column a-span6"]/h3[@id="-component-heading"]/text()')
        if link:
            wait = response.xpath('//div[@class="a-column a-span6"]/h3[@id="-component-heading"]/text()').extract()
            if len(wait) != 0:
                name = response.xpath('//div[@class="a-row a-spacing-medium"]/div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[@class="a-list-item"]/span[contains(.,"Nom")]/following-sibling::text()').extract()
                phone = response.xpath('//div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[@class="a-list-item"]/span[contains(.,"Téléphone")]/following-sibling::text()').extract()
                registre = response.xpath('//div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[@class="a-list-item"]/span[contains(.,"registre de commerce")]/following-sibling::text()').extract()
                TVA = response.xpath('//div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[@class="a-list-item"]/span[contains(.,"TVA")]/following-sibling::text()').extract()
                address = response.xpath('//div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[span[contains(.,"Adresse")]]/ul//li//text()').extract()
                item['Business_name'] = ''.join(name).strip()
                item['Phone_number'] = ''.join(phone).strip()
                item['VAT_number'] = ''.join(TVA).strip()
                item['Address'] = '\n'.join(address).strip()
                item['Registre_commerce'] = ''.join(registre).strip()
                yield item
        else:
            for sel in response.xpath('//html/body'):
                item = AmazonTestItem()
                list_urls = sel.xpath('//a/@href').extract()
                for url in list_urls:
                    yield scrapy.Request(response.urljoin(url), callback=self.parse, meta={'item': item})
The error message is:
Traceback (most recent call last):
File "C:\Users\paulpo\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\paulpo\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\paulpo\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\paulpo\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\paulpo\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\paulpo\Documents\amazon_test\amazon_test\spiders\AmazonFR.py", line 21, in parse
link = (response.xpath('//div[@class="a-column a-span6"]/h3[@id="-component-heading"]/text()')).extract
File "C:\Users\paulpo\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\http\response\__init__.py", line 105, in xpath
raise NotSupported("Response content isn't text")
scrapy.exceptions.NotSupported: Response content isn't text
Aren't you missing .extract() on link? — I just wanted to check whether text with this structure exists first. Do you think I need to add extract()? Because if there is no text, it won't be able to extract it. — That's not it: this error means the HTTP response could not be decoded as HTML or XML, so you cannot call .xpath() on it at all. It would be informative to print the first few bytes of the raw body, and perhaps some headers, e.g. self.logger.debug((response.headers, response.body[:256]))
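The failure mode described above (Scrapy raising NotSupported because the downloaded body is binary, e.g. an image or a PDF picked up by the unrestricted LinkExtractor) can be guarded against before calling .xpath(). Inside a spider the usual check is isinstance(response, scrapy.http.TextResponse); the standalone sketch below instead inspects the Content-Type header, with a hypothetical helper name of my own choosing:

```python
# Hypothetical helper: decide from a raw Content-Type header whether the
# response body can be parsed as text (and therefore queried with .xpath()).
def is_text_response(content_type: bytes) -> bool:
    # Normalize: b"text/html; charset=UTF-8" -> "text/html"
    mime = content_type.decode("latin-1").split(";")[0].strip().lower()
    return (
        mime.startswith("text/")
        or mime in ("application/xml", "application/xhtml+xml", "application/json")
        or mime.endswith("+xml")
    )

# Sketch of how it could be used at the top of parse(), so binary
# responses are skipped instead of letting .xpath() raise NotSupported:
#     if not is_text_response(response.headers.get(b"Content-Type", b"")):
#         self.logger.debug("Skipping non-text response: %s", response.url)
#         return
```

Note that narrowing the LinkExtractor (e.g. restricting allow= to the seller-profile URL pattern) would avoid downloading most of these binary resources in the first place.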
Mmh, now I get this message: INFO: Ignoring response: HTTP status code is not handled or not allowed. I put your self.logger(...) call in my parse function... Any ideas? — I can only suggest that you read Amazon's terms and conditions. If they allow scraping in some form (I don't know), make sure you respect robots.txt and .
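The follow-up message ("Ignoring response ...: HTTP status code is not handled or not allowed") comes from Scrapy's HttpErrorMiddleware, which drops every non-2xx response before it reaches parse() — plausibly Amazon answering the bot with an error status such as 503. A rough re-implementation of the rule that middleware applies, for illustration only (the helper name is mine; the real knobs are the HTTPERROR_ALLOWED_CODES / HTTPERROR_ALLOW_ALL settings and the spider's handle_httpstatus_list attribute):

```python
# Illustration of the filtering rule behind "Ignoring response ...:
# HTTP status code is not handled or not allowed".
def is_response_handled(status: int, allowed_codes=(), handle_all: bool = False) -> bool:
    if handle_all:             # mirrors HTTPERROR_ALLOW_ALL / handle_httpstatus_all
        return True
    if 200 <= status < 300:    # 2xx responses always reach the callback
        return True
    return status in allowed_codes  # mirrors HTTPERROR_ALLOWED_CODES / handle_httpstatus_list
```

So if you want to inspect those rejected responses in parse(), setting handle_httpstatus_list = [503] (or whichever status the log shows) on the spider class should let them through — though if Amazon is actively blocking the crawler, that only lets you see the block page, not bypass it.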