Scrapy spider works fine, but fails to scrape some of the results
The spider works fine and returns about 208 product records, but for some products it returns no details. I ran those product links individually in scrapy shell and they work fine there, so why does it miss about 25% of the product details? I tried rotating user agents and applying different XPaths, but without success.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from ..items import AmazonItem
import time
from scrapy.linkextractors import LinkExtractor
import urllib.parse


class QuotesSpider(scrapy.Spider):
    name = 'pet'
    start_urls = ['https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&qid=1567115653&rnid=1632651031&ref=sr_nr_p_89_1',
                  'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&page=2',
                  'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&page=3',
                  'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&page=4',
                  'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&page=5'
                  ]

    def parse(self, response):
        links = response.xpath("//h2/a[contains(@href,'/dp')]/@href").extract()
        urll = ['https://www.amazon.co.uk' + link for link in links]
        urls = urll
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        global name1
        global sales_rank11
        global price1
        global prime1
        list = AmazonItem()
        name = response.xpath(".//*[(@id ='productTitle')]/text()").extract_first()
        if name is None:
            name1 = name
            self.logger.info('skip')
        else:
            name1 = name.replace('\n', '').strip()
        price = response.xpath("//span[@id='price_inside_buybox']/text()").get()
        if price is None:
            price1 = response.xpath("//span[@class='a-color-price']/text()").get()
            if price1 is None:
                price1 = 'No Price Avaiable'
                self.logger.info('skip')
        else:
            price1 = price.replace('\n', '').replace(' ','')
        prime = response.xpath("//span[@id='price-shipping-message']/b").get()
        if prime is None:
            prime1 = 'Not Prime'
        else:
            prime1 = 'Prime'
        sales_rank1 = response.xpath("//tr[@id='SalesRank']/td[@class='value']/text()").get()
        if sales_rank1 is None:
            sales_rank11 = 'No Sales Rank Available'
        else:
            sales_rank11 = sales_rank1.replace('(','').replace('\n','')
        list['Name'] = name1
        list['Price'] = price1
        list['SalesRank'] = sales_rank11
        list['Prime'] = prime1
        list['Url'] = response.url
        yield list
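Am I missing something?

Can you check the indentation of the code you posted, and post your AmazonItem code? Preferably remove any unnecessary parts from the example.

Check the response you are getting: if you are detected as a bot, Amazon may return an incomplete response.

Thanks for your reply. I found it: some pages return a captcha response. Is there any way to prevent that?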
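One way to confirm the diagnosis is to test each response for a captcha page before parsing it, and to slow the crawl down. The following is only a minimal sketch: the marker strings in CAPTCHA_MARKERS and the name looks_like_captcha are assumptions that should be checked against a saved captcha response, and the setting values are illustrative rather than tuned. The setting names themselves are standard Scrapy settings. Slower crawling may reduce how often captchas appear, but it is not guaranteed to prevent them.

```python
# Assumed marker strings; verify them against a captcha page you actually received.
CAPTCHA_MARKERS = (
    "/errors/validateCaptcha",
    "Type the characters you see in this image",
)


def looks_like_captcha(html: str) -> bool:
    """Return True if the page body contains any of the assumed captcha markers."""
    return any(marker in html for marker in CAPTCHA_MARKERS)


# Standard Scrapy setting names; the values are illustrative guesses.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 2,                  # seconds to wait between requests
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
}

# Possible use inside the spider from the question:
#     custom_settings = POLITE_SETTINGS
#
#     def parse_details(self, response):
#         if looks_like_captcha(response.text):
#             self.logger.warning('captcha page for %s', response.url)
#             return
#         ...
```

Logging the captcha hits this way at least makes the missing 25% visible in the crawl log instead of producing silently empty items.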