Python: How can I make Scrapy keep only the lowest-priced item?

python scrapy

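The website I'm scraping lists multiple products that share the same ID but have different prices. I only want to keep the lowest-priced version.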

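from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):

    def __init__(self):
        # Maps each product ID to the lowest sale price seen so far.
        self.ids_seen = dict()

    def process_item(self, item, spider):
        item_id = item['ID']
        # Drop the item if an equal or cheaper one with this ID was already seen.
        if item_id in self.ids_seen and item['sale_price'] >= self.ids_seen[item_id]:
            raise DropItem("Duplicate item found: %s" % item)
        # First sighting, or cheaper than before: record the price and pass it on.
        self.ids_seen[item_id] = item['sale_price']
        return item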

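So this code should drop items that are priced higher than one seen before, but I don't know how to update the earlier item if a later one turns out to be cheaper.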

# -*- coding: utf-8 -*-
import scrapy
import re


class ExampleSpider(scrapy.Spider):
    name = 'name'
    allowed_domains = ['domain1', 'domain2']
    start_urls = ['url1', 'url2']

    def parse(self, response):
        # follow links to the individual product pages
        for href in response.css('div.catalog__main__content .c-product-card__name::attr("href")').extract():
            # response.urljoin resolves relative links against response.url
            url = response.urljoin(href)
            yield scrapy.Request(url=url, callback=self.parse_product)

        # follow pagination links
        href = response.css('.c-paging__next-link::attr("href")').extract_first()
        if href is not None:
            url = response.urljoin(href)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_product(self, response):
        # process the response here (omitted because it's long and doesn't add anything)
        yield {
            'product-name': name,
            'price-sale': price_sale,
            'price-regular': price_regular[:-1],
            'raw-sku': raw_sku,
            'sku': sku.replace('_', '/'),
            'img': response.xpath('//img[@class="itm-img"]/@src').extract()[-1],
            'description': response.xpath('//div[@class="product-description__block"]/text()').extract_first(),
            'url': response.url,
        }

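You can't do this with a pipeline alone, because pipelines process items on the fly. In other words, each item is returned as soon as it is scraped, without waiting for the spider to finish.

However, if you have a database, you can work around this.

In semi-pseudocode: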

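class DbPipeline(object):

    def __init__(self):
        # Pseudocode: connect to your database here.
        self.connection = ...

    def process_item(self, item, spider):
        # get/remove/add stand in for your database client's own calls.
        db_item = self.connection.get(item['ID'])
        if db_item is None:
            # First time this ID is seen: store it.
            self.connection.add(item)
        elif item['price'] < db_item['price']:
            # Cheaper than the stored version: replace it.
            self.connection.remove(item['ID'])
            self.connection.add(item)
        return item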

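You will still get unfiltered results in Scrapy's output, but your database will end up holding only the lowest-priced version of each product.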

Personally, I'd suggest using a document-based or key-value database such as Redis.
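
To make the database idea concrete, here is a minimal sketch of such a pipeline. It swaps in Python's built-in sqlite3 module for the suggested Redis, purely so the example runs without a server. The class name `SqliteLowestPricePipeline`, the `db_path` parameter, the `products` table layout, and the `ID` and `price` item keys are all assumptions made for illustration, and `price` is assumed to be a number already.

```python
import sqlite3


class SqliteLowestPricePipeline(object):
    """Hypothetical pipeline: keeps the cheapest version of each ID in SQLite."""

    def __init__(self, db_path=':memory:'):
        # db_path and the table layout are assumptions made for this sketch.
        self.connection = sqlite3.connect(db_path)
        self.connection.execute(
            'CREATE TABLE IF NOT EXISTS products ('
            'id TEXT PRIMARY KEY, price REAL NOT NULL)'
        )

    def process_item(self, item, spider):
        row = self.connection.execute(
            'SELECT price FROM products WHERE id = ?', (item['ID'],)
        ).fetchone()
        # Store the item if its ID is new or this version is cheaper.
        if row is None or item['price'] < row[0]:
            self.connection.execute(
                'INSERT OR REPLACE INTO products (id, price) VALUES (?, ?)',
                (item['ID'], item['price']),
            )
            self.connection.commit()
        # The item is still passed on, so Scrapy's own output stays unfiltered.
        return item

    def close_spider(self, spider):
        self.connection.close()
```

The default `':memory:'` database disappears when the connection closes, so a real crawl would presumably pass a file path instead. The pipeline would be enabled through the `ITEM_PIPELINES` setting like any other.
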
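Do you know the product IDs before you start? If so, normal site behaviour should let you search with results sorted by price, low to high, so you can scrape just the first item returned for each product ID. That avoids the need for any pipeline processing.
If not, you could do it in two steps: first search all products to collect the IDs, then run the process above for each ID.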
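
The "first item per ID" idea can be sketched as a small generator. It assumes the rows have already been parsed into dicts with `ID` and `price` keys, and that they arrive in ascending price order, for example from a single listing page sorted low to high. The function name `first_per_id` is hypothetical. Scrapy fetches pages concurrently, so items coming from different responses are not guaranteed to arrive in listing order; this sketch only holds for rows processed in the order they appear on the page.

```python
def first_per_id(items_sorted_by_price):
    """Yield only the first item seen for each ID.

    If the input is sorted by ascending price, the first item
    for an ID is also its cheapest one.
    """
    seen = set()
    for item in items_sorted_by_price:
        if item['ID'] not in seen:
            seen.add(item['ID'])
            yield item


# Example rows, already sorted by price from low to high.
rows = [
    {'ID': 'a', 'price': 5.0},
    {'ID': 'b', 'price': 6.0},
    {'ID': 'a', 'price': 8.0},
]
cheapest = list(first_per_id(rows))
```

Here `cheapest` holds the 5.0 version of `a` and the only version of `b`; the 8.0 duplicate of `a` is skipped.
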
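Is there no option at all to do this through a custom item exporter? I'd like to run this on Scrapinghub, and I've already set up a system that uses their API.
@TahaAttari You probably could, but that would mean keeping all the data in a buffer before writing it to a file, which is a bad idea unless your crawl is very small.
Which website are you scraping? What is the spider's code?
@Umair I can't tell you the website, but I've included the spider's code. Not sure whether it's relevant to the question, but here it is.
I don't know the IDs, but starting from a page sorted by price from low to high is a great idea! Thanks.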
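
For a small crawl, the buffering approach mentioned in the comments could look roughly like the sketch below. It uses a pipeline with a `close_spider` hook rather than a custom item exporter, which is a substitution on my part. The class name `LowestPriceBufferPipeline`, the `output_path` parameter, and the `ID` and `price` keys are assumptions. Memory use grows with the number of distinct IDs, so this is only reasonable when they all fit in memory.

```python
import json


class LowestPriceBufferPipeline(object):
    """Hypothetical pipeline: buffers the cheapest item per ID, writes at the end."""

    def __init__(self, output_path='lowest_prices.jl'):
        # output_path is an assumed name for the JSON Lines output file.
        self.output_path = output_path
        self.cheapest = {}

    def process_item(self, item, spider):
        current = self.cheapest.get(item['ID'])
        # Keep the item if its ID is new or it beats the buffered price.
        if current is None or item['price'] < current['price']:
            self.cheapest[item['ID']] = dict(item)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: write one JSON object per line.
        with open(self.output_path, 'w') as f:
            for item in self.cheapest.values():
                f.write(json.dumps(item) + '\n')
```

The file only appears after the spider closes, so nothing is saved if the process is killed mid-crawl.
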