Python Scrapy: wait for some URLs to be parsed, then do something

python scrapy

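I have a spider that needs to find product prices. The products are grouped in batches (from a database), and it would be nice to have a batch status (RUNNING, DONE) along with start_time and finished_time attributes. So I have something like: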

from datetime import datetime

import scrapy

# Batches is the database model that groups the products (its import is not shown).


class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: This is going to
                          # execute before the last product
                          # url is scraped, right?

    def parse(self, response):
        #...
        pass
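To see why the status update fires too early, it may help to look at the generator in isolation. The following is a simplified sketch in plain Python, not Scrapy itself: `fake_start_requests`, the URLs, and the `events` log are made up for illustration, and the real engine pulls start requests lazily rather than all at once. It is only meant to show that code placed after the last `yield` runs as soon as the consumer exhausts the generator, whether or not the yielded requests have been processed.

```python
def fake_start_requests(events):
    """Stand-in for start_requests: yields 'requests', then marks the batch done."""
    for url in ["http://example.com/p1", "http://example.com/p2"]:
        yield url
    # Runs when the generator is exhausted, not when the responses arrive.
    events.append("batch marked DONE")


events = []

# Step 1: the consumer (standing in for the scheduler) drains the generator.
pending = list(fake_start_requests(events))

# Step 2: only afterwards are the queued 'requests' actually handled.
for url in pending:
    events.append("parsed " + url)

print(events)
```

In this toy run the "batch marked DONE" entry is recorded before either "parsed" entry, which is the ordering problem the question is worried about.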

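For this kind of task, you can use the signal that Scrapy fires when the spider is closed to bind a function that runs once the spider has finished crawling.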
Here is the trick: with each request, send the batch id, the total number of products in the batch, and the number processed so far in the batch, so that any function anywhere can check them.

for batch in Batches.objects.all():
    processed_this_batch = 0
    # TODO: Get some batch_id here
    # TODO: Find a way to check total number of products in this batch and assign to `total_products_in_this_batch`

    for prod in batch.get_products():
        processed_this_batch += 1
        yield scrapy.Request(prod.get_scrape_url(), meta={
            'prod': prod, 'batch_id': batch_id,
            'total_products_in_this_batch': total_products_in_this_batch,
            'processed_this_batch': processed_this_batch})

Anywhere in the code, for any particular batch, check whether processed_this_batch == total_products_in_this_batch, and if so, save the batch.
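One caveat with carrying `processed_this_batch` as a per-request snapshot: responses are not guaranteed to come back in the order the requests were yielded. The sketch below is plain Python with made-up meta dicts and an assumed arrival order; `is_last` is a hypothetical helper, not part of Scrapy. It illustrates that the equality check can succeed while other responses of the same batch are still outstanding.

```python
def is_last(meta):
    """The equality check described above, applied to one response's meta."""
    return meta["processed_this_batch"] == meta["total_products_in_this_batch"]


total = 3
# One meta dict per request, numbered in the order the requests were yielded.
metas = [
    {"batch_id": 1, "total_products_in_this_batch": total, "processed_this_batch": n}
    for n in range(1, total + 1)
]

# Assumed arrival order: the response for the third request comes back first.
arrival_order = [metas[2], metas[0], metas[1]]

handled = 0
handled_when_check_fired = None
for meta in arrival_order:
    handled += 1
    if is_last(meta) and handled_when_check_fired is None:
        handled_when_check_fired = handled

print(handled_when_check_fired)
```

Counting completed responses inside the callback, rather than numbering requests up front, is one way to avoid this.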
I made some adaptations to @Umair's suggestion and came up with a solution that works great for my case:
from datetime import datetime

import scrapy


class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            counter = {'curr': 0, 'total': len(products)}  # the counter dictionary
                                                           # for this batch
            for prod in products:
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod,
                                           'batch': batch,
                                           'counter': counter})
                                     # trick = add the counter in the meta dict

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        self.increment_counter(batch, counter) # increment counter only after
                                               # the work is done

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
                          # Well, almost...
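The approach relies on every request of a batch carrying a reference to the same counter dictionary, so an increment made while handling one response is visible when handling the next. A minimal plain-Python sketch of that idea follows; `FakeBatch` and the meta dicts are stand-ins invented for the example, not the asker's model or Scrapy objects.

```python
class FakeBatch:
    """Hypothetical stand-in for the database-backed batch object."""

    def __init__(self):
        self.status = "RUNNING"


def increment_counter(batch, counter):
    """Count one finished item; mark the batch done on the last one."""
    counter["curr"] += 1
    if counter["curr"] == counter["total"]:
        batch.status = "DONE"


batch = FakeBatch()
products = ["a", "b", "c"]
counter = {"curr": 0, "total": len(products)}

# Every meta dict points at the same counter object.
metas = [{"prod": p, "batch": batch, "counter": counter} for p in products]

statuses = []
for meta in metas:
    increment_counter(meta["batch"], meta["counter"])
    statuses.append(batch.status)

print(statuses)
```

The batch only flips to DONE on the third call, because all three meta dicts share one counter.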

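The spider works fine as long as all the requests yielded by start_requests have different URLs. If there are any duplicates, Scrapy will filter them out and never call your parse method for them, so you end up with counter['curr'] < counter['total'] and the batch status is left as RUNNING forever.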
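The failure mode can be reproduced without Scrapy. The sketch below uses a plain set as a stand-in for the duplicates filter, with made-up URLs; it is only meant to show the arithmetic, not how Scrapy's filter is implemented.

```python
urls = [
    "http://example.com/p1",
    "http://example.com/p2",
    "http://example.com/p1",  # duplicate of the first URL
]
counter = {"curr": 0, "total": len(urls)}

seen = set()  # stand-in for the duplicates filter
for url in urls:
    if url in seen:
        continue  # filtered: the callback never runs for this request
    seen.add(url)
    counter["curr"] += 1  # what parse() would do via increment_counter

print(counter)
```

With one duplicate among three URLs, the counter stops at 2 of 3 and the completion branch is never reached.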
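As it turns out, you can override Scrapy's behaviour for duplicates. First, we need to change settings.py to specify an alternative "duplicates filter" class (Scrapy's DUPEFILTER_CLASS setting).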
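A sketch of what that setting might look like. `DUPEFILTER_CLASS` is the Scrapy setting that selects the duplicates filter; the dotted path is a placeholder and has to match wherever `MyDupeFilter` actually lives in your project.

```python
# settings.py
# Hypothetical module path: adjust it to the module that defines MyDupeFilter.
DUPEFILTER_CLASS = "myproject.dupefilters.MyDupeFilter"
```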
Then, we create the MyDupeFilter class, which lets the spider know when there is a duplicate:
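from scrapy.dupefilters import RFPDupeFilter


class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        # Keep the default logging, then notify the spider about the duplicate.
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)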

Then, we modify the spider so that it increments the counter when a duplicate is found:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    #...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)

And we are good to go.
Interesting, I can see that the signals could be useful. In this case, though, "closed" might not be the right option (since the spider will handle multiple batches and, ideally, I'd like to know when each batch has completed).
It didn't work exactly as you suggested (I had to increment the counter in the parse method; if I do it before yielding the request like that, I end up marking the batch as done before it has truly finished). But your suggestion did point me in the right direction, so thanks a lot! By the way, I ended up answering the question with my complete solution.
I didn't run the code when posting... I just posted my idea of how you could achieve what you want.
I'm not complaining or anything, just wanted to point out how it turned out :-) Your post pointed me in the right direction. Thanks!