Python: How to pause a spider in Scrapy

python web-scraping scrapy

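I'm new to Scrapy and need to pause the spider after receiving an error response (such as 407 or 429). I also need to do this without using time.sleep(), and with a middleware or an extension.

Here is my middleware: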

from scrapy import signals
from pydispatch import dispatcher

class Handle429:
    def __init__(self):
        dispatcher.connect(self.item_scraped, signal=signals.item_scraped)

    def item_scraped(self, item, spider, response):
        if response.status == 429:
            print("THIS IS 429 RESPONSE")
            #
            # here stop spider for 10 minutes and then continue
            #

I've read about self.crawler.engine.pause(), but how can I use it in a middleware and set a custom duration for the pause?

Or is there another way to do this? Thanks.
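One possible way to combine engine.pause() with a custom pause duration is sketched below. It is only a sketch under stated assumptions: the class name PauseOnErrorMiddleware, the PAUSE_SECONDS setting and the call_later parameter are hypothetical names introduced for illustration. It relies on crawler.engine.pause() and crawler.engine.unpause(), which stop and restart the scheduling of new requests; requests already in flight are probably not affected.

```python
class PauseOnErrorMiddleware:
    """Downloader middleware sketch: pause the engine after selected statuses."""

    def __init__(self, crawler, pause_seconds=600, statuses=(407, 429), call_later=None):
        self.crawler = crawler
        self.pause_seconds = pause_seconds
        self.statuses = set(statuses)
        # call_later is a hypothetical injection point so the timer can be faked
        self._call_later = call_later
        self.paused = False

    @classmethod
    def from_crawler(cls, crawler):
        # PAUSE_SECONDS is an assumed custom setting, not a built-in Scrapy one
        return cls(crawler, pause_seconds=crawler.settings.getint("PAUSE_SECONDS", 600))

    def process_response(self, request, response, spider):
        if response.status in self.statuses and not self.paused:
            self.paused = True
            self.crawler.engine.pause()  # stop scheduling new requests
            self._schedule(self.pause_seconds, self._resume)
        return response

    def _schedule(self, delay, func):
        if self._call_later is not None:
            return self._call_later(delay, func)
        # Import the reactor lazily so the project's TWISTED_REACTOR choice is respected
        from twisted.internet import reactor
        return reactor.callLater(delay, func)

    def _resume(self):
        self.paused = False
        self.crawler.engine.unpause()  # continue crawling
```

The response that triggered the pause is still passed on to the spider here; returning a copy of the request instead would be one way to retry it after the pause.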

I've solved my problem. First, a middleware can define the standard hook methods, such as process_response or process_request.

In settings.py:

HTTPERROR_ALLOWED_CODES = [404]
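For context, a settings.py fragment that also registers the middleware might look like the sketch below. The module path myproject.middlewares and the order value 543 are assumptions made for illustration; they are not taken from the original post.

```python
# settings.py (sketch)

# Let 404 responses reach the spider callbacks instead of being filtered out
HTTPERROR_ALLOWED_CODES = [404]

# Register the custom downloader middleware.
# "myproject.middlewares" is a hypothetical module path and 543 an arbitrary order.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.HandleErrorResponse": 543,
}
```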

Then I changed my middleware class:

from twisted.internet import reactor
from twisted.internet.defer import Deferred

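# replaces class Handle429
class HandleErrorResponse:

    def __init__(self):
        self.time_pause = 1800  # pause length in seconds (30 minutes)

    def process_response(self, request, response, spider):
        # Scrapy calls this method for every response before it reaches the spider
        pass  # placeholder; the full body is shown at the end of this post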

Then I found some code that lets me pause the spider without time.sleep() (the process_response method at the end of this post).

And it works.

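I can't fully explain how reactor.callLater() works, but as far as I understand it, it schedules a call on Twisted's event loop (the reactor) to run after the given delay, without blocking the loop. Only once that call fires is your response passed on to the spider.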

Why not use time.sleep()? Is it because the pause is per domain?

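# in HandleErrorResponse
def process_response(self, request, response, spider):
    print(response.status)
    if response.status == 404:
        # Hold this response back for self.time_pause seconds without blocking the reactor
        d = Deferred()
        reactor.callLater(self.time_pause, d.callback, response)
        # Return the Deferred so the delay takes effect; this assumes Scrapy
        # waits on a Deferred returned from process_response
        return d

    return response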