Can Scrapy silently dequeue a request? (Python)

My specific use case: I have a scraper that crawls a site, and once an item is yielded, a bound signal sets a key in Redis with an expiry time. The next time the scraper runs, it should ignore any URL whose key still exists in Redis.

The first part works fine. For the second part, I created a DownloaderMiddleware with a process_request function that looks at the incoming request object and checks whether its URL exists in Redis. If it does, it raises an exception.

What I'd like to know is: is there a way to quietly dequeue the request instead of raising an exception? This is more an aesthetic requirement than a hard one; I don't want a ton of these in my error logs, I only want to see real errors.

I can see in the Scrapy source that they have what looks like dupe-filter mumbo jumbo in the main scheduler (scrapy/core/scheduler.py); a sketch of that logic follows below.
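
Roughly, the enqueue logic there looks like this (paraphrased from memory of Scrapy 1.x, so the exact code may differ between versions):

def enqueue_request(self, request):
    # duplicates are dropped quietly: the dupefilter logs them at its own
    # level and the request is simply never queued; no exception is raised
    if not request.dont_filter and self.df.request_seen(request):
        self.df.log(request, self.spider)
        return False
    # ... otherwise the request is pushed onto the disk or memory queue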


Scrapy uses the Python logging module to log things. Since what you want is purely cosmetic, you can write a logging filter to screen out the messages you don't want to see.
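
For instance, a minimal sketch of such a filter; the logger name and the matched message text here are assumptions, so adjust them to whatever your middleware actually emits:

import logging

class DeferredURLFilter(logging.Filter):
    # drop log records produced by our deferred-URL exception
    def filter(self, record):
        return 'URL is deferred' not in record.getMessage()

# attach the filter to the specific logger that emits the unwanted errors
# ('scrapy.core.scraper' is a guess; check your log output for the real name)
logging.getLogger('scrapy.core.scraper').addFilter(DeferredURLFilter())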

The OP's middleware code, taken from the comments:

def __init__(self, crawler):
    self.client = Redis()
    self.crawler = crawler
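    # this extra signal registration is what triggers the console errors (see below)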
    self.crawler.signals.connect(self.process_request, signals.request_scheduled)

def process_request(self, request, spider):
    if not self.client.is_deferred(request.url): # URL is not deferred, proceed as normal
        return None
    raise IgnoreRequest('URL is deferred')
The problem lies in the signal handler you attach to signals.request_scheduled: if it raises an exception, Scrapy catches it and reports it as an error in the console, and that is the noise you are seeing.

I believe registering process_request as a signal handler here is incorrect (or at least unnecessary).
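
For reference, a sketch of the middleware with the signal registration removed; process_request is then invoked only by the downloader middleware chain, and the IgnoreRequest it raises is handled quietly, showing up only in the crawl stats (as demonstrated below). The Redis client and its is_deferred helper are assumed from the OP's code:

from scrapy.exceptions import IgnoreRequest

class DeferredURLMiddleware(object):

    def __init__(self):
        # the OP's Redis wrapper, assumed to expose is_deferred(url)
        self.client = Redis()

    @classmethod
    def from_crawler(cls, crawler):
        # no signal connections here: Scrapy already calls process_request()
        # for every request that passes through the downloader middlewares
        return cls()

    def process_request(self, request, spider):
        if self.client.is_deferred(request.url):
            raise IgnoreRequest('URL is deferred')
        return None  # not deferred, proceed as normal

In other words, raising IgnoreRequest from a downloader middleware's process_request is already the quiet path; it is the extra signal handler that makes the noise.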

I can reproduce your console errors with this similar (and likewise incorrect) test middleware, which ignores every other request it sees:

from scrapy import signals
from scrapy.exceptions import IgnoreRequest

class TestMiddleware(object):

    def __init__(self, crawler):
        self.counter = 0

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler)
        crawler.signals.connect(o.open_spider, signals.spider_opened)

        # this raises an exception always and will trigger errors in the console
        crawler.signals.connect(o.process, signals.request_scheduled)
        return o

    def open_spider(self, spider):
        spider.logger.info('TestMiddleware.open_spider()')

    def process_request(self, request, spider):
        spider.logger.info('TestMiddleware.process_request()')
        self.counter += 1
        if (self.counter % 2) == 0:
            raise IgnoreRequest("ignoring request %d" % self.counter)

    def process(self, *args, **kwargs):
        raise Exception
Notice what the console prints when running the spider with this middleware:

2016-04-06 00:16:58 [scrapy] ERROR: Error caught on signal handler: <bound method ?.process of <mwtest.middlewares.TestMiddleware object at 0x7f83d4a73f50>>
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy11rc3.py27/local/lib/python2.7/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/home/paul/.virtualenvs/scrapy11rc3.py27/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/paul/tmp/mwtest/mwtest/middlewares.py", line 26, in process
    raise Exception
Exception
IgnoreRequest is not printed in the log, but you do get the exception counts in the stats at the end:

$ scrapy crawl httpbin
2016-04-06 00:27:24 [scrapy] INFO: Scrapy 1.1.0rc3 started (bot: mwtest)
(...)
2016-04-06 00:27:24 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'mwtest.middlewares.TestMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
(...)
2016-04-06 00:27:24 [scrapy] INFO: Spider opened
2016-04-06 00:27:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-06 00:27:24 [httpbin] INFO: TestMiddleware.open_spider()
2016-04-06 00:27:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-06 00:27:24 [httpbin] INFO: TestMiddleware.process_request()
2016-04-06 00:27:24 [httpbin] INFO: TestMiddleware.process_request()
2016-04-06 00:27:24 [httpbin] INFO: TestMiddleware.process_request()
2016-04-06 00:27:24 [httpbin] INFO: TestMiddleware.process_request()
2016-04-06 00:27:24 [httpbin] INFO: TestMiddleware.process_request()
2016-04-06 00:27:24 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/user-agent> (referer: None)
2016-04-06 00:27:25 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None)
2016-04-06 00:27:25 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/headers> (referer: None)
2016-04-06 00:27:25 [scrapy] INFO: Closing spider (finished)
2016-04-06 00:27:25 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 2,
 'downloader/request_bytes': 665,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 13006,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 4, 5, 22, 27, 25, 596652),
 'log_count/DEBUG': 4,
 'log_count/INFO': 13,
 'log_count/WARNING': 1,
 'response_received_count': 3,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2016, 4, 5, 22, 27, 24, 661345)}
2016-04-06 00:27:25 [scrapy] INFO: Spider closed (finished)
