Scrapy: raising IgnoreRequest in a custom downloader middleware does not work
I have written my own Scrapy downloader middleware that simply checks whether request.url already exists in the database and, if it does, raises IgnoreRequest:
def process_request(self, request, spider):
    # Called for each request that goes through the downloader
    # middleware.
    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    sql = """SELECT url FROM domain_sold WHERE url = %s;"""
    try:
        cursor = spider.db_connection.cursor()
        cursor.execute(sql, (request.url,))
        is_seen = cursor.fetchone()
        cursor.close()
        if is_seen:
            raise IgnoreRequest('duplicate url {}'.format(request.url))
    except (Exception, psycopg2.DatabaseError) as error:
        self.logger.error(error)
    return None
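When IgnoreRequest is raised, I expect the spider to move on to the next request. In my case, however, the spider still crawls that request and the resulting item still passes through my custom pipeline.
My current downloader middleware setting looks like this: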
'DOWNLOADER_MIDDLEWARES': {
    'realestate.middleware.RealestateDownloaderMiddleware': 99
}
Can anyone explain why this happens? Thanks.
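IgnoreRequest inherits from the base Exception class. Your except clause catches Exception, so the IgnoreRequest you raise inside the try block is caught immediately and only logged. It never propagates far enough for Scrapy to actually ignore the request.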
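The effect can be reproduced without Scrapy or a database. The following is only a minimal sketch: the IgnoreRequest class here is a stand-in for scrapy.exceptions.IgnoreRequest (which likewise subclasses Exception), and the check function and its arguments are hypothetical names used for illustration.

```python
# Stand-in for scrapy.exceptions.IgnoreRequest (assumption: defined locally
# so the example runs without Scrapy installed).
class IgnoreRequest(Exception):
    pass


def check(url, seen):
    """Mimic the question's control flow: raise inside try, broad except."""
    try:
        if url in seen:
            raise IgnoreRequest('duplicate url {}'.format(url))
    except Exception as error:
        # IgnoreRequest is an Exception, so it is caught here and never
        # reaches the caller.
        return 'swallowed: {}'.format(error)
    return None


result = check('http://example.com/a', {'http://example.com/a'})
print(result)
```

Because the exception is handled inside the function, the caller only ever sees a normal return value, which is why the request keeps being processed.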
Change:
except (Exception, psycopg2.DatabaseError) as error:
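To:
except psycopg2.DatabaseError as error:

This is correct, but a better and more concise answer is to remove the try/except altogether, because process_request should either return None, return a Response object, return a Request object, or raise IgnoreRequest (i.e. there is no need to catch errors).
@wishmaster That would mean any database exceptions are lost and never explicitly logged. As far as I can see, the code above does always either return None or raise IgnoreRequest (it will fail on any other exception that might occur). It looks like the OP wants to log DB exceptions rather than let them propagate, but has a rather broad and somewhat overzealous Exception in the except clause.
@JonClements Thanks. Your solution solved my problem.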
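One way to combine both suggestions is to keep the try block around the database calls only, catch just the database error, and raise IgnoreRequest outside the try. The sketch below illustrates that structure under stated assumptions: IgnoreRequest and DatabaseError are local stand-ins for scrapy.exceptions.IgnoreRequest and psycopg2.DatabaseError, FakeCursor is a hypothetical test double, and process_request is simplified to take a URL, a cursor and a list used as a log instead of Scrapy's request and spider objects.

```python
class IgnoreRequest(Exception):
    """Stand-in for scrapy.exceptions.IgnoreRequest (assumption)."""


class DatabaseError(Exception):
    """Stand-in for psycopg2.DatabaseError (assumption)."""


class FakeCursor:
    """Hypothetical cursor backed by a set of already-seen URLs."""

    def __init__(self, seen, fail=False):
        self.seen = seen
        self.fail = fail
        self.row = None

    def execute(self, sql, params):
        if self.fail:
            raise DatabaseError('connection lost')
        self.row = (params[0],) if params[0] in self.seen else None

    def fetchone(self):
        return self.row

    def close(self):
        pass


def process_request(url, cursor, log):
    sql = "SELECT url FROM domain_sold WHERE url = %s;"
    is_seen = None
    try:
        # Only the database calls live inside the try block.
        cursor.execute(sql, (url,))
        is_seen = cursor.fetchone()
    except DatabaseError as error:
        # Database problems are logged and the request is allowed through.
        log.append(str(error))
    finally:
        cursor.close()
    if is_seen:
        # Raised outside the try block, so nothing here can swallow it.
        raise IgnoreRequest('duplicate url {}'.format(url))
    return None
```

With this layout a database failure is still logged, while the duplicate check propagates IgnoreRequest to the caller. In a real middleware the same structure would apply with request.url, spider.db_connection.cursor() and self.logger from the question.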