Python: launching multiple Scrapy spiders without blocking the process
I am trying to execute a Scrapy spider from a separate script. When I run this script in a loop (for instance, running the same spider with different arguments), I get ReactorAlreadyRunning. My snippet:
from celery import task
from episode.skywalker.crawlers import settings
from multiprocessing.queues import Queue
from scrapy import log, project, signals
from scrapy.settings import CrawlerSettings
from scrapy.spider import BaseSpider
from scrapy.spidermanager import SpiderManager
from scrapy.xlib.pydispatch import dispatcher
import multiprocessing
from twisted.internet.error import ReactorAlreadyRunning


class CrawlerWorker(multiprocessing.Process):

    def __init__(self, spider, result_queue):
        from scrapy.crawler import CrawlerProcess
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue

        self.crawler = CrawlerProcess(CrawlerSettings(settings))
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        try:
            self.crawler.start()
        except ReactorAlreadyRunning:
            pass

        self.crawler.stop()
        self.result_queue.put(self.items)


@task
def execute_spider(spider, **spider__kwargs):
    '''
    Execute spider within separate process
    @param spider: spider class to crawl or the name (check if instance)
    '''
    if not isinstance(spider, BaseSpider):
        manager = SpiderManager(settings.SPIDER_MODULES)
        spider = manager.create(spider, **spider__kwargs)
    result_queue = Queue()
    crawler = CrawlerWorker(spider, result_queue)
    crawler.start()
    items = []
    for item in result_queue.get():
        items.append(item)
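Stripped of the Scrapy-specific parts, the worker/queue pattern in the snippet above can be sketched with only the standard library. This is an illustrative stand-in, not the original code: the Worker class and its uppercasing "work" are hypothetical, but the mechanics (subclass multiprocessing.Process, collect items in the child, hand them back through a Queue) are the same:

```python
import multiprocessing


class Worker(multiprocessing.Process):
    """Collects items in a child process and hands them back via a queue."""

    def __init__(self, work, result_queue):
        multiprocessing.Process.__init__(self)
        self.work = work
        self.result_queue = result_queue
        self.items = []

    def run(self):
        # Runs in the child process, so this self.items is a separate copy;
        # the queue is the only channel back to the parent.
        for piece in self.work:
            self.items.append(piece.upper())
        self.result_queue.put(self.items)


if __name__ == '__main__':
    q = multiprocessing.Queue()
    w = Worker(['a', 'b'], q)
    w.start()
    print(q.get())  # ['A', 'B']
    w.join()
```

Note that the parent never reads `w.items` directly; after `start()` the child has its own address space, which is exactly why `CrawlerWorker` above pushes `self.items` through `result_queue` instead of returning it.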
My suspicion is that it is caused by multiple Twisted reactor runs. How can I avoid that? In general, is there a way to run spiders without a reactor?

I found the cause of the problem: if execute_spider is somehow called inside the CrawlerWorker process (for instance via recursion), it causes the creation of a second reactor, which is not possible.

My solution was to move all the statements in the execute_spider method that caused the recursive call, so that they trigger the spider execution in the same process rather than in a second CrawlerWorker. I also wrote
    try:
        self.crawler.start()
    except ReactorAlreadyRunning:
        # RecursiveSpiderCall is a custom exception defined elsewhere in the project
        raise RecursiveSpiderCall("Spider %s was called from another spider recursively. Such behavior is not allowed" % (self.spider))
to catch the unintentional recursive calling of spiders.

(Answer) Your question is not really "how can I run a spider without a reactor", it is "how can I run more than one spider". Focusing on whether the reactor is running, and who runs it, will most likely lead you to the wrong answer.

(Comment) I wanted to know whether I can start multiple spiders without blocking the main process. Thanks for the correction, though.
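As the answer suggests, the practical way to run several spiders concurrently without blocking the caller is to give each one its own process, so each child can own its own Twisted reactor. A minimal stdlib sketch of that orchestration, where `crawl` is a hypothetical stand-in for starting a real CrawlerProcess in the child:

```python
import multiprocessing


def crawl(name, result_queue):
    # Stand-in for a spider run; in real code this function would start a
    # fresh CrawlerProcess (and therefore a fresh reactor) in the child.
    result_queue.put((name, ["%s-item-%d" % (name, i) for i in range(2)]))


if __name__ == '__main__':
    q = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=crawl, args=(n, q))
               for n in ('a', 'b', 'c')]
    for w in workers:
        w.start()  # returns immediately; the main process is not blocked
    # Collect one result per worker, in whatever order they finish.
    results = dict(q.get() for _ in workers)
    for w in workers:
        w.join()
    print(sorted(results))  # ['a', 'b', 'c']
```

Because each reactor lives and dies inside its own child process, ReactorAlreadyRunning cannot occur no matter how many spiders are launched, and the parent stays free between `start()` and the blocking `q.get()` calls.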