
Python: running multiple spiders in a for loop


I am trying to instantiate multiple spiders. The first one works fine, but the second one gives me an error: ReactorNotRestartable.

feeds = {
    'nasa': {
        'name': 'nasa',
        'url': 'https://www.nasa.gov/rss/dyn/breaking_news.rss',
        'start_urls': ['https://www.nasa.gov/rss/dyn/breaking_news.rss']
    },
    'xkcd': {
        'name': 'xkcd',
        'url': 'http://xkcd.com/rss.xml',
        'start_urls': ['http://xkcd.com/rss.xml']
    }    
}
Given the feeds above, I try to run the two spiders in a loop, like this:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):

    name = None

    def __init__(self, **kwargs):

        this_feed = feeds[self.name]
        self.start_urls = this_feed.get('start_urls')
        self.iterator = 'iternodes'
        self.itertag = 'items'
        super(MySpider, self).__init__(**kwargs)

    def parse_node(self, response, node):
        pass


def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None} # boto issues
    })

    for feed_name in feeds.keys():
        MySpider.name = feed_name
        process.crawl(MySpider)
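        # start() runs the Twisted reactor and blocks; calling it again on
        # the next iteration tries to restart a reactor that cannot restart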
        process.start() 
On the second pass through the loop the spider opens, but then this exception is raised:

...
2015-11-22 00:00:00 [scrapy] INFO: Spider opened
2015-11-22 00:00:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-22 00:00:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-21 23:54:05 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
Traceback (most recent call last):
  File "env/bin/start_crawler", line 9, in <module>
    load_entry_point('feed-crawler==0.0.1', 'console_scripts', 'start_crawler')()
  File "/Users/bling/py-feeds-crawler/feed_crawler/crawl.py", line 51, in start_crawler
    process.start() # the script will block here until the crawling is finished
  File "/Users/bling/py-feeds-crawler/env/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Do I have to somehow invalidate the first MySpider, or what am I doing wrong that needs to change for this to work? Thanks in advance.
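For context: CrawlerProcess.start() ultimately calls reactor.run(), and Twisted's reactor is a process-wide singleton that can only be started once. A minimal sketch of the reactor behavior alone, independent of Scrapy:

from twisted.internet import reactor
from twisted.internet.error import ReactorNotRestartable

reactor.callWhenRunning(reactor.stop)
reactor.run()  # first run: starts the reactor, which immediately stops

try:
    reactor.run()  # second run fails: the reactor cannot be restarted
except ReactorNotRestartable:
    print('the reactor can only be started once per process')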

It looks like you have to instantiate one process per spider. Try:

def start_crawler():      

    for feed_name in feeds.keys():
        process = CrawlerProcess({
            'USER_AGENT': CONFIG['USER_AGENT'],
            'DOWNLOAD_HANDLERS': {'s3': None} # boto issues
        })
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start() 
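
Note that the reactor is global to the Python process, so a fresh CrawlerProcess per spider still shares the same reactor, and the second process.start() in the loop raises the same error. One process per spider only helps if each spider gets its own OS process. A rough sketch using multiprocessing, where run_spider is my own helper and the MySpider, feeds, and CONFIG definitions are assumed from the question:

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess

def run_spider(feed_name):
    # Runs in a child process, which gets its own fresh reactor.
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })
    MySpider.name = feed_name
    process.crawl(MySpider)
    process.start()

def start_crawler():
    for feed_name in feeds.keys():
        p = Process(target=run_spider, args=(feed_name,))
        p.start()
        p.join()  # wait for one feed to finish before starting the next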

The solution was to collect the spiders in the loop and start the process only once at the end. My guess is that this has something to do with reactor allocation/deallocation.

def start_crawler():

    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None} # disable for issues with boto
    })

    for feed_name in CONFIG['Feeds'].keys():
        MySpider.name = feed_name
        process.crawl(MySpider)

    process.start()

Thanks to @eLRuLL, whose answer inspired this solution.

You can pass arguments to crawl() and use them during parsing:

class MySpider(XMLFeedSpider):
    def __init__(self, name, **kwargs):
        super(MySpider, self).__init__(**kwargs)

        self.name = name


def start_crawler():      
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None} # boto issues
    })

    for feed_name in feeds.keys():
        process.crawl(MySpider, feed_name)

    process.start()

This does make more sense, but it still raises an exception.
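
One likely culprit is that the base Spider.__init__ runs before self.name is assigned, and it raises ValueError when a spider has no name. A minimal sketch combining the two fixes above (a per-spider name argument plus a single process.start()), assuming the feeds and CONFIG definitions from the question:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):

    def __init__(self, name, **kwargs):
        # Configure the instance from its feed before calling the base
        # __init__, which requires the spider to have a name.
        self.name = name
        self.start_urls = feeds[name].get('start_urls')
        self.iterator = 'iternodes'
        self.itertag = 'items'
        super(MySpider, self).__init__(name=name, **kwargs)

    def parse_node(self, response, node):
        pass

def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })

    for feed_name in feeds.keys():
        # keyword arguments to crawl() are forwarded to MySpider.__init__
        process.crawl(MySpider, name=feed_name)

    process.start()  # start the reactor once, after queuing every spider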