Scrapy pipeline order gets mixed up when a specific pipeline is included (Python 2.7)

I have 6 pipelines, defined in settings.py like this:

ITEM_PIPELINES = {
    'SiteCrawler.pipelines.DuplicatesPipeline': 100,
    #'SiteCrawler.pipelines.ScreenshotPipeline': 200,
    'SiteCrawler.pipelines.NodesPipeline': 300,
    'SiteCrawler.pipelines.EdgesPipeline': 400,
    'SiteCrawler.pipelines.ParentsPipeline': 500,
    'SiteCrawler.pipelines.TextAnalysisPipeline': 600,
}
When the screenshot pipeline is commented out as above, the pipelines run in the correct order. I know this because I log a message every time each pipeline processes an item, and the order matches the priority values. However, when I include the screenshot pipeline, the order gets mixed up: 1, 3, 4, 5, 6 becomes 6, 4, 2, 5, 3, 1. I need them to run in the correct order.
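One thing worth checking before concluding the order is wrong: Scrapy can have several items in flight at once, so log lines emitted while different items pass through the pipelines can interleave even when each individual item traverses the pipelines in priority order. A way to verify the per-item order is to have each pipeline tag the item as it passes through. This is a minimal sketch with a hypothetical TracePipeline class, not the project's real pipelines:

```python
class TracePipeline(object):
    """Hypothetical pipeline that records its own name on each item."""

    def __init__(self, name):
        self.name = name

    def process_item(self, item, spider):
        # Append this pipeline's name so the item carries its own history.
        item.setdefault('trace', []).append(self.name)
        return item


# Simulate one item flowing through three pipelines in priority order:
pipelines = [TracePipeline('DuplicatesPipeline'),
             TracePipeline('NodesPipeline'),
             TracePipeline('EdgesPipeline')]
item = {}
for p in pipelines:
    item = p.process_item(item, spider=None)
print(item['trace'])
```

If each item's trace comes out in priority order, the pipelines themselves are fine and only the global log interleaving is misleading.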

Here is the code for the screenshot pipeline, in case something in it explains this:

from os import getcwd

from scrapy import log, signals
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# phantomjs_path is defined elsewhere in the project settings.


class ScreenshotPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        self.date = spider.unix_date
        self.domain = spider.scrape_domain
        self.driver = webdriver.PhantomJS(executable_path=phantomjs_path)
        self.driver.set_page_load_timeout(15)

    def process_item(self, item, spider):
        log.msg("screenshot")
        try:
            self.driver.get(item['url_obj'].full)
        except Exception:
            # Restart PhantomJS if the page load failed, then retry once.
            self.driver = webdriver.PhantomJS(executable_path=phantomjs_path)
            self.driver.set_page_load_timeout(15)
            self.driver.get(item['url_obj'].full)
        # Note: WebDriverWait only blocks when combined with until();
        # on its own the line below constructs a waiter and discards it.
        WebDriverWait(self.driver, timeout=2)
        self.driver.save_screenshot(r"{0}\initiator\static\scrapes\{1}\{2}\{3}.png".format(
            getcwd(), self.domain, self.date, item['url_obj'].name))
        return item

    def spider_closed(self, spider):
        pass

Thanks for any help.

I had the same problem with my pipelines: they were not ordered according to the values given in the ITEM_PIPELINES dict. I renamed the pipelines and, strangely, the execution order changed. However, it was still not the correct order.
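The renaming effect is consistent with plain-dict behavior on Python 2.7: dict iteration order there depends on the keys' hashes, not on the order the keys were written, so renaming a pipeline class can reshuffle any code that iterates the dict directly. Sorting by priority value gives a deterministic order on every Python version. A small illustration, using hypothetical shortened module names:

```python
ITEM_PIPELINES = {
    'pipelines.DuplicatesPipeline': 100,
    'pipelines.ScreenshotPipeline': 200,
    'pipelines.NodesPipeline': 300,
}

# list(ITEM_PIPELINES) may come out in any order on Python 2.7;
# sorting by the priority value does not depend on key names or hashes:
by_priority = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(by_priority)
```

This is also what Scrapy itself does when building the pipeline chain: it sorts the components by their integer value, so the values, not the key order, decide execution order.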

In the end I changed ITEM_PIPELINES from a dict to a list (written in the desired execution order), and that worked. The Scrapy documentation notes:

Lists are supported in ITEM_PIPELINES for backwards compatibility, but they are deprecated.

So you can try setting up your pipelines like this (as long as lists are still supported):

ITEM_PIPELINES = [
    'SiteCrawler.pipelines.DuplicatesPipeline',
    'SiteCrawler.pipelines.ScreenshotPipeline',
    'SiteCrawler.pipelines.NodesPipeline',
    'SiteCrawler.pipelines.EdgesPipeline',
    'SiteCrawler.pipelines.ParentsPipeline',
    'SiteCrawler.pipelines.TextAnalysisPipeline',
]

The problem still came back occasionally, and unfortunately this workaround no longer works in Python 3 with Scrapy 1.8.
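Since the list form is deprecated (and rejected by newer Scrapy releases), an equivalent dict can be generated from a list written in the desired order. This keeps the dict form modern Scrapy expects while making the intended ordering explicit in one place. A sketch, assuming the priorities 100, 200, ... are free for these pipelines:

```python
# The desired execution order, written once as a list:
_PIPELINE_ORDER = [
    'SiteCrawler.pipelines.DuplicatesPipeline',
    'SiteCrawler.pipelines.ScreenshotPipeline',
    'SiteCrawler.pipelines.NodesPipeline',
    'SiteCrawler.pipelines.EdgesPipeline',
    'SiteCrawler.pipelines.ParentsPipeline',
    'SiteCrawler.pipelines.TextAnalysisPipeline',
]

# Build the dict form Scrapy expects; lower values run first.
ITEM_PIPELINES = {name: (i + 1) * 100
                  for i, name in enumerate(_PIPELINE_ORDER)}
```

Because Scrapy sorts pipelines by these values, the generated dict is guaranteed to run in the list's order regardless of dict iteration quirks.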