Python 2.7: Scrapy pipeline order gets scrambled when a specific pipeline is included

I have six pipelines, defined in settings.py as follows:
ITEM_PIPELINES = {
    'SiteCrawler.pipelines.DuplicatesPipeline': 100,
    # 'SiteCrawler.pipelines.ScreenshotPipeline': 200,
    'SiteCrawler.pipelines.NodesPipeline': 300,
    'SiteCrawler.pipelines.EdgesPipeline': 400,
    'SiteCrawler.pipelines.ParentsPipeline': 500,
    'SiteCrawler.pipelines.TextAnalysisPipeline': 600,
}
When the screenshot pipeline is commented out like this, the pipelines run in the correct order. I know this because I log every time each pipeline is used; that is how I established the order. But when I include the screenshot pipeline, the order gets scrambled: 1, 3, 4, 5, 6 becomes 6, 4, 2, 5, 3, 1. I need them to run in the correct order.
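For reference, the expected behaviour is that Scrapy sorts the ITEM_PIPELINES dict by its integer values and passes each item through the pipelines in ascending priority. A minimal pure-Python sketch of that expected ordering (this is an illustration, not Scrapy's actual internals):

```python
# Toy model of the intended behaviour: pipelines should run in
# ascending priority order, lowest value first.
ITEM_PIPELINES = {
    'SiteCrawler.pipelines.DuplicatesPipeline': 100,
    'SiteCrawler.pipelines.ScreenshotPipeline': 200,
    'SiteCrawler.pipelines.NodesPipeline': 300,
    'SiteCrawler.pipelines.EdgesPipeline': 400,
    'SiteCrawler.pipelines.ParentsPipeline': 500,
    'SiteCrawler.pipelines.TextAnalysisPipeline': 600,
}

# Sort the dict entries by their priority value.
expected_order = [path for path, priority in
                  sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]

for path in expected_order:
    print(path.rsplit('.', 1)[-1])
# -> DuplicatesPipeline, ScreenshotPipeline, NodesPipeline,
#    EdgesPipeline, ParentsPipeline, TextAnalysisPipeline
```

If the logged order differs from this, something other than the priority values is deciding the execution order.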
Here is the code for the screenshot pipeline, in case anything in it explains the problem:
from os import getcwd

from scrapy import log, signals
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# phantomjs_path is defined elsewhere in the project.


class ScreenshotPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        self.date = spider.unix_date
        self.domain = spider.scrape_domain
        self.driver = webdriver.PhantomJS(executable_path=phantomjs_path)
        self.driver.set_page_load_timeout(15)

    def process_item(self, item, spider):
        log.msg("screenshot")
        try:
            self.driver.get(item['url_obj'].full)
        except Exception:
            # Restart PhantomJS if the page load failed or timed out.
            self.driver = webdriver.PhantomJS(executable_path=phantomjs_path)
            self.driver.set_page_load_timeout(15)
            self.driver.get(item['url_obj'].full)
        # Note: constructing a WebDriverWait without calling .until()
        # does not actually wait for anything.
        WebDriverWait(self.driver, timeout=2)
        self.driver.save_screenshot(
            r"{0}\initiator\static\scrapes\{1}\{2}\{3}.png".format(
                getcwd(), self.domain, self.date, item['url_obj'].name))
        return item

    def spider_closed(self, spider):
        pass
Thanks for your help.

I had the same problem with my pipelines: they were not ordered according to the priority values given in the ITEM_PIPELINES dict. I renamed the pipelines and, strangely, the execution order changed, but it was still not the correct one. In the end I changed ITEM_PIPELINES from a dict to a list (in the desired execution order), and that worked. The documentation notes:

Lists are supported in ITEM_PIPELINES for backwards compatibility, but they are deprecated.

(Two caveats from the comments: the problem sometimes still appears, and unfortunately this no longer works in Python 3 with Scrapy 1.8.)

So, as long as lists are still supported, you can try defining your pipelines like this:
ITEM_PIPELINES = [
    'SiteCrawler.pipelines.DuplicatesPipeline',
    'SiteCrawler.pipelines.ScreenshotPipeline',
    'SiteCrawler.pipelines.NodesPipeline',
    'SiteCrawler.pipelines.EdgesPipeline',
    'SiteCrawler.pipelines.ParentsPipeline',
    'SiteCrawler.pipelines.TextAnalysisPipeline',
]