Python: How do I run multiple spiders, each through its own pipeline?


Total noob here, just getting started with Scrapy.

My directory structure looks like this:

#FYI: running on Scrapy 2.4.1
WebScraper/
  Webscraper/
     spiders/
        spider.py    # (NOTE: contains spider1 and spider2 classes.)
     items.py
     middlewares.py
     pipelines.py    # (NOTE: contains spider1Pipeline and spider2Pipeline)
     settings.py     # (NOTE: I wrote here:
                     #ITEM_PIPELINES = {
                     #  'WebScraper.pipelines.spider1_pipelines': 300,
                     #  'WebScraper.pipelines.spider2_pipelines': 300,
                     #} 
  scrapy.cfg
and
spider.py
looks like this:

import scrapy

class OneSpider(scrapy.Spider):
    name = "spider1"

    def start_requests(self):
        urls = ["http://url1.com"]
        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff

class TwoSpider(scrapy.Spider):
    name = "spider2"

    def start_requests(self):
        urls = ["http://url2.com"]
        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff
and
pipelines.py
looks like this:

import csv

class spider1_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider1.csv', 'w', newline=''))
        self.csvwriter.writerow(['header1', 'header2'])

    def process_item(self, item, spider):
        row = []
        row.append(item['header1'])
        row.append(item['header2'])
        self.csvwriter.writerow(row)
        return item

class spider2_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider2.csv', 'w', newline=''))
        self.csvwriter.writerow(['header_a', 'header_b'])

    def process_item(self, item, spider):
        row = []
        row.append(item['header_a'])  # NOTE: this is not the same as header1
        row.append(item['header_b'])  # NOTE: this is not the same as header2
        self.csvwriter.writerow(row)
        return item
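
As an aside, a file opened in __init__ like this is never explicitly closed. Scrapy pipelines provide open_spider and close_spider hooks for exactly this kind of setup and teardown; a minimal sketch of spider1_pipelines rewritten with them, keeping the same hypothetical headers:

import csv

class spider1_pipelines(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the file and write the header row
        self.file = open('spider1.csv', 'w', newline='')
        self.csvwriter = csv.writer(self.file)
        self.csvwriter.writerow(['header1', 'header2'])

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['header1'], item['header2']])
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file
        self.file.close()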
My problem concerns running spider1 and spider2 on their different URLs with a single terminal command:

nohup scrapy crawl spider1 -o spider1_output.csv --logfile spider1.log & scrapy crawl spider2 -o spider2_output.csv --logfile spider2.log
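Note that the single & here sends the first crawl into the background, so the two crawls actually run at the same time. To run them strictly one after the other, && starts the second command only after the first one has exited successfully:

nohup scrapy crawl spider1 -o spider1_output.csv --logfile spider1.log && scrapy crawl spider2 -o spider2_output.csv --logfile spider2.log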
Note: this is an extension of an earlier question (from 2018).

Desired result: spider1.csv contains the data from spider1, and spider2.csv contains the data from spider2.

Current result: spider1.csv contains the data from spider1, but spider2.csv breaks. The error log does contain the spider2 data, along with a
KeyError: 'header1'
even though spider2's items don't include
header1
at all; they only include
header_a

Does anyone know how to run these spiders one after the other on their different URLs, with the data fetched by spider1, spider2, etc. going into the pipeline specific to that spider, i.e. spider1 -> spider1Pipeline -> spider1.csv and spider2 -> spider2Pipeline -> spider2.csv?

Or could this be a matter of defining
spider1_items
and
spider2_items
in items.py? I'm wondering whether I can specify where spider2's data gets inserted that way; a sketch of what that would look like is below.
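
For reference, a minimal sketch of items.py with two separate item classes, using the hypothetical field names from above:

import scrapy

class Spider1Item(scrapy.Item):
    header1 = scrapy.Field()
    header2 = scrapy.Field()

class Spider2Item(scrapy.Item):
    header_a = scrapy.Field()
    header_b = scrapy.Field()

On its own this doesn't route items to a particular pipeline, though; every enabled pipeline still sees every item.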


Thanks, everyone!

You can use the custom_settings spider attribute to set the settings for each spider individually. This is also why the KeyError appears: with both pipelines enabled globally in settings.py, every spider's items pass through both of them, so spider1_pipelines tries to read item['header1'] from spider2's items.

# spider.py
class OneSpider(scrapy.Spider):
    name = "spider1"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider1_pipelines': 300}
    }
    ...

class TwoSpider(scrapy.Spider):
    name = "spider2"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider2_pipelines': 300}
    }
    ...

I feel like I should clarify: I'm not trying to run the spiders asynchronously, just one after the other. I want spider1 -> spider1_pipeline -> spider1.csv to run, and then spider2 -> spider2_pipeline -> spider2.csv. If there's a more efficient way to write this, I'd appreciate suggestions as well.

This worked perfectly, thank you!