Python: How do I run multiple spiders, each through its own pipeline?
Total noob, just getting started with Scrapy. In my directory structure I have this:
#FYI: running on Scrapy 2.4.1
WebScraper/
    Webscraper/
        spiders/
            spider.py    # (NOTE: contains spider1 and spider2 classes.)
        items.py
        middlewares.py
        pipelines.py     # (NOTE: contains spider1Pipeline and spider2Pipeline)
        settings.py      # (NOTE: I wrote here:
                         #  ITEM_PIPELINES = {
                         #      'WebScraper.pipelines.spider1_pipelines': 300,
                         #      'WebScraper.pipelines.spider2_pipelines': 300,
                         #  }
    scrapy.cfg
And spider2.py looks something like this:
class OneSpider(scrapy.Spider):
    name = "spider1"

    def start_requests(self):
        urls = ["url1.com",]
        yield scrapy.Request(
            url="http://url1.com",
            callback=self.parse
        )

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff

class TwoSpider(scrapy.Spider):
    name = "spider2"

    def start_requests(self):
        urls = ["url2.com",]
        yield scrapy.Request(
            url="http://url2.com",
            callback=self.parse
        )

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff
with a pipelines.py that looks like this:
import csv

class spider1_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider1.csv', 'w', newline=''))
        self.csvwriter.writerow(['header1', 'header2'])

    def process_item(self, item, spider):
        row = []
        row.append(item['header1'])
        row.append(item['header2'])
        self.csvwriter.writerow(row)
        return item

class spider2_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider2.csv', 'w', newline=''))
        self.csvwriter.writerow(['header_a', 'header_b'])

    def process_item(self, item, spider):
        row = []
        row.append(item['header_a'])  # NOTE: this is not the same as header1
        row.append(item['header_b'])  # NOTE: this is not the same as header2
        self.csvwriter.writerow(row)
        return item
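As an aside (a sketch, not part of the original question): opening the CSV file in `__init__` means it is created the moment Scrapy instantiates the pipeline, for whichever spider is running. A shape that is usually considered more idiomatic uses Scrapy's `open_spider`/`close_spider` hooks, which are called at the start and end of each crawl. The class name here is illustrative:

```python
import csv

class Spider1CsvPipeline:
    """Sketch: same behavior as spider1_pipelines above, but the file is
    opened in open_spider and closed in close_spider instead of __init__."""

    def open_spider(self, spider):
        self.file = open('spider1.csv', 'w', newline='')
        self.csvwriter = csv.writer(self.file)
        self.csvwriter.writerow(['header1', 'header2'])

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['header1'], item['header2']])
        return item  # returning the item lets later pipelines process it too

    def close_spider(self, spider):
        self.file.close()
```

This also guarantees the file handle is flushed and closed when the crawl finishes, rather than whenever the pipeline object happens to be garbage-collected.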
My question is about running spider1 and spider2 on their different URLs with a single terminal command:
nohup scrapy crawl spider1 -o spider1_output.csv --logfile spider1.log & scrapy crawl spider2 -o spider2_output.csv --logfile spider2.log
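One observation about this command (an aside, not from the original post): a single `&` backgrounds the first crawl, so the two spiders actually run at the same time. Chaining with `&&` waits for the first crawl to finish before starting the second, which matches the "one after another" intent:

```shell
# Run spider1 to completion, then spider2, detached from the terminal.
nohup sh -c '
  scrapy crawl spider1 -o spider1_output.csv --logfile spider1.log &&
  scrapy crawl spider2 -o spider2_output.csv --logfile spider2.log
' &
```

Note that with `&&` the second crawl only starts if the first exits successfully; use `;` instead to run it unconditionally.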
Note: this is a follow-up to an earlier question (from 2018).
Desired result: spider1.csv contains the data from spider1, and spider2.csv contains the data from spider2.
Current result: spider1.csv contains the data from spider1, but spider2.csv breaks. The error log contains spider2's data along with a KeyError: 'header1', even though spider2's items don't include header1 at all; they only include header_a.
Does anyone know how to run the spiders one after another on their different URLs, with the data fetched by spider1, spider2, etc. going into the pipeline specific to that spider, i.e. spider1 -> spider1Pipeline -> spider1.csv, spider2 -> spider2Pipeline -> spider2.csv?
Or could this be a matter of specifying spider1_item and spider2_item in items.py? I wonder whether I can specify where spider2's data gets inserted that way.
Thanks, everyone!

You can do this with a spider attribute, which lets you set the settings individually for each spider:
#spider2.py
class OneSpider(scrapy.Spider):
    name = "spider1"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider1_pipelines': 300}
    }
    ...

class TwoSpider(scrapy.Spider):
    name = "spider2"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider2_pipelines': 300}
    }
    ...
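An alternative to `custom_settings` (a sketch of a different approach, not part of this answer): keep both entries in the project-wide ITEM_PIPELINES, and have each pipeline ignore the other spider's items by checking `spider.name`, which Scrapy passes to every pipeline method. That also avoids the KeyError, since spider1's pipeline never touches spider2's items:

```python
import csv

class spider1_pipelines:
    """Sketch: stays enabled for every spider, but only writes (and only
    opens spider1.csv) when the running spider is spider1."""

    def open_spider(self, spider):
        self.file = None
        if spider.name == 'spider1':
            self.file = open('spider1.csv', 'w', newline='')
            self.csvwriter = csv.writer(self.file)
            self.csvwriter.writerow(['header1', 'header2'])

    def process_item(self, item, spider):
        if self.file is not None:  # skip items from any other spider
            self.csvwriter.writerow([item['header1'], item['header2']])
        return item

    def close_spider(self, spider):
        if self.file is not None:
            self.file.close()
```

spider2_pipelines would mirror this with `spider.name == 'spider2'` and the header_a/header_b columns. The `custom_settings` approach from the answer is cleaner when the pipelines are strictly one-per-spider; the name check is handy when some pipelines are shared and some are not.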
I feel I should clarify a bit: I don't intend to run the spiders asynchronously, just one after another. I want spider1 -> spider1_pipeline -> spider1.csv to run, and then spider2 -> spider2_pipeline -> spider2.csv. If there is a more efficient way to write this, I'd appreciate suggestions for that too. This works perfectly, thank you!