Python: How do I run multiple Scrapy spiders, each scraping a different URL?
I have a spider.py file in a Scrapy project containing the following spiders:
import scrapy

class OneSpider(scrapy.Spider):
    name = "s1"

    def start_requests(self):
        urls = ["http://url1.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrape stuff, put it in a dict
        yield dictOfScrapedStuff

class TwoSpider(scrapy.Spider):
    name = "s2"

    def start_requests(self):
        urls = ["http://url2.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrape stuff, put it in a dict
        yield dictOfScrapedStuff
How do I run the spiders s1 and s2, and write their scraped results to s1.json and s2.json respectively?

The scrapy crawl command runs only one spider per invocation, so the simplest approach is to run two processes:
scrapy crawl s1 -o s1.json
scrapy crawl s2 -o s2.json
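If you'd rather launch both from a single Python script instead of typing the commands by hand, here is a minimal sketch using the standard subprocess module. It assumes scrapy is on your PATH, the script is run from the project root, and it reuses the s1.json/s2.json file names from the question:

import subprocess

# Start each spider as its own OS process; each one writes to its own feed file.
procs = [
    subprocess.Popen(["scrapy", "crawl", "s1", "-o", "s1.json"]),
    subprocess.Popen(["scrapy", "crawl", "s2", "-o", "s2.json"]),
]

# Block until both crawls have finished.
for proc in procs:
    proc.wait()

Both crawls run in parallel this way; if you want them sequential, use subprocess.run for each command instead.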
If you want to do this in the same terminal window, you have to either:
- run the first crawl, suspend it with ctrl+z, push it to the background with bg, then run the second crawl
- use nohup, for example:
nohup scrapy crawl s1 -o s1.json --logfile s1.log &
- use the screen command:
$ screen
$ scrapy crawl s1 -o s1.json
$ ctrl+a ctrl+d  # detach screen
$ screen
$ scrapy crawl s2 -o s2.json
$ ctrl+a ctrl+d  # detach screen
$ screen -r      # reattach to one of your sessions to see how the spider is doing
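For completeness: Scrapy itself can also run several spiders inside one process through its CrawlerProcess API. A minimal sketch, assuming a Scrapy version with FEEDS support (2.1+) and that the spider classes are importable — the import path below is hypothetical and depends on your project layout:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical import path; adjust to wherever spider.py lives in your project.
from myproject.spiders.spider import OneSpider, TwoSpider

# Give each spider its own output feed before the crawlers are created.
OneSpider.custom_settings = {"FEEDS": {"s1.json": {"format": "json"}}}
TwoSpider.custom_settings = {"FEEDS": {"s2.json": {"format": "json"}}}

process = CrawlerProcess(get_project_settings())
process.crawl(OneSpider)
process.crawl(TwoSpider)
process.start()  # blocks until both spiders finish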