Does a scrapy spider download from multiple domains simultaneously?
I am trying to scrape 2 domains at the same time. I created a spider like this:
    # Imports for the old (pre-1.0) Scrapy API that this snippet uses
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor
    from scrapy import log

    class TestSpider(CrawlSpider):
        name = 'test-spider'
        allowed_domains = ['domain-a.com', 'domain-b.com']
        start_urls = ['http://www.domain-a.com/index.html',
                      'http://www.domain-b.com/index.html']

        rules = (
            Rule(LinkExtractor(), follow=True, callback='parse_item'),
        )

        def parse_item(self, response):
            log.msg('parsing ' + response.url, level=log.DEBUG)
I expected to see a mix of "domain-a.com" and "domain-b.com" entries in the output, but only domain-a ever appears in the log. However, if I run separate spiders/crawlers, I do see both domains being crawled at the same time (this is not the actual code, but it illustrates the point):
Comments: Thanks. / It may be worth checking the crawl order: depth-first (the default) may favor domain-a. / Thanks shane, I will look into this.
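The comment's hint can be sketched without Scrapy. The toy scheduler below (hypothetical page names, not the real Scrapy scheduler) shows why a LIFO (depth-first) frontier lets newly discovered links from one start domain keep jumping ahead of the other domain's start URL, while a FIFO (breadth-first) frontier interleaves the two domains:

```python
from collections import deque

def children(url):
    # Hypothetical link extractor: each page yields two child links
    # on its own domain, down to a fixed depth.
    return [url + '/0', url + '/1'] if url.count('/') < 3 else []

def crawl_order(start_urls, lifo, limit=6):
    frontier = deque(start_urls)
    visited = []
    while frontier and len(visited) < limit:
        # LIFO pop = depth-first (Scrapy's default order); FIFO = breadth-first
        url = frontier.pop() if lifo else frontier.popleft()
        visited.append(url)
        frontier.extend(children(url))
    return visited

starts = ['a.com', 'b.com']
dfo = crawl_order(starts, lifo=True)   # one start domain monopolizes the log
bfo = crawl_order(starts, lifo=False)  # both domains interleave
```

In Scrapy itself, the breadth-first switch documented in the FAQ is `DEPTH_PRIORITY = 1` together with the FIFO scheduler queues (`SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'` and `SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'`).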
    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.utils.project import get_project_settings

    def setup_crawler(url):
        spider = TestSpider(start_url=url)
        crawler = Crawler(get_project_settings())
        crawler.configure()
        # Pass the callable itself: calling reactor.stop() here would stop
        # the reactor immediately rather than when the spider closes.
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.crawl(spider)
        crawler.start()

    setup_crawler('http://www.domain-a.com/index.html')
    setup_crawler('http://www.domain-b.com/index.html')
    log.start(loglevel=log.DEBUG)
    reactor.run()
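One subtlety in snippets like the one above: `connect` expects the callable itself (`reactor.stop`), not the result of calling it (`reactor.stop()`), which would stop the reactor immediately and register `None` as the handler. A minimal sketch with a hypothetical dispatcher (not the real Scrapy/Twisted one) shows the difference:

```python
class Dispatcher:
    """Hypothetical minimal signal dispatcher, mimicking how a handler
    is connected to a signal in signal-based frameworks."""

    def __init__(self):
        self.handlers = {}

    def connect(self, handler, signal):
        # Guard against the reactor.stop() mistake: storing the *result*
        # of a call both runs it too early and stores a non-callable.
        assert callable(handler), "connect() needs a callable, not a call result"
        self.handlers.setdefault(signal, []).append(handler)

    def send(self, signal):
        for handler in self.handlers.get(signal, []):
            handler()

events = []
d = Dispatcher()
d.connect(lambda: events.append('stopped'), signal='spider_closed')
d.send('spider_closed')
```

Newer Scrapy versions sidestep this wiring entirely: `CrawlerProcess` can run several spiders in one reactor and handles shutdown itself.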