Python: Running all spiders locally in Scrapy

Tags: python, web-crawler, scrapy

Is there a way to run all of the spiders in a Scrapy project without using the Scrapy daemon? There used to be a way to run multiple spiders with scrapy crawl, but that syntax was removed and Scrapy's code has changed quite a bit.

I tried creating my own command:
from scrapy.command import ScrapyCommand
from scrapy.utils.misc import load_object
from scrapy.conf import settings

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spman_cls = load_object(settings['SPIDER_MANAGER_CLASS'])
        spiders = spman_cls.from_settings(settings)

        for spider_name in spiders.list():
            spider = self.crawler.spiders.create(spider_name)
            self.crawler.crawl(spider)

        self.crawler.start()
But once one spider has been registered with self.crawler.crawl(), I get assertion errors for all of the other spiders:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/home/blender/Projects/scrapers/store_crawler/store_crawler/commands/crawlall.py", line 22, in run
    self.crawler.crawl(spider)
  File "/usr/lib/python2.7/site-packages/scrapy/crawler.py", line 47, in crawl
    return self.engine.open_spider(spider, requests)
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 1214, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 1071, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python2.7/site-packages/scrapy/core/engine.py", line 215, in open_spider
    spider.name
exceptions.AssertionError: No free spider slots when opening 'spidername'
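Is there any way to do this? I would rather not start subclassing core Scrapy components just to run all of my spiders like this.

One option that does not use a custom command is to run the Twisted reactor manually and create a new Crawler for each spider. You will have to arrange for the reactor to be stopped after all of the spiders have finished.

EDIT: Here is how you can run multiple spiders in a custom command: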
from scrapy.command import ScrapyCommand
from scrapy.utils.project import get_project_settings
from scrapy.crawler import Crawler

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        settings = get_project_settings()

        for spider_name in self.crawler.spiders.list():
            crawler = Crawler(settings)
            crawler.configure()
            spider = crawler.spiders.create(spider_name)
            crawler.crawl(spider)
            crawler.start()

        self.crawler.start()
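The remark about stopping the reactor once every spider has finished can be sketched as a small countdown. This is only an illustration under assumptions: StopWhenAllClosed, register and on_closed are hypothetical names, not part of Scrapy. In a real project on_closed would presumably be connected to Scrapy's spider_closed signal and stop would be reactor.stop; that wiring is not shown here.

```python
class StopWhenAllClosed:
    """Call `stop` exactly once, after every registered spider has closed.

    Hypothetical helper for illustration; not part of Scrapy.
    """

    def __init__(self, stop):
        self._stop = stop        # callable to invoke at the end, e.g. reactor.stop
        self._pending = set()    # names of spiders that are still running
        self._stopped = False    # guards against calling stop() twice

    def register(self, spider_name):
        self._pending.add(spider_name)

    def on_closed(self, spider_name):
        # Meant to be used as the handler for a "spider closed" notification.
        self._pending.discard(spider_name)
        if not self._pending and not self._stopped:
            self._stopped = True
            self._stop()


# Stand-in demonstration with no Scrapy or Twisted involved.
stopped = []
tracker = StopWhenAllClosed(stop=lambda: stopped.append(True))
for name in ("spider_a", "spider_b"):
    tracker.register(name)

tracker.on_closed("spider_a")  # one spider is still pending, nothing happens
tracker.on_closed("spider_b")  # last spider closed, stop() is called
```

Registering every spider before any of them starts matters: otherwise the first spider to finish could empty the pending set and stop the reactor too early.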
Why not just use something like this?

scrapy list | xargs -n 1 scrapy crawl

In Scrapy 1.0, @Steven Almeroth's answer will fail. You should edit the script like this:
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

class Command(ScrapyCommand):
    requires_project = True
    excludes = ['spider1']

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        settings = get_project_settings()
        crawler_process = CrawlerProcess(settings)

        for spider_name in crawler_process.spider_loader.list():
            if spider_name in self.excludes:
                continue
            spider_cls = crawler_process.spider_loader.load(spider_name)
            crawler_process.crawl(spider_cls)

        crawler_process.start()
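For Scrapy to find a custom command like the one above, the project has to point the COMMANDS_MODULE setting at the package that contains it. A minimal sketch, assuming the layout suggested by the traceback in the question (the command saved as store_crawler/commands/crawlall.py, with an __init__.py next to it); adjust the module path to your own project:

```python
# settings.py of the project.
# The module path below is inferred from the traceback in the question
# and is an assumption about the project layout.
COMMANDS_MODULE = "store_crawler.commands"
```

The command name is taken from the file name, so with this layout the command should be available as scrapy crawlall.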
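This code works for me on Scrapy 1.3.3 (save it in the same directory as scrapy.cfg). For Scrapy 1.5.x there is a second version that does not trigger the deprecation warning. Both scripts appear at the end of this page.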
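Comments:

What version of Scrapy are you using? $ scrapy version -v

Are you aware of Scrapyd?

0.16.4. I do know about Scrapyd, but I'm testing these spiders locally, so I'd rather not use it.

Thanks, this is exactly what I wanted to do.

How do I start the program?

Put the code in a text editor and save it as mycolcrawler.py. On Linux you can probably run python mycolcrawler.py from the command line in the directory where you saved it. On Windows you may be able to double-click it in the file manager.

Could you explain the difference between Crawler and Spider? OK, Spider controls how responses are processed (such as which items to scrape and how to extract links...), so what does Crawler do?

@Alcott That's right. Spider processes the responses, and Crawler handles the Spiders: it instantiates them, configures settings and middleware, and so on.

Use the -P 0 option of xargs to run all spiders in parallel.

I also used the answer given above by Yuda Prawira, which still works in Scrapy 1.5.2, but I got this warning: ScrapyDeprecationWarning: CrawlerRunner.spiders attribute is renamed to CrawlerRunner.spider_loader. All you have to do is change the name in the for loop: for spider in process.spider_loader.list(): ... Otherwise it still works!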
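The scrapy list | xargs -n 1 scrapy crawl pipeline, including the parallel variant mentioned in the comments, can also be written as a short Python script. This is a sketch only, assuming the scrapy executable is on PATH and the script is run from inside the project directory; crawl_commands, list_spiders and run_all are hypothetical helper names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def crawl_commands(spider_names):
    """Build one `scrapy crawl <name>` argument list per spider name."""
    return [["scrapy", "crawl", name] for name in spider_names]


def list_spiders():
    """Return the spider names printed by `scrapy list`."""
    result = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    )
    return result.stdout.split()


def run_all(max_workers=1):
    """Run every spider in its own process and return the exit codes.

    max_workers=1 runs them one after another, like `xargs -n 1`;
    a larger value runs several at once, similar in spirit to `xargs -P`.
    """
    commands = crawl_commands(list_spiders())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # subprocess.call blocks until the child exits and returns its exit code.
        return list(pool.map(subprocess.call, commands))


# Example call (needs a Scrapy project, so it is left commented out):
# exit_codes = run_all(max_workers=4)
```

Because each spider runs in a separate process, the "No free spider slots" assertion from the question does not come into play, at the cost of starting Scrapy once per spider.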
Script for Scrapy 1.3.3:

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spiders.list():
    print("Running spider %s" % (spider_name))
    # query="dvh" is a custom argument passed to your spiders
    process.crawl(spider_name, query="dvh")

process.start()
Script for Scrapy 1.5.x (uses spider_loader, so there is no deprecation warning):

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spider_loader.list():
    print("Running spider %s" % (spider_name))
    # query="dvh" is a custom argument passed to your spiders
    process.crawl(spider_name, query="dvh")

process.start()
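If one script has to run on both older and newer Scrapy versions, the attribute lookup can be isolated in a small helper. A sketch under assumptions: get_spider_loader is a hypothetical name, and the only thing it relies on is the rename quoted in the deprecation warning (spiders became spider_loader).

```python
def get_spider_loader(process):
    """Return the spider loader under whichever attribute name exists.

    Prefers the newer `spider_loader` name and falls back to the older
    `spiders` name only when the newer one is missing.
    """
    loader = getattr(process, "spider_loader", None)
    if loader is None:
        loader = process.spiders  # older attribute name
    return loader
```

The loop in the scripts above would then read for spider_name in get_spider_loader(process).list(): and stay the same across versions.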