Python: how do I manually stop a Scrapy crawler once it has scraped all the provided URLs?

I have implemented a crawler that takes URLs from a text file and scrapes all of them, and it should then stop.

My implementation:

import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class CoreSpider(scrapy.Spider):
    name = "final"
    custom_settings = {
        'ROBOTSTXT_OBEY': 'False',
        'HTTPCACHE_ENABLED': 'True',
        'LOG_ENABLED': 'False',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'random_useragent.RandomUserAgentMiddleware': 320
        },
    }

    def __init__(self):
        self.all_ngrams = get_ngrams()
        # logging.DEBUG(self.all_ngrams)
        self.search_term = ""
        self.start_urls = self.read_url()
        self.rules = (
            Rule(LinkExtractor(unique=True), callback='parse', follow=True, process_request='process_request'),
        )
 .....
 .....
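
(get_ngrams() and read_url() are project helpers whose bodies are omitted above. Since the spider takes its start URLs from a text file, read_url() is presumably something like the following sketch, assuming one URL per line; the urls.txt file name is hypothetical:)

def read_url(self):
    # Hypothetical sketch: the real file name and format are not shown
    # in the question; assumes a plain text file with one URL per line.
    with open('urls.txt') as f:
        return [line.strip() for line in f if line.strip()]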
I run this spider from a script, like this:

process = CrawlerProcess(get_project_settings())
process.crawl(CoreSpider)
process.start()
It gives the error
twisted.internet.error.ReactorNotRestartable
once it has finished scraping all the URLs.
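
For reference, CrawlerProcess.start() is supposed to stop the Twisted reactor on its own once every crawl has finished (its stop_after_crawl parameter defaults to True), so this error usually means the reactor is being started a second time, not that it fails to stop. A minimal sketch of the intended usage:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(CoreSpider)
# stop_after_crawl=True is the default: the reactor stops on its own
# once all scheduled crawls have finished.
process.start(stop_after_crawl=True)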

I tried an implementation like the one below, and it gives the same error as before:

runner = CrawlerRunner(get_project_settings())
d = runner.crawl(CoreSpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
Then I tried running the spider like this:

runner = CrawlerRunner(get_project_settings())
@defer.inlineCallbacks
def crawl():
    yield runner.crawl(CoreSpider)
    reactor.stop()

crawl()
reactor.run()
but it still gives the same error.

How can I manually stop the spider once all the URLs have been scraped?

Update: Python 2.7 stack trace:

Traceback (most recent call last):
  File "seed_list_generator.py", line 768, in <module>
    process = CrawlerProcess(get_project_settings())
  File "/root/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 243, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/root/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 134, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/root/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/root/anaconda2/lib/python2.7/site-packages/scrapy/spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "/root/anaconda2/lib/python2.7/site-packages/scrapy/spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "/root/anaconda2/lib/python2.7/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "/root/anaconda2/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/root/anaconda2/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/root/Public/company_profiler/profiler/spiders/run_spider.py", line 12, in <module>
    process.start()
  File "/root/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/root/anaconda2/lib/python2.7/site-packages/twisted/internet/base.py", line 1242, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/root/anaconda2/lib/python2.7/site-packages/twisted/internet/base.py", line 1222, in startRunning
    ReactorBase.startRunning(self)
  File "/root/anaconda2/lib/python2.7/site-packages/twisted/internet/base.py", line 730, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Python 3.6 traceback:

 File "seed_list_generator.py", line 769, in <module>
    process = CrawlerProcess(get_project_settings())
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 249, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 137, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 336, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/root/anaconda3/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
  File "/root/Public/company_profiler/profiler/spiders/run_spider.py", line 12, in <module>
    process.start()
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 291, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/root/anaconda3/lib/python3.6/site-packages/twisted/internet/base.py", line 1242, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/root/anaconda3/lib/python3.6/site-packages/twisted/internet/base.py", line 1222, in startRunning
    ReactorBase.startRunning(self)
  File "/root/anaconda3/lib/python3.6/site-packages/twisted/internet/base.py", line 730, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Thanks in advance.

What happens if you change the code like this?

class CoreSpider(scrapy.Spider):
    name = "final"
    custom_settings = {
        'ROBOTSTXT_OBEY': 'False',
        'HTTPCACHE_ENABLED': 'True',
        'LOG_ENABLED': 'False',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'random_useragent.RandomUserAgentMiddleware': 320
        },
    }

    def __init__(self,*args,**kwargs):
        # python 3
        super().__init__(*args,**kwargs)
        # python 2
        # super(CoreSpider, self).__init__(*args, **kwargs)

        self.all_ngrams = get_ngrams()
        # logging.DEBUG(self.all_ngrams)
        self.search_term = ""
        self.start_urls = self.read_url()
        self.rules = (
            Rule(LinkExtractor(unique=True), callback='parse', follow=True, process_request='process_request'),
        )
 .....
 .....

Finally, I put the crawler startup code inside an if __name__ == '__main__': block, and that stopped it successfully:

if __name__ == '__main__':
    process = CrawlerProcess(get_project_settings())
    process.crawl(CoreSpider)
    process.start()

Once the crawler has finished scraping all the URLs, it stops gracefully.
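
The if __name__ == '__main__': guard matters because of where the script lives. The tracebacks show the runner script is /root/Public/company_profiler/profiler/spiders/run_spider.py, i.e. inside the spiders package. When CrawlerProcess builds its spider loader, it imports every module in that package (the walk_modules frames above), which re-executes the script's top-level process.start() call after a reactor has already been run, and a Twisted reactor cannot be restarted. With the guard, the startup code runs only when the script is executed directly, not when Scrapy imports it. The same guard should also fix the CrawlerRunner variants from the question; a minimal sketch (the CoreSpider import path is hypothetical):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
# Hypothetical import path; use wherever CoreSpider is actually defined.
from profiler.spiders.final import CoreSpider

if __name__ == '__main__':
    configure_logging()
    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl(CoreSpider)
    # Stop the reactor whether the crawl succeeds or fails.
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # blocks here until the crawl is finished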

Can you post the full stack trace of the exception?
@TarunLalwani, I have updated the question with the traceback.
I think the sample code may be for Python 3 and a newer version of Scrapy, which could be what is causing the problem. Can you try it with Python 3?
It doesn't work with Python 3 either. I have added the Python 3 traceback.
So I worked on this for hours before finding your post. Do you know what is going on behind the scenes if you put if __name__ == '__main__':