Python: how can I manually stop a Scrapy spider once it has scraped all the provided URLs?
I have implemented a crawler that reads URLs from a text file, scrapes them all, and should then stop. My implementation:
class CoreSpider(scrapy.Spider):
    name = "final"
    custom_settings = {
        'ROBOTSTXT_OBEY': 'False',
        'HTTPCACHE_ENABLED': 'True',
        'LOG_ENABLED': 'False',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'random_useragent.RandomUserAgentMiddleware': 320
        },
    }

    def __init__(self):
        self.all_ngrams = get_ngrams()
        # logging.DEBUG(self.all_ngrams)
        self.search_term = ""
        self.start_urls = self.read_url()
        self.rules = (
            Rule(LinkExtractor(unique=True), callback='parse', follow=True, process_request='process_request'),
        )
    .....
    .....
I run this spider from a script like this:
process = CrawlerProcess(get_project_settings())
process.crawl(CoreSpider)
process.start()
Once it has finished scraping all the URLs, it raises the error:
twisted.internet.error.ReactorNotRestartable
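For intuition about this error: Twisted's reactor is a one-shot event loop and cannot be started again after it stops. Python's standard-library asyncio event loop has the same constraint once closed, so a minimal stdlib-only analogy (this is an illustration of the one-shot-loop behavior, not Scrapy code) looks like this:

```python
import asyncio

async def noop():
    pass

# Run an event loop once, then shut it down -- mirroring what a completed
# reactor.run() / reactor.stop() cycle does in Twisted.
loop = asyncio.new_event_loop()
loop.run_until_complete(noop())
loop.close()

# Attempting to reuse the finished loop fails, just as a second
# reactor.run() raises twisted.internet.error.ReactorNotRestartable.
try:
    loop.run_until_complete(noop())
except RuntimeError as e:
    print(e)  # Event loop is closed
```

The takeaway is that the error means reactor.run() is being reached twice in the same process, so something in the script must be starting the reactor more than once.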
I also tried an implementation like the one below; it raises the same error as before:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
d = runner.crawl(CoreSpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
Then I tried running the spider like this:
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(CoreSpider)
    reactor.stop()

crawl()
reactor.run()
But it still raises the same error. How can I manually stop the spider once all the URLs have been scraped?
Update: Python 2.7 stack trace:
Traceback (most recent call last):
File "seed_list_generator.py", line 768, in <module>
process = CrawlerProcess(get_project_settings())
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 243, in __init__
super(CrawlerProcess, self).__init__(settings)
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 134, in __init__
self.spider_loader = _get_spider_loader(settings)
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/spiderloader.py", line 61, in from_settings
return cls(settings)
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/spiderloader.py", line 25, in __init__
self._load_all_spiders()
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
for module in walk_modules(name):
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "/root/anaconda2/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/root/Public/company_profiler/profiler/spiders/run_spider.py", line 12, in <module>
process.start()
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 285, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/root/anaconda2/lib/python2.7/site-packages/twisted/internet/base.py", line 1242, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "/root/anaconda2/lib/python2.7/site-packages/twisted/internet/base.py", line 1222, in startRunning
ReactorBase.startRunning(self)
File "/root/anaconda2/lib/python2.7/site-packages/twisted/internet/base.py", line 730, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Python 3.6 traceback:
File "seed_list_generator.py", line 769, in <module>
process = CrawlerProcess(get_project_settings())
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 249, in __init__
super(CrawlerProcess, self).__init__(settings)
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 137, in __init__
self.spider_loader = _get_spider_loader(settings)
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 336, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 61, in from_settings
return cls(settings)
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 25, in __init__
self._load_all_spiders()
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
for module in walk_modules(name):
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "/root/anaconda3/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 978, in _gcd_import
File "<frozen importlib._bootstrap>", line 961, in _find_and_load
File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
File "/root/Public/company_profiler/profiler/spiders/run_spider.py", line 12, in <module>
process.start()
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 291, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/root/anaconda3/lib/python3.6/site-packages/twisted/internet/base.py", line 1242, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "/root/anaconda3/lib/python3.6/site-packages/twisted/internet/base.py", line 1222, in startRunning
ReactorBase.startRunning(self)
File "/root/anaconda3/lib/python3.6/site-packages/twisted/internet/base.py", line 730, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Thanks in advance.

What happens if you change the code like this?
class CoreSpider(scrapy.Spider):
    name = "final"
    custom_settings = {
        'ROBOTSTXT_OBEY': 'False',
        'HTTPCACHE_ENABLED': 'True',
        'LOG_ENABLED': 'False',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'random_useragent.RandomUserAgentMiddleware': 320
        },
    }

    def __init__(self, *args, **kwargs):
        # Python 3
        super().__init__(*args, **kwargs)
        # Python 2
        # super(CoreSpider, self).__init__(*args, **kwargs)
        self.all_ngrams = get_ngrams()
        # logging.DEBUG(self.all_ngrams)
        self.search_term = ""
        self.start_urls = self.read_url()
        self.rules = (
            Rule(LinkExtractor(unique=True), callback='parse', follow=True, process_request='process_request'),
        )
.....
.....
Finally, I put the crawler inside an if __name__ == '__main__': block, and that stopped it successfully:
if __name__ == '__main__':
process = CrawlerProcess(get_project_settings())
process.crawl(CoreSpider)
process.start()
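Why the guard helps, as far as the traceback shows: while constructing CrawlerProcess, Scrapy's spider loader imports every module in the spiders package (the walk_modules / import_module frames above include run_spider.py itself), so any top-level process.start() in that file runs the reactor during import, and the script's own start() then hits the already-used reactor. A small stdlib-only sketch of that import-time behavior (module and variable names here are hypothetical, for illustration only):

```python
import importlib.util
import os
import tempfile

UNGUARDED = "RAN = True  # top-level code: executes on import\n"
GUARDED = (
    "RAN = False\n"
    "if __name__ == '__main__':\n"
    "    RAN = True  # only executes when run as a script\n"
)

def load(source, name):
    # Write the module to a temp file and import it programmatically,
    # the way Scrapy's walk_modules() imports every spiders-package module.
    path = os.path.join(tempfile.mkdtemp(), name + '.py')
    with open(path, 'w') as f:
        f.write(source)
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

print(load(UNGUARDED, 'unguarded').RAN)  # True: ran as a side effect of import
print(load(GUARDED, 'guarded').RAN)      # False: the guard suppressed it
```

With the guard in place, importing the runner module is side-effect free, so the reactor is only started once, by the explicit script invocation.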
It stops the crawler gracefully once it has finished scraping all the URLs.

Can you post the full stack trace of the exception?
@TarunLalwani, I updated the question with the traceback.
I think the sample code may be for Python 3 and a newer version of Scrapy, which could be what is causing the problem. Can you try it with Python 3?
It doesn't work with Python 3 either. I added the Python 3 traceback.
I had been working on this for hours before finding your post. Do you know what is going on behind the scenes? If you put the if __name__ == '__main__': ...