Python: how do I persist the crawler's state when it dies abruptly?

Tags: python, scrapy, web-crawler

This question is about the following situation.

I have followed the link below to persist the crawler's state.

All of this works correctly when the crawler ends properly, via an interrupt or a single Ctrl+C.

However, I noticed that the spider does not shut down cleanly when:

  • Ctrl+C is pressed more than once
  • the server runs into capacity problems
  • anything else causes it to end abruptly

When the spider is run again after such an abrupt end, it shuts itself down at the first crawled URL.

How can I keep the crawler's state persistent when any of the above happens? Otherwise it crawls all the URLs again.
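The multiple-Ctrl+C case is worth spelling out: Scrapy treats the first SIGINT as a graceful shutdown request (finish in-flight requests, persist state) and a second one as a forced, unclean stop. A minimal sketch of that two-stage pattern in standalone Python (this is an illustration, not Scrapy's actual handler):

```python
import os
import signal

shutdown_requested = False

def handle_sigint(signum, frame):
    """First SIGINT: ask for a graceful shutdown. Second SIGINT: abort now."""
    global shutdown_requested
    if not shutdown_requested:
        shutdown_requested = True   # graceful: finish work, save state on the way out
    else:
        raise SystemExit(1)         # forced: exit immediately, state may be lost

signal.signal(signal.SIGINT, handle_sigint)
os.kill(os.getpid(), signal.SIGINT)   # simulate the first Ctrl+C
print(shutdown_requested)             # True: graceful shutdown requested
```

A second SIGINT delivered before the loop drains would raise `SystemExit`, skipping any state-saving code, which matches the unclean shutdowns described above.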

Log from when the crawler runs again:

    2016-08-30 08:14:11 [scrapy] INFO: Scrapy 1.1.2 started (bot: maxverstappen)
    2016-08-30 08:14:11 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'maxverstappen.spiders', 'SPIDER_MODULES': ['maxverstappen.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'maxverstappen'}
    2016-08-30 08:14:11 [scrapy] INFO: Enabled extensions:
    ['scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.spiderstate.SpiderState']
    2016-08-30 08:14:11 [scrapy] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2016-08-30 08:14:11 [scrapy] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2016-08-30 08:14:12 [scrapy] INFO: Enabled item pipelines:
    ['maxverstappen.pipelines.MaxverstappenPipeline']
    2016-08-30 08:14:12 [scrapy] INFO: Spider opened
    2016-08-30 08:14:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-08-30 08:14:12 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
    2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.inautonews.com/robots.txt> (referer: None)
    2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.thecheckeredflag.com/robots.txt> (referer: None)
    2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.inautonews.com/> (referer: None)
    2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.thecheckeredflag.com/> (referer: None)
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.inautonews.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.newsnow.co.uk': <GET http://www.newsnow.co.uk/h/Life+&+Style/Motoring>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.americanmuscle.com': <GET http://www.americanmuscle.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.extremeterrain.com': <GET http://www.extremeterrain.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.autoanything.com': <GET http://www.autoanything.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.bmwcoop.com': <GET http://www.bmwcoop.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.automotorblog.com': <GET http://www.automotorblog.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/inautonews>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/inautonews>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'plus.google.com': <GET https://plus.google.com/+Inautonewsplus>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.histats.com': <GET http://www.histats.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.hamiltonf1site.com': <GET http://www.hamiltonf1site.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.joshwellsracing.com': <GET http://www.joshwellsracing.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.jensonbuttonfan.net': <GET http://www.jensonbuttonfan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.fernandoalonsofan.net': <GET http://www.fernandoalonsofan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.markwebberfan.net': <GET http://www.markwebberfan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.felipemassafan.net': <GET http://www.felipemassafan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.nicorosbergfan.net': <GET http://www.nicorosbergfan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.nickheidfeldfan.net': <GET http://www.nickheidfeldfan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.lewishamiltonblog.net': <GET http://www.lewishamiltonblog.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.timoglockfan.net': <GET http://www.timoglockfan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.jarnotrullifan.net': <GET http://www.jarnotrullifan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.brunosennafan.net': <GET http://www.brunosennafan.net/>
    2016-08-30 08:14:12 [scrapy] INFO: Closing spider (finished)
    2016-08-30 08:14:12 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 896,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 4,
     'downloader/response_bytes': 35353,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 4,
     'dupefilter/filtered': 149,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2016, 8, 30, 8, 14, 12, 724932),
     'log_count/DEBUG': 28,
     'log_count/INFO': 7,
     'offsite/domains': 22,
     'offsite/filtered': 23,
     'request_depth_max': 1,
     'response_received_count': 4,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/disk': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/disk': 2,
     'start_time': datetime.datetime(2016, 8, 30, 8, 14, 12, 13456)}
    2016-08-30 08:14:12 [scrapy] INFO: Spider closed (finished)
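Scrapy's built-in persistence (running the spider with `-s JOBDIR=...`) writes the request queue, dupefilter, and spider state to disk, but an abrupt kill mid-write can leave those files inconsistent. A general defence against that is to checkpoint state atomically: write to a temporary file and rename it into place, so a crash at any instant leaves either the old or the new file, never a truncated one. A minimal sketch (the class and file name are hypothetical, not part of Scrapy):

```python
import json
import os
import tempfile

class CrawlCheckpoint:
    """Crash-tolerant store for a seen-URL set (a sketch, not Scrapy's JOBDIR)."""

    def __init__(self, path):
        self.path = path
        self.seen = set()
        if os.path.exists(path):
            with open(path) as f:
                self.seen = set(json.load(f))

    def mark(self, url):
        """Record a URL and checkpoint immediately (batch this in real use)."""
        self.seen.add(url)
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(sorted(self.seen), f)
        os.replace(tmp, self.path)  # atomic rename: a kill mid-write cannot truncate the file

cp = CrawlCheckpoint("seen_urls.json")
cp.mark("http://www.inautonews.com/")
restarted = CrawlCheckpoint("seen_urls.json")  # simulate a restart after a crash
print("http://www.inautonews.com/" in restarted.seen)  # True
```

With a scheme like this, a restarted crawl can reload the checkpoint and skip URLs it has already processed, instead of depending on shutdown-time hooks that an abrupt death never runs.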
    