Python: how can I manually stop a Scrapy spider once it has scraped all the provided URLs?
I have implemented a crawler that reads URLs from a text file, scrapes them all, and should then stop. My implementation:
class CoreSpider(scrapy.Spider):
    name = "final"
    custom_settings = {
        'ROBOTSTXT_OBEY': 'False',
        'HTTPCACHE_ENABLED': 'True',
        'LOG_ENABLED': 'False',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'random_useragent.RandomUserAgentMiddleware': 320
        },
    }

    def __init__(self):
        self.all_ngrams = get_ngrams()
        # logging.DEBUG(self.all_ngrams)
        self.search_term = ""
        self.start_urls = self.read_url()
        self.rules = (
            Rule(LinkExtractor(unique=True), callback='parse', follow=True, process_request='process_request'),
        )
    .....
    .....
I run this spider from a script like this:
process = CrawlerProcess(get_project_settings())
process.crawl(CoreSpider)
process.start()
Once it has finished scraping all the URLs, it raises the error:
twisted.internet.error.ReactorNotRestartable
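For intuition about this error: Twisted's reactor is a one-shot event loop and cannot be started again after it stops. Python's standard-library asyncio event loop has the same constraint once closed, so a minimal stdlib-only analogy (this is an illustration of the one-shot-loop behavior, not Scrapy code) looks like this:

```python
import asyncio

async def noop():
    pass

# Run an event loop once, then shut it down -- mirroring what a completed
# reactor.run() / reactor.stop() cycle does in Twisted.
loop = asyncio.new_event_loop()
loop.run_until_complete(noop())
loop.close()

# Attempting to reuse the finished loop fails, just as a second
# reactor.run() raises twisted.internet.error.ReactorNotRestartable.
try:
    loop.run_until_complete(noop())
except RuntimeError as e:
    print(e)  # Event loop is closed
```

The takeaway is that the error means reactor.run() is being reached twice in the same process, so something in the script must be starting the reactor more than once.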
I also tried an implementation like the one below; it raises the same error as before:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
d = runner.crawl(CoreSpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
Then I tried running the spider like this:
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(CoreSpider)
    reactor.stop()

crawl()
reactor.run()
But it still raises the same error. How can I manually stop the spider once all the URLs have been scraped?
Update: Python 2.7 stack trace:
Traceback (most recent call last):
File "seed_list_generator.py", line 768, in <module>
process = CrawlerProcess(get_project_settings())
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 243, in __init__
super(CrawlerProcess, self).__init__(settings)
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 134, in __init__
self.spider_loader = _get_spider_loader(settings)
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/spiderloader.py", line 61, in from_settings
return cls(settings)
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/spiderloader.py", line 25, in __init__
self._load_all_spiders()
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
for module in walk_modules(name):
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "/root/anaconda2/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/root/Public/company_profiler/profiler/spiders/run_spider.py", line 12, in <module>
process.start()
File "/root/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 285, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/root/anaconda2/lib/python2.7/site-packages/twisted/internet/base.py", line 1242, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "/root/anaconda2/lib/python2.7/site-packages/twisted/internet/base.py", line 1222, in startRunning
ReactorBase.startRunning(self)
File "/root/anaconda2/lib/python2.7/site-packages/twisted/internet/base.py", line 730, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Python 3.6 traceback:
File "seed_list_generator.py", line 769, in <module>
process = CrawlerProcess(get_project_settings())
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 249, in __init__
super(CrawlerProcess, self).__init__(settings)
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 137, in __init__
self.spider_loader = _get_spider_loader(settings)
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 336, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 61, in from_settings
return cls(settings)
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 25, in __init__
self._load_all_spiders()
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
for module in walk_modules(name):
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "/root/anaconda3/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 978, in _gcd_import
File "<frozen importlib._bootstrap>", line 961, in _find_and_load
File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
File "/root/Public/company_profiler/profiler/spiders/run_spider.py", line 12, in <module>
process.start()
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 291, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/root/anaconda3/lib/python3.6/site-packages/twisted/internet/base.py", line 1242, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "/root/anaconda3/lib/python3.6/site-packages/twisted/internet/base.py", line 1222, in startRunning
ReactorBase.startRunning(self)
File "/root/anaconda3/lib/python3.6/site-packages/twisted/internet/base.py", line 730, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Thanks in advance.

What happens if you change the code like this?
class CoreSpider(scrapy.Spider):
    name = "final"
    custom_settings = {
        'ROBOTSTXT_OBEY': 'False',
        'HTTPCACHE_ENABLED': 'True',
        'LOG_ENABLED': 'False',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'random_useragent.RandomUserAgentMiddleware': 320
        },
    }

    def __init__(self, *args, **kwargs):
        # Python 3
        super().__init__(*args, **kwargs)
        # Python 2
        # super(CoreSpider, self).__init__(*args, **kwargs)
        self.all_ngrams = get_ngrams()
        # logging.DEBUG(self.all_ngrams)
        self.search_term = ""
        self.start_urls = self.read_url()
        self.rules = (
            Rule(LinkExtractor(unique=True), callback='parse', follow=True, process_request='process_request'),
        )
.....
.....
Finally, I put the crawler inside an if __name__ == '__main__': block, and that stopped it successfully:
if __name__ == '__main__':
process = CrawlerProcess(get_project_settings())
process.crawl(CoreSpider)
process.start()
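Why the guard helps, as far as the traceback shows: while constructing CrawlerProcess, Scrapy's spider loader imports every module in the spiders package (the walk_modules / import_module frames above include run_spider.py itself), so any top-level process.start() in that file runs the reactor during import, and the script's own start() then hits the already-used reactor. A small stdlib-only sketch of that import-time behavior (module and variable names here are hypothetical, for illustration only):

```python
import importlib.util
import os
import tempfile

UNGUARDED = "RAN = True  # top-level code: executes on import\n"
GUARDED = (
    "RAN = False\n"
    "if __name__ == '__main__':\n"
    "    RAN = True  # only executes when run as a script\n"
)

def load(source, name):
    # Write the module to a temp file and import it programmatically,
    # the way Scrapy's walk_modules() imports every spiders-package module.
    path = os.path.join(tempfile.mkdtemp(), name + '.py')
    with open(path, 'w') as f:
        f.write(source)
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

print(load(UNGUARDED, 'unguarded').RAN)  # True: ran as a side effect of import
print(load(GUARDED, 'guarded').RAN)      # False: the guard suppressed it
```

With the guard in place, importing the runner module is side-effect free, so the reactor is only started once, by the explicit script invocation.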
It stops the crawler gracefully once it has finished scraping all the URLs.

Can you post the full stack trace of the exception?
@TarunLalwani, I updated the question with the traceback.
I think the sample code may be for Python 3 and a newer version of Scrapy, which could be what is causing the problem. Can you try it with Python 3?
It doesn't work with Python 3 either. I added the Python 3 traceback.
I had been working on this for hours before finding your post. Do you know what is going on behind the scenes? If you put the if __name__ == '__main__': ...