Web crawler: can't there be two spiders in the same project?
Tags: web-crawler, scrapy

I am able to generate the first spider OK:
Thu Feb 27 - 01:59 PM > scrapy genspider confluenceChildPages confluence
Created spider 'confluenceChildPages' using template 'crawl' in module:
dirbot.spiders.confluenceChildPages
But when I try to generate another spider, this is what I get:
Thu Feb 27 - 01:59 PM > scrapy genspider xxx confluence
Traceback (most recent call last):
File "/usr/bin/scrapy", line 5, in <module>
pkg_resources.run_script('Scrapy==0.22.2', 'scrapy')
File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 505, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 1245, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
execute()
File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/cmdline.py", line 143, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command
cmd.run(args, opts)
File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/commands/genspider.py", line 68, in run
crawler = self.crawler_process.create_crawler()
File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/crawler.py", line 87, in create_crawler
self.crawlers[name] = Crawler(self.settings)
File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/crawler.py", line 25, in __init__
self.spiders = spman_cls.from_crawler(self)
File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/spidermanager.py", line 35, in from_crawler
sm = cls.from_settings(crawler.settings)
File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/spidermanager.py", line 31, in from_settings
return cls(settings.getlist('SPIDER_MODULES'))
File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/spidermanager.py", line 22, in __init__
for module in walk_modules(name):
File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/utils/misc.py", line 68, in walk_modules
submod = import_module(fullpath)
File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/d/Work/TollOnline/Notes/Issues/JIRA/TOL-821_Review_Toll_Online_Confluence_Pages/dirbot-master/dirbot/spiders/confluenceChildPages.py", line 4, in <module>
from scrapybot.items import ScrapybotItem
ImportError: No module named scrapybot.items
I then tried creating the two spiders:
scrapy genspider confluenceChildPagesWithTags confluence
scrapy genspider confluenceChildPages confluence
and I get the error on the second genspider command.
Update: Wed Mar 05 2014, 2:16:07 PM - adding information related to @Darian's answer, showing that scrapybot only pops up after the first genspider command:
Wed Mar 05 - 02:12 PM > find .
.
./.gitignore
./dirbot
./dirbot/items.py
./dirbot/pipelines.py
./dirbot/settings.py
./dirbot/spiders
./dirbot/spiders/dmoz.py
./dirbot/spiders/__init__.py
./dirbot/__init__.py
./README.rst
./scrapy.cfg
./setup.py
Wed Mar 05 - 02:13 PM > find . -type f -print0 | xargs -0 grep -i scrapybot
Wed Mar 05 - 02:14 PM > scrapy genspider confluenceChildPages confluence
Created spider 'confluenceChildPages' using template 'crawl' in module:
dirbot.spiders.confluenceChildPages
Wed Mar 05 - 02:14 PM > find .
.
./.gitignore
./dirbot
./dirbot/items.py
./dirbot/items.pyc
./dirbot/pipelines.py
./dirbot/settings.py
./dirbot/settings.pyc
./dirbot/spiders
./dirbot/spiders/confluenceChildPages.py
./dirbot/spiders/dmoz.py
./dirbot/spiders/dmoz.pyc
./dirbot/spiders/__init__.py
./dirbot/spiders/__init__.pyc
./dirbot/__init__.py
./dirbot/__init__.pyc
./README.rst
./scrapy.cfg
./setup.py
Wed Mar 05 - 02:17 PM > find . -type f -print0 | xargs -0 grep -i scrapybot
./dirbot/spiders/confluenceChildPages.py:from scrapybot.items import ScrapybotItem
./dirbot/spiders/confluenceChildPages.py: i = ScrapybotItem()
The newly generated confluenceChildPages.py is:
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapybot.items import ScrapybotItem

class ConfluencechildpagesSpider(CrawlSpider):
    name = 'confluenceChildPages'
    allowed_domains = ['confluence']
    start_urls = ['http://www.confluence/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        i = ScrapybotItem()
        #i['domain_id'] = sel.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = sel.xpath('//div[@id="name"]').extract()
        #i['description'] = sel.xpath('//div[@id="description"]').extract()
        return i
So I can see it references scrapybot, but I'm not sure how to fix it.. still a n00b. Showing the directory hierarchy in the hope of a better answer.

Answer: This problem occurs when the spider module's name is the same as the Scrapy project module's name: Python then tries to import the items relative to the spider. So make sure your project module and your spider module do not share a name.

Answer (@Darian): You can see the last lines in the traceback:
File "/d/Work/TollOnline/Notes/Issues/JIRA/TOL-821_Review_Toll_Online_Confluence_Pages/dirbot-master/dirbot/spiders/confluenceChildPages.py", line 4, in <module>
from scrapybot.items import ScrapybotItem
This tells me that the first spider you generated, 'confluenceChildPages', thinks it needs to import items from a module called scrapybot, but that module doesn't exist. If you look inside confluenceChildPages.py you will be able to see the line causing the error.
I'm not actually sure which setting it uses to generate this, but if you search (grep) your project for scrapybot you should find where it comes from, and you can then change it to dirbot, which looks like the module you want.
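For what it's worth, here is a sketch of the suspected mechanism. This is an assumption, not verified against the Scrapy 0.22 source: genspider appears to render its spider templates with Python's string.Template, filling the project name from the BOT_NAME setting rather than from the package layout on disk, and dirbot's stock settings.py sets BOT_NAME = 'scrapybot'.

```python
from string import Template

# Assumed template fragment resembling Scrapy's crawl spider template:
# $project_name is substituted from the BOT_NAME setting, not from the
# actual package name (dirbot) on disk.
import_line = Template("from $project_name.items import ${ProjectName}Item")

# With dirbot's settings.py declaring BOT_NAME = 'scrapybot', the rendered
# import points at a package that does not exist in this project.
rendered = import_line.substitute(project_name="scrapybot",
                                  ProjectName="Scrapybot")
print(rendered)  # -> from scrapybot.items import ScrapybotItem
```

If that assumption holds, changing BOT_NAME in dirbot/settings.py to 'dirbot' (or fixing the import line by hand) should make freshly generated spiders point at the right package.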
You then need to delete the first spider it generated and regenerate it. The error happens on the second create because Scrapy loads the first spider you generated as part of the project, and since that spider has an import error in it, you get the traceback.
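That second part can be illustrated with a self-contained sketch: the spider manager imports every module under SPIDER_MODULES (the walk_modules call visible in the traceback), so a single spider file with a bad import poisons every later command, genspider included. The package and module names below are made up for the demo:

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway spiders package containing one module with a bad
# import, mimicking the broken confluenceChildPages.py.
pkg_dir = tempfile.mkdtemp()
spiders_dir = os.path.join(pkg_dir, "demo_spiders")
os.makedirs(spiders_dir)
open(os.path.join(spiders_dir, "__init__.py"), "w").close()
with open(os.path.join(spiders_dir, "broken.py"), "w") as fh:
    fh.write("from scrapybot.items import ScrapybotItem\n")

sys.path.insert(0, pkg_dir)

# Scrapy's spider manager does roughly this for every module it finds in
# SPIDER_MODULES: import it. One broken module makes the whole walk fail.
try:
    importlib.import_module("demo_spiders.broken")
    error = None
except ImportError as exc:
    error = str(exc)

print(error)  # the missing-scrapybot ImportError, just as in the traceback
```

So the fix order matters: remove (or repair) the broken spider file first, then rerun genspider.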
Comments: Cheers, I've updated my question with the extra information you referred to. OK, so I understand you to be saying that the generated code is looking for 'scrapybot' rather than dirbot. But when I check the unmodified project there are no references to scrapybot at all. I've updated my question to show when the scrapybot references get added.