Python Scrapy SgmlLinkExtractor rules and callbacks are giving me trouble
I'm trying to do the following:
class SpiderSpider(CrawlSpider):
    name = "lolies"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/directory/lol2']
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+$']), follow=True),
        Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+/\d+$']), follow=True),
        Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\d+$']), callback=self.parse_loly),
    )

    def parse_loly(self, response):
        print 'Hi this is the loly page %s' % response.url
        return
This gives me:

NameError: name 'self' is not defined

If I change the callback to callback="self.parse_loly", it never seems to get called and print the URL.

The site does seem to crawl fine, though, since I get many Crawled 200 debug messages for the rules.

What might I be doing wrong?

Thanks in advance, guys.

It looks like the whitespace around parse_loly is not aligned correctly. Python is whitespace-sensitive, so to the interpreter parse_loly looks like a function defined outside of SpiderSpider.
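As for the NameError itself: the rules tuple is evaluated while the class body is being defined, before any instance exists, so self is simply not a name at that point. A minimal sketch (the class and attribute names here are made up for illustration) reproducing the error:

```python
# Referencing `self` in a class body fails: class attributes are
# evaluated before any instance (and hence any `self`) exists.
try:
    class Broken:
        callback = self.handler  # NameError raised right here

        def handler(self):
            pass
except NameError as exc:
    message = str(exc)

print(message)  # name 'self' is not defined
```

This is why Scrapy has you name the callback as a string instead: the string is looked up on the spider instance later, at crawl time.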
You may also want to split that rules line into shorter lines.

Try this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SpiderSpider(CrawlSpider):
    name = "lolies"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/directory/lol2/']
    rules = (
        Rule(SgmlLinkExtractor(allow=('\w+$', ))),
        Rule(SgmlLinkExtractor(allow=('\w+/\d+$', ))),
        Rule(SgmlLinkExtractor(allow=('\d+$',)), callback='parse_loly'),
    )

    def parse_loly(self, response):
        print 'Hi this is the loly page %s' % response.url
        return None
Sorry, I left some extra characters in there; I've removed them now. Is it the same error message? NameError: name 'self' is not defined.

I think I've got it. I read through the tutorial and saw that you're meant to reference the callback method as a string, in this case 'parse_loly'. I also streamlined some other code following their examples, such as using start_urls as the base for the rules' allow patterns. Also, when there is no callback, follow=True is the default behavior.

OK! Solved it. I needed to move the rule with the callback above the other rules, because otherwise a match is found before it... just think it through ;)... Thanks!
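The reordering in that last comment works because CrawlSpider uses the first rule whose pattern matches a link, and \w+$ also matches purely numeric paths, so it shadows the \d+$ rule that carries the callback. A small sketch of that shadowing with plain re (no Scrapy needed; the URL is the placeholder domain from the question):

```python
import re

# The allow patterns from the answer, in their original order.
rules = [r'\w+$', r'\w+/\d+$', r'\d+$']

def first_match(url):
    """Return the first pattern that matches, the way CrawlSpider picks a rule."""
    for pattern in rules:
        if re.search(pattern, url):
            return pattern
    return None

url = 'http://www.domain.com/directory/lol2/123'
print(first_match(url))  # \w+$  -- digits satisfy \w, so the callback rule never fires

# Move the \d+$ rule (with the callback) to the front, as in the fix.
rules.insert(0, rules.pop())
print(first_match(url))  # \d+$  -- now the callback rule wins
```

So the order of rules is itself part of the spider's logic: put the most specific pattern (the one with the callback) first.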