Python Scrapy SgmlLinkExtractor rules and callbacks are giving me a headache


I am trying to do this:

class SpiderSpider(CrawlSpider):
    name = "lolies"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/directory/lol2']
    rules = (Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+$']), follow=True), Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+/\d+$']), follow=True),Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\d+$']), callback=self.parse_loly))

def parse_loly(self, response):
    print 'Hi this is the loly page %s' % response.url
    return
This gives me:

NameError: name 'self' is not defined
If I change the callback to
callback="self.parse_loly"
it never seems to be called, and the URL is never printed.

However, the site seems to be crawling fine, because I get many "Crawled (200)" DEBUG messages for the rules.

What might I be doing wrong?


Thanks in advance, guys.

It seems the indentation of
parse_loly
is off. Python is whitespace-sensitive, so to the interpreter it looks like a method defined outside of SpiderSpider.

You may also want to split the rules line into shorter lines.

Try this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SpiderSpider(CrawlSpider):
    name = "lolies"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/directory/lol2/']
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'\w+$',))),
        Rule(SgmlLinkExtractor(allow=(r'\w+/\d+$',))),
        Rule(SgmlLinkExtractor(allow=(r'\d+$',)), callback='parse_loly'),
    )

    def parse_loly(self, response):
        print 'Hi this is the loly page %s' % response.url
        return None

Sorry, I left some extra characters in there; I've removed them now. Is it the same error message? NameError: name 'self' is not defined.

I think I've got it - I read through the tutorial and saw that you are expected to reference the method as a string, in this case
'parse_loly'
. I also refined some other code based on their example, such as using the start_urls path as the base for the rules' allow patterns. Also, if there is no callback,
follow=True
is the default behavior. OK! Solved it. I needed to move the rule with the callback above the other rules, because otherwise a match was found before it... just think about it for a second ;) ... Thanks!
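The ordering fix in that last comment works because Scrapy's CrawlSpider uses the first Rule whose pattern matches a link, in the order the rules are defined. Since `\w+` also matches digits, a numeric URL like `.../123` is consumed by the broader `\w+$` rule (which has no callback) before the `\d+$` rule is ever tried. A minimal sketch with plain `re` (not Scrapy itself) illustrating the overlap:

```python
import re

# The extractor patterns from the answer, in their original order.
patterns = [r'\w+$', r'\w+/\d+$', r'\d+$']


def first_match(url, pats):
    """Return the first pattern matching the URL, mimicking
    Scrapy's first-rule-wins behaviour for overlapping rules."""
    for pat in pats:
        if re.search(pat, url):
            return pat
    return None


# A numeric page is captured by the broader \w+$ rule first...
print(first_match('http://www.domain.com/directory/lol2/123', patterns))
# ...so the rule carrying the callback must come first to ever fire.
reordered = [r'\d+$', r'\w+/\d+$', r'\w+$']
print(first_match('http://www.domain.com/directory/lol2/123', reordered))
```

With the original order the broad pattern `\w+$` wins; after reordering, `\d+$` wins, which is why moving the callback rule to the top made `parse_loly` run.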