Python: Scrapy parse_item method not called

Here is my code. My parse_item method is never called.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class SjsuSpider(CrawlSpider):
    name = 'sjsu'
    allowed_domains = ['sjsu.edu']
    start_urls = ['http://cs.sjsu.edu/']

    # allow=() is used to match all links
    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        print "some message"
        open("sjsupages", 'a').write(response.body)
Your allowed domain should be 'cs.sjsu.edu'. Scrapy does not allow subdomains of the allowed domains.
Also, when several rules match the same link, only the first matching rule is applied, so the callback on your second rule never runs. The two rules can be combined into one:

rules = [Rule(SgmlLinkExtractor(), follow=True, callback='parse_item')]
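The "first matching rule wins" behavior can be sketched without Scrapy itself. This is a simplified, hypothetical model: FakeRule and pick_rule are illustrative stand-ins for Scrapy's Rule and its internal link dispatch, not real Scrapy APIs.

```python
# Hypothetical model of CrawlSpider rule selection: rules are tried in order,
# and only the FIRST rule whose extractor matches a link is used for it.

class FakeRule(object):
    def __init__(self, matches, follow=False, callback=None):
        self.matches = matches    # predicate: does this rule's extractor match?
        self.follow = follow
        self.callback = callback

def pick_rule(rules, link):
    """Return the first rule whose extractor matches the link, or None."""
    for rule in rules:
        if rule.matches(link):
            return rule
    return None

match_all = lambda link: True

# Two rules that both match everything, as in the question: the first rule
# (no callback) wins every time, so parse_item is never reached.
broken = [FakeRule(match_all, follow=True),
          FakeRule(match_all, callback='parse_item')]
assert pick_rule(broken, 'http://cs.sjsu.edu/about').callback is None

# One combined rule both follows links and fires the callback.
fixed = [FakeRule(match_all, follow=True, callback='parse_item')]
assert pick_rule(fixed, 'http://cs.sjsu.edu/about').callback == 'parse_item'
```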
Is it necessary to specify a value for allow? I think your spider is not finding any items to parse. I am not sure... but that makes sense. What can I put in allow if I want to crawl everything?

Why not rules = [Rule(SgmlLinkExtractor(), follow=True, callback=self.parse_item)]?

self.parse_item does not work, as self is not in scope at class-definition time. That is why passing 'parse_item' as a string makes sense.
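The point about scope can be shown in a few lines of plain Python. The Spider class below is a toy stand-in, not a Scrapy class; the getattr lookup at the end is a rough illustration of how a string name can be resolved to a bound method at runtime.

```python
# At class-definition time the class body executes before any instance exists,
# so the name `self` is undefined there; a method name stored as a string can
# be resolved later, once an instance is available.

class Spider(object):
    # callback = self.parse_item   # would raise NameError: 'self' is not defined
    callback = 'parse_item'        # store the method's name as a string instead

    def parse_item(self, response):
        return 'parsed: %s' % response

spider = Spider()
# Resolve the string to a bound method at runtime.
method = getattr(spider, spider.callback)
assert method('page') == 'parsed: page'
```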