Python: can't follow links with Scrapy
I have created a spider that extends CrawlSpider. The problem is that I need to parse both the start URL (which happens to coincide with the hostname) and some of the links it contains.

So I defined a rule like:

    rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_items', follow=True)]

but nothing happened.

Then I tried defining a set of rules like:

    rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_items', follow=True),
             Rule(SgmlLinkExtractor(allow=['/']), callback='parse_items', follow=True)]

Now the problem is that the spider parses everything.

How can I tell the spider to parse the start URL as well as only some of the links it contains?
Update:

I tried overriding the parse_start_url method, so now I can get data from the start page, but it still does not follow the links defined with the Rule:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from techCrunch.items import Article


class ExampleSpider(CrawlSpider):
    name = 'TechCrunchCrawler'
    start_urls = ['http://techcrunch.com']
    allowed_domains = ['techcrunch.com']

    rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_links', follow=True)]

    def parse_start_url(self, response):
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++'
        return self.parse_links(response)

    def parse_links(self, response):
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++'
        articles = []
        for i in HtmlXPathSelector(response).select('//h2[@class="headline"]/a'):
            article = Article()
            article['title'] = i.select('./@title').extract()
            article['link'] = i.select('./@href').extract()
            articles.append(article)
        return articles
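Side note: the code above fills an Article item with title and link fields, but the item definition itself is not shown in the question. A minimal sketch of what techCrunch/items.py could look like (an assumption, since the real definition is not posted):

    from scrapy.item import Item, Field

    # Hypothetical minimal item, inferred from the fields used above;
    # adjust to match the asker's real items.py.
    class Article(Item):
        title = Field()
        link = Field()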
I had a similar problem in the past.
I stuck with BaseSpider.
Try this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.contrib.loader import XPathItemLoader

from techCrunch.items import Article


class techCrunch(BaseSpider):
    name = 'techCrunchCrawler'
    allowed_domains = ['techcrunch.com']

    # This gets your start page and hands it to the parse manager
    def start_requests(self):
        return [Request("http://techcrunch.com", callback=self.parseMgr)]

    # The parse manager decides what to parse and starts page extraction
    def parseMgr(self, response):
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++'
        yield self.pageParser(response)
        nextPage = HtmlXPathSelector(response).select("//div[@class='page-next']/a/@href").extract()
        if nextPage:
            yield Request(nextPage[0], callback=self.parseMgr)

    # The page parser only parses the pages and returns items on each page call
    def pageParser(self, response):
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++'
        loader = XPathItemLoader(item=Article(), response=response)
        loader.add_xpath('title', '//h2[@class="headline"]/a/@title')
        loader.add_xpath('link', '//h2[@class="headline"]/a/@href')
        return loader.load_item()
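Note that this approach sidesteps the Rule/LinkExtractor machinery entirely: parseMgr follows the pagination manually by yielding a new Request for the page-next link on each page, while pageParser only builds items. Assuming the project is named techCrunch, as the items import suggests, the spider can be run with something like scrapy crawl techCrunchCrawler -o items.json (the exact feed-export flags depend on your Scrapy version).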
You forgot to backslash-escape the letter d as \d:
>>> SgmlLinkExtractor(allow=r'/page/d+').extract_links(response)
[]
>>> SgmlLinkExtractor(allow=r'/page/\d+').extract_links(response)
[Link(url='http://techcrunch.com/page/2/', text=u'Next Page',...)]
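With that one character fixed, the original CrawlSpider approach should work as well; a sketch of the corrected rule, keeping the callback name from the updated spider in the question:

    rules = [Rule(SgmlLinkExtractor(allow=[r'/page/\d+']), callback='parse_links', follow=True)]

Combined with the parse_start_url override already in place, the spider parses the start page and follows the /page/N pagination links.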
Can you post some of your code here so we can pinpoint the problem?