Python: can't follow links with Scrapy
I have created a spider that extends CrawlSpider. The problem is that I need to parse both the start URL (which happens to coincide with the hostname) and some of the links it contains.

So I defined a rule like:

    rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_items', follow=True)]

but nothing happened.

Then I tried defining a set of rules like:

    rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_items', follow=True),
             Rule(SgmlLinkExtractor(allow=['/']), callback='parse_items', follow=True)]

Now the problem is that the spider parses everything.

How can I tell the spider to parse the start URL as well as only some of the links it contains?
Update:

I tried overriding the parse_start_url method, so now I can get data from the start page, but it still does not follow the links defined with the Rule:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from techCrunch.items import Article


class ExampleSpider(CrawlSpider):
    name = 'TechCrunchCrawler'
    start_urls = ['http://techcrunch.com']
    allowed_domains = ['techcrunch.com']

    rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_links', follow=True)]

    def parse_start_url(self, response):
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++'
        return self.parse_links(response)

    def parse_links(self, response):
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++'
        articles = []
        for i in HtmlXPathSelector(response).select('//h2[@class="headline"]/a'):
            article = Article()
            article['title'] = i.select('./@title').extract()
            article['link'] = i.select('./@href').extract()
            articles.append(article)
        return articles
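Side note: the code above fills an Article item with title and link fields, but the item definition itself is not shown in the question. A minimal sketch of what techCrunch/items.py could look like (an assumption, since the real definition is not posted):

    from scrapy.item import Item, Field

    # Hypothetical minimal item, inferred from the fields used above;
    # adjust to match the asker's real items.py.
    class Article(Item):
        title = Field()
        link = Field()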
I had a similar problem in the past.
I stuck with BaseSpider.
Try this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.contrib.loader import XPathItemLoader

from techCrunch.items import Article


class techCrunch(BaseSpider):
    name = 'techCrunchCrawler'
    allowed_domains = ['techcrunch.com']

    # This gets your start page and hands it to the parse manager
    def start_requests(self):
        return [Request("http://techcrunch.com", callback=self.parseMgr)]

    # The parse manager decides what to parse and starts page extraction
    def parseMgr(self, response):
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++'
        yield self.pageParser(response)
        nextPage = HtmlXPathSelector(response).select("//div[@class='page-next']/a/@href").extract()
        if nextPage:
            yield Request(nextPage[0], callback=self.parseMgr)

    # The page parser only parses the pages and returns items on each page call
    def pageParser(self, response):
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++'
        loader = XPathItemLoader(item=Article(), response=response)
        loader.add_xpath('title', '//h2[@class="headline"]/a/@title')
        loader.add_xpath('link', '//h2[@class="headline"]/a/@href')
        return loader.load_item()
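Note that this approach sidesteps the Rule/LinkExtractor machinery entirely: parseMgr follows the pagination manually by yielding a new Request for the page-next link on each page, while pageParser only builds items. Assuming the project is named techCrunch, as the items import suggests, the spider can be run with something like scrapy crawl techCrunchCrawler -o items.json (the exact feed-export flags depend on your Scrapy version).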
You forgot to backslash-escape the letter d as \d:
>>> SgmlLinkExtractor(allow=r'/page/d+').extract_links(response)
[]
>>> SgmlLinkExtractor(allow=r'/page/\d+').extract_links(response)
[Link(url='http://techcrunch.com/page/2/', text=u'Next Page',...)]
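With that one character fixed, the original CrawlSpider approach should work as well; a sketch of the corrected rule, keeping the callback name from the updated spider in the question:

    rules = [Rule(SgmlLinkExtractor(allow=[r'/page/\d+']), callback='parse_links', follow=True)]

Combined with the parse_start_url override already in place, the spider parses the start page and follows the /page/N pagination links.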
Can you post some of your code here so we can pinpoint the problem?