Python Scrapy Rule LinkExtractor next page


python web-scraping scrapy


I am building a spider that traverses multiple paginated pages and extracts data from the site:

Here is the spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html
from usnews.items import UsnewsItem


class UniversitiesSpider(scrapy.Spider):
    name = "universities"
    allowed_domains = ["usnews.com"]
    start_urls = (
        'http://www.usnews.com/education/best-global-universities/neuroscience-behavior/',
        )

    #Rules = [
    #Rule(LinkExtractor(allow=(), restrict_xpaths=('.//a[@class="pager_link"]',)), callback="parse", follow= True)
    #]

    def parse(self, response):
        for sel in response.xpath('.//div[@class="sep"]'):
            item = UsnewsItem()
            item['name'] = sel.xpath('.//h2[@class="h-taut"]/a/text()').extract()
            item['location'] = sel.xpath('.//span[@class="t-dim t-small"]/text()').extract()
            item['ranking'] = sel.xpath('.//div[3]/div[2]/text()').extract()
            item['score'] = sel.xpath('.//div[@class="t-large t-strong t-constricted"]/text()').extract()
            #print(sel.xpath('.//text()').extract()
            yield item

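The rule for following the pagination does not seem to have any effect, because the code only outputs data for the first page. How can I apply the rule correctly so that the spider traverses all 15 pages and extracts the 4 fields (name, location, ranking, score) from the site?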

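It looks like this may be related to the warning given in the documentation. Specifically: the Rule you wrote defines its callback as parse, but the warning explicitly says to avoid doing so, because if you then override the parse method (as you did in your spider), the spider will no longer work. The documentation gives an example of how to define a custom callback (basically, just don't name it parse).

Also, just to be sure: I assume you uncomment the Rules block when you actually run the spider, because it is currently commented out and will not run in the code you posted.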
Change callback='parse' to callback='parse_start_url' and rename your method to match. The CrawlSpider class already has its own parse method, so avoid reusing that name.
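To use the rules attribute (spelled "rules", not "Rules" as in your code), you need to subclass scrapy.spiders.CrawlSpider, not scrapy.Spider. As @steve said in his answer, you should not redefine CrawlSpider's parse method, because that is where the "magic" of the rules happens.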