Python: how can I get the rule to follow the next page?

Tags: python, web-scraping, web-crawler, scrapy

I have set up a rule to get the next page from the start_url, but it doesn't work: it only crawls the start_url page and the links found on it (via parseLinks). It never goes to the next page defined in the rule.

Any help?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy import log
from urlparse import urlparse
from urlparse import urljoin
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/pesquisa/filtro/?tipo=0&local=0'
    ]

    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]/@href')), follow=True),)

    def parse(self, response):
        sel = Selector(response)
        urls = sel.xpath('//div[@id="btReserve"]/../@href').extract()
        for url in urls:
            url = urljoin(response.url, url)
            self.log('URLS: %s' % url)
            yield Request(url, callback=self.parseLinks)

    def parseLinks(self, response):
        sel = Selector(response)
        titulo = sel.xpath('h1/text()').extract()
        morada = sel.xpath('//div[@class="MORADA"]/text()').extract()
        email = sel.xpath('//a[@class="sendMail"][1]/text()')[0].extract()
        url = sel.xpath('//div[@class="contentContacto sendUrl"]/a/text()').extract()
        telefone = sel.xpath('//div[@class="telefone"]/div[@class="contentContacto"]/text()').extract()
        fax = sel.xpath('//div[@class="fax"]/div[@class="contentContacto"]/text()').extract()
        descricao = sel.xpath('//div[@id="tbDescricao"]/p/text()').extract()
        gps = sel.xpath('//td[@class="sendGps"]/@style').extract()

        print titulo, email, morada

You should not override the parse method in a CrawlSpider, otherwise the rules will not be followed.

See the warning in the CrawlSpider documentation:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
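A minimal sketch of that fix, keeping the spider from the question but moving the per-listing callback into a second Rule so that CrawlSpider's own parse stays untouched (the detail-link XPath //div[@id="btReserve"]/.. is an assumption derived from the selector in the question):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/pesquisa/filtro/?tipo=0&local=0']

    rules = (
        # keep following the "seguinte" (next page) link on every results page
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]',)), follow=True),
        # route each listing to parseLinks instead of overriding parse
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="btReserve"]/..',)),
             callback='parseLinks'),
    )

    def parseLinks(self, response):
        # the extraction code from the question goes here, unchanged
        pass

The important part is only that no method is named parse; which XPaths go into each Rule depends on the actual page markup.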


You are using the Spider class flow here:

class MySpider(CrawlSpider) is not the proper class for this; use class MySpider(Spider) instead:

class MySpider(Spider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/pesquisa/filtro/?tipo=0&local=0'
    ]

With the plain Spider class you do not need rules, so discard this line (it is not usable in a Spider):

rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]/@href')), follow=True),)

I have changed parse to parsePage and set the Rule callback to callback='parsePage', but it still never enters def parsePage.
Try using restrict_xpaths=('//a[@id="seguinte"]',), callback='parsePage', follow=True instead.
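Put together, the rule that comment suggests would look roughly like this (parsePage being the renamed callback; note that restrict_xpaths points at the a element, not its @href attribute):

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]',)),
         callback='parsePage', follow=True),
)
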
Check this answer, it will solve the problem: