Error in the official Scrapy example?


python scrapy


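I tried the Scrapy usage example from the official documentation (the one under the heading "Return multiple Requests and items from a single callback").

I only changed the domain to point to a real website: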

import scrapy

class MySpider(scrapy.Spider):
    name = 'huffingtonpost'
    allowed_domains = ['huffingtonpost.com/']
    start_urls = [
        'http://www.huffingtonpost.com/politics/',
        'http://www.huffingtonpost.com/entertainment/',
        'http://www.huffingtonpost.com/media/',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

But I get a ValueError.

Any ideas?

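Some of the extracted links are relative (for example, /news/hillary-clinton/). scrapy.Request raises a ValueError when it is given a URL without a scheme, so you should convert such links to absolute URLs (http://www.huffingtonpost.com/news/hillary-clinton/) before requesting them: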

import scrapy

class MySpider(scrapy.Spider):
    name = 'huffingtonpost'
    allowed_domains = ['huffingtonpost.com']  # domain name only, no trailing slash
    start_urls = [
        'http://www.huffingtonpost.com/politics/',
        'http://www.huffingtonpost.com/entertainment/',
        'http://www.huffingtonpost.com/media/',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

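        for url in response.xpath('//a/@href').extract():
            if url.startswith('#'):
                # skip in-page anchors such as href="#top"
                continue
            if url.startswith('/'):
                # turn a site-relative path into an absolute URL
                url = 'http://www.huffingtonpost.com' + url
            if not url.startswith(('http://', 'https://')):
                # skip javascript:, mailto: and other non-HTTP links
                continue
            yield scrapy.Request(url, callback=self.parse)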
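
Instead of concatenating the domain by hand, relative links can be resolved against the URL of the page they came from. The following is a minimal sketch using only the standard library (Python 3); the helper name resolve_link is hypothetical and not part of Scrapy. Inside a spider callback, response.urljoin(href) performs the same kind of join against the response's own URL.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(page_url, href):
    """Return an absolute http(s) URL for href, or None if it should be skipped.

    resolve_link is a hypothetical helper name used for illustration only.
    """
    href = href.strip()
    # In-page anchors such as "#top" point back at the same page; skip them.
    if href.startswith("#"):
        return None
    # Resolve relative paths such as "/news/hillary-clinton/" against the page URL.
    absolute = urljoin(page_url, href)
    # Skip javascript:, mailto: and any other non-HTTP scheme.
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


if __name__ == "__main__":
    page = "http://www.huffingtonpost.com/politics/"
    for link in ["/news/hillary-clinton/", "#top", "javascript:alert('Hello');"]:
        print(repr(link), "->", resolve_link(page, link))
```

In the spider, each extracted href would go through a check like this, and a request would be yielded only when the result is not None.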

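This works for me. I'm using the same version, 2.7.6, here on Ubuntu 14.04.3... I don't know what's causing it...

So the official Scrapy example assumes that all returned links are absolute? (That is probably a wrong assumption?)

response.xpath('//a/@href').extract() just extracts the href values from the a tags. A value can be an absolute or relative URL, a link to an element with a given id on the same page (like href="#top"), or even a script (like href="javascript:alert('Hello');").

I'm not disagreeing with what you said... but if that is the case, the example is inappropriate, especially as an introductory one.

I don't see the example using huffingtonpost.com pages.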
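
The kinds of href value mentioned in the comments can be told apart before any request is built. This is a rough sketch using only the standard library (Python 3); the function name classify_href and its category labels are assumptions made for illustration, not Scrapy terminology.

```python
from urllib.parse import urlparse


def classify_href(href):
    """Label an href value as 'fragment', 'absolute', 'other-scheme' or 'relative'.

    The function name and the labels are hypothetical, chosen for this sketch.
    """
    href = href.strip()
    # href="#top" style links point at an element on the current page.
    if href.startswith("#"):
        return "fragment"
    scheme = urlparse(href).scheme
    # Only http and https links can be fetched directly as they are.
    if scheme in ("http", "https"):
        return "absolute"
    # javascript:, mailto: and similar carry a scheme but are not web pages.
    if scheme:
        return "other-scheme"
    # No scheme at all: the value has to be joined with the page URL first.
    return "relative"


if __name__ == "__main__":
    samples = [
        "http://www.huffingtonpost.com/media/",
        "/news/hillary-clinton/",
        "#top",
        "javascript:alert('Hello');",
    ]
    for sample in samples:
        print(classify_href(sample), sample)
```

Only the 'absolute' values can be passed to scrapy.Request unchanged; 'relative' ones need joining with the page URL first, and the other two kinds are normally skipped.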