Using CSS and XPath selectors in Scrapy (Python)

I'm working through the official Scrapy tutorial, which scrapes data from quotes.toscrape.com and shows how to do it with the following spider:

import scrapy


class QuotesSpiderCss(scrapy.Spider):
    name = "quotes_css"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        quotes = response.css('div.quote')
        for quote in quotes:
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags::text').extract()
            }

When I run the spider and export to a JSON file, it returns what I expected:

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        "]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
...]

I then tried to write the same spider using XPath instead of CSS:

class QuotesSpiderXpath(scrapy.Spider):
    name = 'quotes_xpath'
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'text': quote.xpath("//span[@class='text']/text()").extract_first(),
                'author': quote.xpath("//small[@class='author']/text()").extract_first(),
                'tags': quote.xpath("//div[@class='tags']/text()").extract()
            }

But this spider returns a list in which every item is the same quote:

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
...]

Thanks in advance.

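You always get the same quote because you are not using relative XPath expressions. An expression that starts with `//` searches the whole document, no matter which selector you call it on, so `extract_first()` returns the first match on the page in every iteration and `extract()` returns the matches for all quotes.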

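Prefix the XPath expressions with a dot so they are evaluated relative to the current quote, as in the following parse method: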

def parse(self, response):
    quotes = response.xpath('//div[@class="quote"]')
    for quote in quotes:
        yield {
            'text': quote.xpath(".//span[@class='text']/text()").extract_first(),
            'author': quote.xpath(".//small[@class='author']/text()").extract_first(),
            'tags': quote.xpath(".//div[@class='tags']/text()").extract()
        }