Python: Using CSS and XPath selectors in Scrapy

I am following the official Scrapy tutorial, which shows how to scrape quote data using the following spider:
import scrapy


class QuotesSpiderCss(scrapy.Spider):
    name = "quotes_css"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        quotes = response.css('div.quote')
        for quote in quotes:
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags::text').extract()
            }
When I run the spider and export the results to a JSON file, it returns the expected content:
[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n "]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["\n Tags:\n ", " \n \n ", "\n \n ", "\n \n "]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n "]},
...]
I then tried to write the same spider using XPath instead of CSS:
import scrapy


class QuotesSpiderXpath(scrapy.Spider):
    name = 'quotes_xpath'
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'text': quote.xpath("//span[@class='text']/text()").extract_first(),
                'author': quote.xpath("//small[@class='author']/text()").extract_first(),
                'tags': quote.xpath("//div[@class='tags']/text()").extract()
            }
But this spider returns a list in which every item contains the same quote:
[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n "]},
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n "]},
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n "]},
...]
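Thanks in advance.

You always get the same quote because you are not using relative XPaths. An expression that starts with // searches the whole document, regardless of which selector it is called on, so extract_first() returns the first match on the page for every quote, and extract() returns the tag text of all quotes at once. See the section on working with relative XPaths in the Scrapy documentation. Prefix the XPath expressions with a dot, as in the following parse method: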
def parse(self, response):
    quotes = response.xpath('//div[@class="quote"]')
    for quote in quotes:
        # The leading dot makes each XPath relative to the current quote.
        yield {
            'text': quote.xpath(".//span[@class='text']/text()").extract_first(),
            'author': quote.xpath(".//small[@class='author']/text()").extract_first(),
            'tags': quote.xpath(".//div[@class='tags']/text()").extract()
        }
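
To see the difference between a document-wide search and one scoped to a single quote without running a crawl, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, which is an assumption made only to keep the example dependency-free. ElementTree does not accept absolute paths on an element, so the buggy query is emulated by searching from the document root. The markup and the helper names first_text_in_document and text_in_quote are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the markup of the quotes page.
PAGE = """
<html><body>
  <div class="quote">
    <span class="text">Quote A</span><small class="author">Author A</small>
  </div>
  <div class="quote">
    <span class="text">Quote B</span><small class="author">Author B</small>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
quotes = root.findall(".//div[@class='quote']")


def first_text_in_document():
    # Emulates "//span[@class='text']" in Scrapy: the search starts at the
    # document root, so the current quote plays no role in the result.
    return root.find(".//span[@class='text']").text


def text_in_quote(quote):
    # Emulates ".//span[@class='text']": the search is limited to the
    # descendants of the given quote element.
    return quote.find(".//span[@class='text']").text


absolute_results = [first_text_in_document() for _ in quotes]
relative_results = [text_in_quote(quote) for quote in quotes]

print(absolute_results)  # ['Quote A', 'Quote A']
print(relative_results)  # ['Quote A', 'Quote B']
```

The first list repeats the same quote, which mirrors the output in the question. The second list yields one distinct quote per element, which is what the leading dot achieves in the corrected spider.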