Web scraping: scraping a site with Scrapy
I am using this code to scrape a page, but I can't get "rarity" to scrape. Any help would be appreciated. Everything else seems to work. Also, can anyone tell me what the "[0]" after .extract() on the cardname item line does?

For the rarity field, I suggest:
- Get the text representation of the <ul> that contains the <li class="cardName">.
- Use a regular expression to extract whatever follows "Rarity:".

The spider from the question:
    from scrapy.spider import BaseSpider
    from scrapy.selector import Selector
    from scrapy.exceptions import CloseSpider
    from scrapy.http import Request
    from botg.items import BotgItem

    URL = "http://store.tcgplayer.com/magic/born-of-the-gods?PageNumber=%d"

    class MySpider(BaseSpider):
        name = "tcg"
        allowed_domains = ["tcgplayer.com"]
        start_urls = [URL % 1]

        def __init__(self):
            self.page_number = 1

        def parse(self, response):
            print self.page_number
            print "--------------------BREAK-------------------------"
            sel = Selector(response)
            titles = sel.xpath("//div[@class='magicCard']")
            if not titles:
                raise CloseSpider('No more pages')
            for title in titles:
                item = BotgItem()
                item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]
                item["rarity"] = title.xpath(".//li[@href='/magic/born-of-the-gods']/text()").extract()
                vendor = title.xpath(".//tr[@class='vendor ']")
                item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
                item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
                item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
                item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
                item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
                yield item
            self.page_number += 1
            yield Request(URL % self.page_number)
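
As for your second question, .extract() returns a list of strings, so [0] simply selects the first element of that list.

Can you elaborate on "but I can't get 'rarity' to scrape"? Do you see an error?

extract() returns a list, so [0] returns the first element of extract()'s output.
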
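To make the [0] point concrete, here is a small sketch that uses plain Python lists as stand-ins for what .extract() returns. The card name and the helper first_or_default are made up for illustration and are not part of Scrapy. It also shows why a bare [0] can be fragile: if the XPath matches nothing, the list is empty and indexing raises IndexError. Newer Scrapy versions offer extract_first() and get() on selector lists for this case, but check the docs for the version you are running.

```python
def first_or_default(values, default=None):
    """Return the first element of a list, or `default` if the list is empty.

    Hypothetical helper for illustration; not a Scrapy function.
    """
    return values[0] if values else default


# Stand-ins for what .extract() might return (the card name is invented).
matched = ["Example Card"]  # the XPath matched one text node
unmatched = []              # the XPath matched nothing

# [0] picks the first (here, the only) string out of the list.
print(matched[0])

# With an empty list, the helper falls back to the default instead of failing.
print(repr(first_or_default(unmatched, "")))

# A bare [0] on an empty list raises IndexError.
try:
    unmatched[0]
except IndexError as exc:
    print("IndexError:", exc)
```

Without the [0], the item field holds the whole list (for example a one-element list) rather than a single string, which is what the other fields in the spider above end up with.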
For the rarity field, something like this:

    for title in titles:
        item = BotgItem()
        item["rarity"] = title.xpath('string(.//ul[li[@class="cardName"]])').re(r'Rarity:\s*(\w+)')
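
The regular expression can be tried on its own, without running the spider. The sketch below applies the same pattern with the standard re module to a made-up block of text; the real layout of the <ul> text on the page is an assumption here, and extract_rarity is a hypothetical helper name.

```python
import re

# Same pattern as in the .re() call: capture the word that follows "Rarity:".
RARITY_RE = re.compile(r"Rarity:\s*(\w+)")


def extract_rarity(text):
    """Return every word that follows 'Rarity:' in `text`, as a list."""
    return RARITY_RE.findall(text)


# Invented sample standing in for the string value of the card's <ul>.
sample_text = """
    Example Card
    Set: Born of the Gods
    Rarity: Uncommon
"""

print(extract_rarity(sample_text))
```

Like .extract(), Scrapy's .re() returns a list of strings, so item["rarity"] will be a list; index it with [0] (after checking it is not empty) if a single string is wanted.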