Python Scrapy
I'm trying to dig deeper with Scrapy, but I can only get the titles of the things I'm scraping, not any of the details. Here is the code I have so far:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from tcgplayer1.items import Tcgplayer1Item

    class MySpider(BaseSpider):
        name = "tcg"
        allowed_domains = ["http://www.tcgplayer.com/"]
        start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            titles = hxs.select("//div[@class='magicCard']")
            vendor = hxs.select("//tr[@class='vendor']")
            items = []
            for titles in titles:
                item = Tcgplayer1Item()
                item ["cardname"] = titles.select("//li[@class='cardName']/a/text()").extract()
                item ["price"] = vendor.select("//td[@class='price']/br/text()").extract()
                item ["quantity"] = vendor.select("//td[@class='quantity']/td/text()").extract()
                items.append(item)
            return items

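I can't get the price or quantity to show any results. Each card has several vendors, and each vendor has its own price and quantity. I think that's where my problem lies. Any help would be greatly appreciated.

First, you can change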
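
    item ["price"] = vendor.select("//td[@class='price']/br/text()").extract()
    item ["quantity"] = vendor.select("//td[@class='quantity']/td/text()").extract()

to call select() on titles (the current card) instead of vendor. This ensures that you only get the price and quantity rows for the card you want, provided the expressions are also made relative with a leading dot: a path that starts with // searches the whole document no matter which selector it is called on.

You may also need to remove /br and /td from the selectors, so that your code looks like this:

    item ["price"] = titles.select(".//td[@class='price']/text()").extract()
    item ["quantity"] = titles.select(".//td[@class='quantity']/text()").extract()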
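
To see why the search scope matters, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. The markup, the id attributes and the helper name prices_under are made up for illustration; the real page's structure may well differ.

```python
import xml.etree.ElementTree as ET

# Made-up markup: two cards, each with its own vendor prices.
PAGE = """<body>
  <div class="magicCard" id="a">
    <table><tr><td class="price">$1.00</td></tr>
           <tr><td class="price">$1.25</td></tr></table>
  </div>
  <div class="magicCard" id="b">
    <table><tr><td class="price">$7.50</td></tr></table>
  </div>
</body>"""


def prices_under(node):
    """Return the text of every price cell found below the given node."""
    return [td.text for td in node.findall(".//td[@class='price']")]


root = ET.fromstring(PAGE)
cards = root.findall(".//div[@class='magicCard']")

# Searching from the document root mixes the prices of all cards together.
all_prices = prices_under(root)

# Searching from each card keeps the prices grouped per card.
per_card = {card.get("id"): prices_under(card) for card in cards}

print(all_prices)
print(per_card)
```

In Scrapy, the first case is what an expression starting with // does even when it is called on a card's selector; the second case corresponds to an expression starting with .// instead.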

First of all, here is a fixed version of the code:

    from scrapy.spider import BaseSpider
    from scrapy.selector import Selector
    from tcgplayer1.items import Tcgplayer1Item

    class MySpider(BaseSpider):
        name = "tcg"
        # allowed_domains takes bare domain names, not URLs
        allowed_domains = ["tcgplayer.com"]
        start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

        def parse(self, response):
            hxs = Selector(response)
            titles = hxs.xpath("//div[@class='magicCard']")
            for title in titles:
                item = Tcgplayer1Item()
                item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]

                # the class attribute on the page has a trailing space
                vendor = title.xpath(".//tr[@class='vendor ']")
                # one list entry per vendor row of this card
                item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
                item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
                yield item

There were multiple issues with the code:
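
- the class name needs to contain a trailing space: "vendor " (this one was hard to find)
- there are multiple vendors per item, so you need to define vendor inside the loop
- you are redefining the titles variable in the loop
- XPath expressions in the loop should be relative (.//)
- use Selector instead of the deprecated HtmlXPathSelector
- use xpath() instead of the deprecated select()
- use normalize-space() to trim and collapse the whitespace in the extracted price and quantity text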
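
Two of the points above, the trailing space in the class name and the whitespace cleanup, can be sketched without Scrapy. The example below uses the standard library's xml.etree.ElementTree as a stand-in; the markup and the helpers normalize_space and has_class are hypothetical and only roughly mimic what the XPath versions do.

```python
import xml.etree.ElementTree as ET

# Made-up vendor rows; note the trailing space inside class="vendor ".
ROWS = """<table>
  <tr class="vendor "><td class="price">
      $1.00
  </td><td class="quantity"> 3 </td></tr>
  <tr class="vendor "><td class="price"> $1.25 </td><td class="quantity"> 1 </td></tr>
</table>"""


def normalize_space(text):
    """Trim the ends and collapse inner whitespace runs, roughly like XPath normalize-space()."""
    return " ".join((text or "").split())


def has_class(elem, name):
    """Token-based class test that ignores stray spaces in the attribute."""
    return name in (elem.get("class") or "").split()


table = ET.fromstring(ROWS)

# An exact comparison without the trailing space matches nothing...
exact_no_space = table.findall(".//tr[@class='vendor']")
# ...while including the space, or testing class tokens, finds both rows.
exact_with_space = table.findall(".//tr[@class='vendor ']")
by_token = [tr for tr in table.iter("tr") if has_class(tr, "vendor")]

# One cleaned (price, quantity) pair per vendor row.
offers = [
    (normalize_space(tr.findtext("td[@class='price']")),
     normalize_space(tr.findtext("td[@class='quantity']")))
    for tr in by_token
]

print(len(exact_no_space), len(exact_with_space), offers)
```

If matching on the exact attribute string feels too brittle, a token-based test in XPath 1.0 is usually written as contains(concat(' ', normalize-space(@class), ' '), ' vendor '), which should keep working whether or not the page emits the trailing space.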