How do I create a "nested loop" going back into the original list in Python and Scrapy?
EDIT: Well, all I have done today is try to figure this out, and unfortunately I still haven't. What I have now is:
import scrapy
from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        yield scrapy.Request(response.url, callback=self.primary_parse)
        yield scrapy.Request(response.url, callback=self.secondary_parse)

    def primary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        itemlist = []
        product = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        price = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()
        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

    def secondary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        itemlist = []
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()
        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist
The problem is that I can't seem to get the second parse to run... I only ever get one parse.
Is there any way to run both parses, whether simultaneously or one after the other?
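For what it's worth, one likely culprit (an assumption, since no crawl log is shown): Scrapy's scheduler filters duplicate requests, so a second `scrapy.Request` for the exact same URL is silently dropped unless it is created with `dont_filter=True`. A toy model of that filter, with a hypothetical URL:

```python
# Toy model of Scrapy's duplicate-request filter: the scheduler
# remembers every URL it has accepted and drops repeats, unless
# the request was created with dont_filter=True.
seen = set()

def schedule(url, dont_filter=False):
    """Return True if the request would reach its callback."""
    if not dont_filter and url in seen:
        return False  # duplicate: the callback never fires
    seen.add(url)
    return True

page = "http://www.georges.com.au/page"       # illustrative URL
first = schedule(page)                        # accepted: primary_parse would run
second = schedule(page)                       # filtered: secondary_parse never runs
forced = schedule(page, dont_filter=True)     # accepted despite the repeat
```

If that is what is happening, passing `dont_filter=True` to the second request would let both callbacks fire.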
ORIGINAL: I'm slowly getting the hang of this (Python and Scrapy), but I've hit a wall. What I'm trying to do is this: there is a photography retail site that lists its products like so:
Name of Camera Body
Price
With Such and Such Lens
Price
With Another Such and Such Lens
Price
What I want to do is grab that information and organize it into a list like the following (which I can then easily output to a csv file):
My current spider code:
import scrapy
from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()
        subproduct = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        subprice = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()
        itemlist = []
        for product, price, subproduct, subprice in zip(product, price, subproduct, subprice):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            item['product'] = product + " " + subproduct.strip().upper()
            item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist
This doesn't do what I want, and I'm not sure what to try next. I tried putting a for loop inside the for loop, but that didn't work; it just output mixed-up results.
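My reading of why the results come out mixed up: the four XPath queries each return one flat list for the whole page, so zipping them loses track of which sub-option belongs to which camera body. The pairing has to happen per listing item, with the sub-option loop nested inside the per-product loop. A minimal sketch with entirely hypothetical data:

```python
# Hypothetical scraped data: one dict per "listing-item" div,
# each holding a primary product plus its own sub-options.
listings = [
    {"product": "Canon 700D Body", "price": "$1,000.00",
     "subs": [("With 18-55mm Lens", "$1,400.00")]},
    {"product": "Canon 70D Body", "price": "$2,000.00",
     "subs": [("With 18-135mm Lens", "$2,500.00")]},
]

rows = []
for entry in listings:                 # outer loop: one pass per listing item
    rows.append((entry["product"].upper(), entry["price"]))
    for name, price in entry["subs"]:  # inner loop: only THIS item's sub-options
        rows.append((entry["product"].upper() + " " + name.upper(), price))
```

Because the inner loop only ever sees the current entry's sub-options, each combined row stays attached to the right camera body.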
FYI, here is my items.py:
import scrapy

class ArcherItemGeorges(scrapy.Item):
    product = scrapy.Field()
    price = scrapy.Field()
    subproduct = scrapy.Field()
    subprice = scrapy.Field()
Any help would be appreciated. I'm trying my best to learn, but being new to Python, I feel I need some guidance.
First, are you sure you are setting up your item correctly?
item = ArcherItemGeorges()
item['product'] = product.strip().upper()
item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
# Should these be 'subproduct' and 'subprice' ?
item['product'] = product + " " + subproduct.strip().upper()
item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
itemlist.append(item)
Second, you might consider using a helper function for tasks you perform repeatedly. It looks a little cleaner:
def getDollars(price):
    return price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')

# ...
item['price'] = getDollars(price)
item['subprice'] = getDollars(subprice)
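As a quick sanity check, here is that helper in self-contained form against a couple of hypothetical price strings (note the `replace('.00', '')` step assumes prices always end in whole dollars):

```python
def getDollars(price):
    # Strip whitespace, then the $, the thousands separator, and trailing cents
    return price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')

print(getDollars('  $1,299.00 '))   # -> 1299
print(getDollars('$2,500.00'))      # -> 2500
```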
As your intuition says, the structure of the elements you are scraping calls for a loop within a loop. Rearranging your code a little, you can get a list with all the subproducts of each product. I have renamed `requests` to `product` and introduced the `subproduct` variable to make things clearer. I guess the `subproduct` loop is probably what you were trying to figure out.
def parse(self, response):
    # Loop all the product elements
    for product in response.xpath('//div[@class="listing-item"]'):
        item = ArcherItemGeorges()
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        item['product'] = product_name
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the raw primary item
        yield item
        # Yield the primary item with its secondary items
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            yield item
Of course, you still need to apply the uppercasing, price cleaning, and so on to the corresponding fields.
A brief explanation: once the page is downloaded, the `parse` method gets called with the `response` object (the HTML page). From that response we have to extract/scrape the data as items. In this case we want to return a list of product-price items, and this is where the magic of the `yield` expression comes into play. You can think of it as an on-demand `return` that does not finish the function's execution, also known as a generator. Scrapy calls the `parse` generator until it has no more items to yield, and hence no more items to scrape in the response.
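That on-demand behaviour can be seen in plain Python, independent of Scrapy (made-up items for illustration):

```python
def parse_demo():
    # Each yield hands one item to the caller and pauses here;
    # execution resumes from this exact point when the next item is requested.
    yield {"product": "CAMERA BODY", "price": "1000"}
    yield {"product": "CAMERA BODY WITH LENS", "price": "1400"}

gen = parse_demo()    # calling it runs no body code yet, it just builds a generator
items = list(gen)     # draining it resumes parse_demo until it is exhausted
```

Scrapy drains the `parse` generator in the same way `list()` drains `parse_demo` here.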
Commented code:
def parse(self, response):
    # Loop all the product elements, those div elements with a "listing-item" class
    for product in response.xpath('//div[@class="listing-item"]'):
        # Create an empty item container
        item = ArcherItemGeorges()
        # Scrape the primary product name and keep it in a variable for later use
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        # Fill the 'product' field with the product name
        item['product'] = product_name
        # Fill the 'price' field with the scraped primary product price
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the primary product item, that is, the primary name and price
        yield item
        # Now, for each product, we need to loop through all the subproducts
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Prepare a new item with the subproduct appended to the previously
            # stored product_name, that is, product + subproduct
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            # And set the item price field to the subproduct price
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            # Now yield the composed product + subproduct item
            yield item
Thanks for the help so far, but at the moment it doesn't quite work as planned. I changed my code to reflect yours, and even tinkered with it before reverting to the original, since it wasn't outputting what I needed. The result is that the output picks up what I believe are all the secondary items and skips the primary ones; the secondary items share the same name as the primary. My guess is that it isn't yielding the primary items? How do I do that?
I think I've got what you want now. See the updated code. By the way, your `items` class is confusing, since it contains four fields and two of them, `subproduct` and `subprice`, are never used.
Thank you so much! Could you explain why this works?
I tried to briefly comment the solution. Hope it helps.
Thanks for the tip; I'll use helper functions when I go back and clean up all the spiders I'm working on.