How do I create a "nested loop" returning an item to the original loop in Python and Scrapy?


EDIT:

Ok, so what I have been doing today is trying to figure this out, but unfortunately I still haven't managed it. What I have now is:

import scrapy

from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        yield scrapy.Request(response.url, callback = self.primary_parse)
        yield scrapy.Request(response.url, callback = self.secondary_parse)

    def primary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        itemlist = []
        product = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        price = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()

        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

    def secondary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        itemlist = []
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()

        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist
The problem is, I can't seem to get the second parse going... I can only get one parse to run.

Is there any way to run both parses, either at the same time or one after the other?


ORIGINAL:

I'm slowly getting the hang of this (Python and Scrapy), but I've hit a wall. What I'm trying to do is the following:

There is a photography retail site, and it lists its products like so:

Name of Camera Body
Price

    With Such and Such Lens
    Price

    With Another Such and Such Lens
    Price
What I would like to do is grab that info and organize it in a list like the following (which I can then easily output to a csv file):

My current spider code:

import scrapy

from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()       
        subproduct = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        subprice = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()

        itemlist = []
        for product, price, subproduct, subprice in zip(product, price, subproduct, subprice):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            item['product'] = product + " " + subproduct.strip().upper()
            item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist
This doesn't do what I want, and I'm stuck on what to try next. I tried a for loop inside the for loop, but that didn't work; it just output mixed-up results.

FYI, my items.py:

import scrapy

class ArcherItemGeorges(scrapy.Item):
    product = scrapy.Field()
    price = scrapy.Field()
    subproduct = scrapy.Field()
    subprice = scrapy.Field()

Any help would be appreciated. I'm doing my best to learn Python, but being new to it, I feel I need some guidance.

First of all, are you sure your setup, or your items, are correct?

item = ArcherItemGeorges()
item['product'] = product.strip().upper()
item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
# Should these be 'subproduct' and 'subprice' ? 
item['product'] = product + " " + subproduct.strip().upper()
item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
itemlist.append(item)
Second, you could consider using a helper function for the tasks you perform often. It looks a bit cleaner:

def getDollars( price ): 
    return price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')

# ... 
item['price'] = getDollars( price ) 
item['subprice'] = getDollars( subprice )

As your intuition says, the structure of the elements you are scraping calls for a loop inside a loop. Rearranging your code a little, you can get a list with all the product-subproduct pairs.

I have renamed `requests` to `product` and introduced the `subproduct` variable to make things clearer. I guess the `subproduct` loop is probably what you were trying to figure out:

def parse(self, response):
    # Loop all the product elements
    for product in response.xpath('//div[@class="listing-item"]'):
        item = ArcherItemGeorges()
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        item['product'] = product_name
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the raw primary item
        yield item
        # Yield the primary item with its secondary items
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            yield item
Of course, you will need to apply the uppercasing, price cleaning, etc. to the corresponding fields.
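As a sketch of that cleanup, the repeated `strip`/`replace` chains could be pulled into small helpers (the names here are just illustrative, not from the original spider):

```python
def clean_price(price):
    """Strip the currency symbol, thousands separators, cents and spaces,
    e.g. ' $1,999.00 ' -> '1999'."""
    return (price.strip()
                 .replace('$', '')
                 .replace(',', '')
                 .replace('.00', '')
                 .replace(' ', ''))

def clean_name(name):
    """Normalize a product name: trimmed and uppercased."""
    return name.strip().upper()

# Inside the loops the fields would then be set as, for example:
# item['product'] = clean_name(product_name)
# item['price'] = clean_price(raw_price)
```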

A brief explanation:

Once the page is downloaded, the `parse` method is called with a `response` object (the HTML page). From that `response` we have to extract/scrape the data in the form of `item`s. In this case, we want to return a list of product-price items. Here is where the magic of the `yield` expression comes into play. You can think of it as an on-demand `return` that does not finish the function's execution; this is also known as a generator. Scrapy will call the `parse` generator until it has no more `item`s to yield, and therefore no more things to scrape in the `response`.
Commented code:

def parse(self, response):
    # Loop all the product elements, those div elements with a "listing-item" class
    for product in response.xpath('//div[@class="listing-item"]'):
        # Create an empty item container
        item = ArcherItemGeorges()
        # Scrape the primary product name and keep in a variable for later use
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        # Fill the 'product' field with the product name
        item['product'] = product_name
        # Fill the 'price' field with the scraped primary product price
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the primary product item. That with the primary name and price
        yield item
        # Now, for each product, we need to loop through all the subproducts
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Let's prepare a new item with the subproduct appended to the previous
            # stored product_name, that is, product + subproduct.
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            # And set the item price field with the subproduct price
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            # Now yield the composed product + subproduct item.
            yield item

Thanks for the help so far, but at the moment it's not quite working as planned. I changed my code to reflect yours, and even tweaked it before reverting to the original, because it wasn't outputting what I needed. The result is that the output picks up what I believe are all the secondary items and skips the primary ones; the secondary items share the same name as the primary items. My guess is that it isn't yielding the primary items? How do I do that?

I think I've got what you want now. See the updated code. By the way, your `items` class is confusing, since it contains four fields, and two of them, `subproduct` and `subprice`, are never used.

Thank you so much! Can you explain why this works?

I've tried to comment the solution briefly. Hope it helps.

Thanks for the tip; I'll use helper functions when I go back to clean up all the spiders I'm working on.