How do I create a "nested loop" returning an item to the original loop in Python and Scrapy?


EDIT:

Ok, so what I have been doing today is trying to figure this out, but unfortunately I still haven't managed it. What I have now is:

import scrapy

from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        yield scrapy.Request(response.url, callback = self.primary_parse)
        yield scrapy.Request(response.url, callback = self.secondary_parse)

    def primary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        itemlist = []
        product = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        price = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()

        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

    def secondary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        itemlist = []
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()

        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist
The problem is, I can't seem to get the second parse going... I can only get one parse to run.

Is there any way to run both parses, either at the same time or one after the other?


ORIGINAL:

I'm slowly getting the hang of this (Python and Scrapy), but I've hit a wall. What I'm trying to do is the following:

There is a photography retail site, and it lists its products like so:

Name of Camera Body
Price

    With Such and Such Lens
    Price

    With Another Such and Such Lens
    Price
What I would like to do is grab that info and organize it in a list like the following (which I can then easily output to a csv file):

My current spider code:

import scrapy

from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()       
        subproduct = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        subprice = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()

        itemlist = []
        for product, price, subproduct, subprice in zip(product, price, subproduct, subprice):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            item['product'] = product + " " + subproduct.strip().upper()
            item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist
This doesn't do what I want, and I'm stuck on what to try next. I tried a for loop inside the for loop, but that didn't work; it just output mixed-up results.

FYI, my items.py:

import scrapy

class ArcherItemGeorges(scrapy.Item):
    product = scrapy.Field()
    price = scrapy.Field()
    subproduct = scrapy.Field()
    subprice = scrapy.Field()

Any help would be appreciated. I'm doing my best to learn Python, but being new to it, I feel I need some guidance.

First of all, are you sure your setup, or your items, are correct?

item = ArcherItemGeorges()
item['product'] = product.strip().upper()
item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
# Should these be 'subproduct' and 'subprice' ? 
item['product'] = product + " " + subproduct.strip().upper()
item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
itemlist.append(item)
Second, you could consider using a helper function for the tasks you perform often. It looks a bit cleaner:

def getDollars( price ): 
    return price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')

# ... 
item['price'] = getDollars( price ) 
item['subprice'] = getDollars( subprice )

As your intuition says, the structure of the elements you are scraping calls for a loop inside a loop. Rearranging your code a little, you can get a list with all the product-subproduct pairs.

I have renamed `requests` to `product` and introduced the `subproduct` variable to make things clearer. I guess the `subproduct` loop is probably what you were trying to figure out:

def parse(self, response):
    # Loop all the product elements
    for product in response.xpath('//div[@class="listing-item"]'):
        item = ArcherItemGeorges()
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        item['product'] = product_name
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the raw primary item
        yield item
        # Yield the primary item with its secondary items
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            yield item
Of course, you will need to apply the uppercasing, price cleaning, etc. to the corresponding fields.
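As a sketch of that cleanup, the repeated `strip`/`replace` chains could be pulled into small helpers (the names here are just illustrative, not from the original spider):

```python
def clean_price(price):
    """Strip the currency symbol, thousands separators, cents and spaces,
    e.g. ' $1,999.00 ' -> '1999'."""
    return (price.strip()
                 .replace('$', '')
                 .replace(',', '')
                 .replace('.00', '')
                 .replace(' ', ''))

def clean_name(name):
    """Normalize a product name: trimmed and uppercased."""
    return name.strip().upper()

# Inside the loops the fields would then be set as, for example:
# item['product'] = clean_name(product_name)
# item['price'] = clean_price(raw_price)
```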

A brief explanation:

Once the page is downloaded, the `parse` method is called with a `response` object (the HTML page). From that `response` we have to extract/scrape the data in the form of `item`s. In this case, we want to return a list of product-price items. Here is where the magic of the `yield` expression comes into play. You can think of it as an on-demand `return` that does not finish the function's execution; this is also known as a generator. Scrapy will call the `parse` generator until it has no more `item`s to yield, and therefore no more things to scrape in the `response`.
Commented code:

def parse(self, response):
    # Loop all the product elements, those div elements with a "listing-item" class
    for product in response.xpath('//div[@class="listing-item"]'):
        # Create an empty item container
        item = ArcherItemGeorges()
        # Scrape the primary product name and keep in a variable for later use
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        # Fill the 'product' field with the product name
        item['product'] = product_name
        # Fill the 'price' field with the scraped primary product price
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the primary product item. That with the primary name and price
        yield item
        # Now, for each product, we need to loop through all the subproducts
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Let's prepare a new item with the subproduct appended to the previous
            # stored product_name, that is, product + subproduct.
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            # And set the item price field with the subproduct price
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            # Now yield the composed product + subproduct item.
            yield item

Thanks for the help so far, but at the moment it's not quite working as planned. I changed my code to reflect yours, and even tweaked it before reverting to the original, because it wasn't outputting what I needed. The result is that the output picks up what I believe are all the secondary items and skips the primary ones; the secondary items share the same name as the primary items. My guess is that it isn't yielding the primary items? How do I do that?

I think I've got what you want now. See the updated code. By the way, your `items` class is confusing, since it contains four fields, and two of them, `subproduct` and `subprice`, are never used.

Thank you so much! Can you explain why this works?

I've tried to comment the solution briefly. Hope it helps.

Thanks for the tip; I'll use helper functions when I go back to clean up all the spiders I'm working on.