Python Scrapy: parse a list of URLs, then open each one and parse additional data


python parsing web-scraping scrapy


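I am trying to parse a website, an e-shop. I parse a page whose products are loaded via AJAX, collect the URLs of those products, and then follow each URL to parse additional information about the product.

My script gets the list of the first four items on the page and their URLs, makes a request, and parses the additional information, but it never returns to the loop, so the spider closes.

Can somebody help me solve this? I am fairly new to this kind of thing and only ask here when I am completely stuck.

Here is my code: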

from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http.request import Request
from scrapy_sokos.items import SokosItem


class SokosSpider(Spider):
    name = "sokos"
    allowed_domains = ["sokos.fi"]
    base_url = "http://www.sokos.fi/fi/SearchDisplay?searchTermScope=&searchType=&filterTerm=&orderBy=8&maxPrice=&showResultsPage=true&beginIndex=%s&langId=-11&sType=SimpleSearch&metaData=&pageSize=4&manufacturer=&resultCatEntryType=&catalogId=10051&pageView=image&searchTerm=&minPrice=&urlLangId=-11&categoryId=295401&storeId=10151"
    start_urls = [
        "http://www.sokos.fi/fi/SearchDisplay?searchTermScope=&searchType=&filterTerm=&orderBy=8&maxPrice=&showResultsPage=true&beginIndex=0&langId=-11&sType=SimpleSearch&metaData=&pageSize=4&manufacturer=&resultCatEntryType=&catalogId=10051&pageView=image&searchTerm=&minPrice=&urlLangId=-11&categoryId=295401&storeId=10151",
    ]

    for i in range(0, 8, 4):
        start_urls.append((base_url) % str(i))


    def parse(self, response):
        products = Selector(response).xpath('//div[@class="product-listing product-grid"]/article[@class="product product-thumbnail"]')
        for product in products:
            item = SokosItem()
            item['url'] = product.xpath('//div[@class="content"]/a[@class="image"]/@href').extract()[0]

            yield Request(url = item['url'], meta = {'item': item}, callback=self.parse_additional_info) 

    def parse_additional_info(self, response):
        item = response.meta['item']
        item['name'] = Selector(response).xpath('//h1[@class="productTitle"]/text()').extract()[0].strip()
        item['description'] = Selector(response).xpath('//div[@id="kuvaus"]/p/text()').extract()[0]
        euro = Selector(response).xpath('//strong[@class="special-price"]/span[@class="euros"]/text()').extract()[0]
        cent = Selector(response).xpath('//strong[@class="special-price"]/span[@class="cents"]/text()').extract()[0]
        item['price'] = '.'.join(euro + cent)
        item['number'] = Selector(response).xpath('//@data-productid').extract()[0]
        yield item

The AJAX requests you are simulating are being caught by Scrapy's duplicate URL filter.

Set dont_filter to True when yielding the Request:
yield Request(url=item['url'], 
              meta={'item': item},    
              callback=self.parse_additional_info, 
              dont_filter=True)
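
A possible reason the filter fires at all: in `parse`, the XPath passed to `product.xpath(...)` starts with `//`, which Scrapy selectors evaluate against the whole document rather than the current `product`, so every product would resolve to the same first link. The sketch below imitates that with the standard library only (ElementTree instead of Scrapy selectors, a plain set instead of Scrapy's real duplicate filter). The markup, the example.com URLs, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup, shaped like the selectors in the question.
LISTING = """
<div class="product-listing product-grid">
  <article class="product product-thumbnail">
    <div class="content"><a class="image" href="http://example.com/p/1">one</a></div>
  </article>
  <article class="product product-thumbnail">
    <div class="content"><a class="image" href="http://example.com/p/2">two</a></div>
  </article>
  <article class="product product-thumbnail">
    <div class="content"><a class="image" href="http://example.com/p/3">three</a></div>
  </article>
  <article class="product product-thumbnail">
    <div class="content"><a class="image" href="http://example.com/p/4">four</a></div>
  </article>
</div>
""".strip()


def hrefs_document_wide(root):
    """Imitate a leading `//`: every product searches the whole document."""
    products = root.findall("./article")
    return [root.find(".//a[@class='image']").get("href") for _ in products]


def hrefs_relative(root):
    """Imitate a leading `.//`: each product searches only its own subtree."""
    products = root.findall("./article")
    return [p.find(".//a[@class='image']").get("href") for p in products]


def schedule(urls, dont_filter=False):
    """Toy stand-in for a duplicate filter: drop URLs that were already seen."""
    seen, scheduled = set(), []
    for url in urls:
        if not dont_filter and url in seen:
            continue  # treated as a duplicate, never requested
        seen.add(url)
        scheduled.append(url)
    return scheduled


root = ET.fromstring(LISTING)
print(schedule(hrefs_document_wide(root)))                    # one URL survives
print(schedule(hrefs_relative(root)))                         # four distinct URLs
print(schedule(hrefs_document_wide(root), dont_filter=True))  # same URL four times
```

If this is what is happening in the spider, `dont_filter=True` on its own would fetch the same product page four times. Starting the expression with `.//` inside the loop may be the more direct fix.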

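But I still can't get it to work properly. When I try it in the shell, it first returns a list of URLs (four links). Then it should substitute each link into the request and parse the additional data from each page. I tried Request(url=item['url'][0]) and so on, but nothing changed.

@deniskrishna Well, it definitely makes a difference. It works for me, meaning I now see multiple URLs being requested for the additional info. There are also some exceptions, but those are not part of the question.

Thanks.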
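
The exceptions mentioned in the last comment are not shown, so their cause is unknown. Two spots in the posted `parse_additional_info` look like candidates, though: `.extract()[0]` raises `IndexError` whenever an XPath matches nothing (for example, a product page without a `special-price` element), and `'.'.join(euro + cent)` joins the characters of the concatenated string rather than the two parts. A minimal sketch, with plain Python lists standing in for `.extract()` results; `first_text` and `build_price` are hypothetical helper names, and the values are invented:

```python
def first_text(values, default=None):
    """Return the first extracted string, stripped, or `default` if nothing matched."""
    return values[0].strip() if values else default


def build_price(euros, cents):
    """Join the two price parts with a dot; return None if either part is missing."""
    if euros is None or cents is None:
        return None
    return ".".join([euros, cents])


# Hypothetical extraction results, shaped like what .extract() returns (lists of strings).
euros = first_text(["12"])
cents = first_text(["95"])

print(".".join(euros + cents))             # joins single characters: 1.2.9.5
print(build_price(euros, cents))           # joins the two parts: 12.95
print(build_price(first_text([]), cents))  # nothing matched: None
```

Returning `None` for a missing field is only one option; skipping the item or logging the URL would work as well, depending on what the scraped data is used for.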