Python: I need help scraping an aspx website

python asp.net web-scraping scrapy web-crawler


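I am trying to scrape the main product information (name, price and image URL) from the different categories of a supermarket site, but I am stuck because I cannot reach the category URLs directly: I am always redirected to the home page.

The page I want to scrape is: (this is the home page), but I want to reach the pages for the different subcategories of the "Bebidas" category. The subcategory URLs look like this:

Only the id changes, but when I run the spider on a subcategory URL, I get the home page as the response. Sorry if I am not being clear enough; any help would be greatly appreciated.

This is my spider: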

from scrapy.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import Selector
from ..items import ProductoGenericoItem


class VeaSpider(CrawlSpider):
    name = "vea"

    pos = 1
    base_url = "https://www.veadigital.com.ar/Comprar/Home.aspx#_atCategory=false&_atGrilla=true&_id={0}"
    c = 0

    cat = [
        141446, # a base de hierbas
        446126, # aguas sin gas
        446127, # aguas con gas
        446128, # aguas saborizadas
        141231, # aperitivos
        141236, # gaseosas cola
    ]

    start_urls = [
        base_url.format(cat[c])
    ]

    def parse(self, response):
        product_info = response.xpath("//li[@class='grilla-producto-container full-layout']").getall()
        for p in product_info:
            # Create a fresh item per product so yielded items do not share state.
            item = ProductoGenericoItem()
            sel = Selector(text=p)

            item['repetido'] = False
            item['superMercado'] = 'Vea Argentina'
            item['sucursal'] = 'NO'
            item['marca'] = ''
            item['empresa'] = ''
            item['ean'] = ''
            item['sku'] = ''
            item['idArticulo'] = ''
            item['nombre'] = sel.xpath(
                "normalize-space(/html/body/li/div[2]/div/div[2]/div/div//text())"
            ).get()
            item['descripcion'] = ''
            precio = sel.xpath(
                "normalize-space(/html/body/li/div[2]/div/div[2]/div/div[2]/text())"
            ).get()
            centavos = sel.xpath(
                "normalize-space(/html/body/li/div[2]/div/div[2]/div/div[2]/span/text())"
            ).get()
            item['precio'] = precio + ',' + centavos
            item['precioPromocional'] = ''
            item['condicion'] = ''
            item['precioPorMedida'] = sel.xpath(
                "normalize-space(/html/body/li/div[2]/div/div[2]/div/div[3]/text())"
            ).get()
            item['stock'] = ''
            item['categoria'] = 'Bebidas'
            # .get() extracts the string; without it a SelectorList is stored.
            item['subcategoria'] = response.xpath(
                "normalize-space(//div[@class='category-breadcrumbs']/a//text())"
            ).get()
            item['segmento'] = response.xpath(
                "normalize-space(//span[@class='selected']//text())"
            ).get()
            item['imagen'] = sel.xpath(
                "/html/body/li/div[2]/div/div/img[1]/@src"
            ).get()
            item['promocion'] = sel.xpath(
                "normalize-space(/html/body/li/div/div/p)"
            ).get()
            # if 'Oferta' in item['promocion']:
            #     item['precioPromocional'] = item['promocion'].replace('Oferta', '')
            if item['segmento'] != '':
                item['posicionSegmento'] = self.pos
            else:
                item['posicionSubcategoria'] = self.pos

            self.pos += 1

            yield item


        if self.c < len(self.cat) - 1:
            self.c += 1
            self.pos = 1
            yield Request(
                self.base_url.format(self.cat[self.c]),
                callback=self.parse,
            )
        else:
            print('finished')


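Well, you are assuming that the page uses only the URL parameters. Session state and internal code may still exist in the code-behind. The session can hold the referring URL or the page it was called from.

I often have to redirect away from a page because, although I may have some parameters, there are still session variables that must be set and earlier setup code that must run. So what if the incoming request is missing those internal session values? Then I redirect, because I need the previous page's setup code to run to load the information and the required values.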
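
If the redirect is caused by missing session state, one thing to try is loading the home page first so the server can set its session cookie, and only then requesting the inner page with the same cookie jar. The sketch below uses only the standard library. The helper names are hypothetical, and whether this particular site keys its redirect on a session cookie is an assumption.

```python
import http.cookiejar
import urllib.request

# Home page URL taken from the question's base_url, without the fragment.
HOME_URL = "https://www.veadigital.com.ar/Comprar/Home.aspx"


def make_session_opener():
    """Return an opener and the cookie jar it stores cookies in."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_after_warm_up(opener, target_url, home_url=HOME_URL):
    """Visit the home page first, then fetch target_url with the same cookies."""
    # The first request lets the server set whatever session cookies it uses.
    opener.open(home_url, timeout=30).close()
    with opener.open(target_url, timeout=30) as response:
        return response.read()
```

Scrapy keeps cookies across requests by default, so in a spider the rough equivalent is to start from the home page and yield the request for the inner page from that callback.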

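In a way, this is not much different from desktop code. You might be on a customer page and click Add Invoice. Code then runs to fetch and set things such as the invoice payment terms and plenty of other values, and only afterwards launches the actual form for entering the invoice. That style of code carries over to asp.net.

Then there is the simple matter of the referring URL. I have a user comment feedback page. It is one of the few places on the site that lets users who are not logged in enter content, but some spam bots were abusing it. (And they do not have to log in to use the feedback page.)

So now the feedback page's code-behind checks the referring URL (the URL of the page that launched it). If the referring URL is not from my site, I redirect back to the home page. From the user's point of view, the URL they typed in simply appears not to work. So, often for security reasons, we check the referring URL, and if the page was not launched from within the site, we know it and reject the request.

That means many of my URLs only work when they are launched from the site. What if you type the URL in directly, or request it from a web scraper? Then the referring URL is no longer from my site.

So I redirect to the previous page, to make sure that all the setup code and values are correct before actually entering the page in question.

I mean, to display an item page, the user has to search and then find the item. Clicking that item's row sets up a lot of things, then jumps to the item view page, and only then is the item displayed.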
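
If a referrer check like the one described here is what triggers the redirect, a scraper can send a `Referer` header that points at the site's own home page. The following is a minimal sketch; `request_kwargs_with_referer` is a hypothetical helper, and it is only an assumption that this site checks the referrer at all.

```python
# Home page URL taken from the question's base_url, without the fragment.
HOME_URL = "https://www.veadigital.com.ar/Comprar/Home.aspx"


def request_kwargs_with_referer(url, referer=HOME_URL):
    """Build keyword arguments for a request that claims to come from `referer`."""
    return {
        "url": url,
        "headers": {"Referer": referer},
    }


# In a Scrapy callback this could be used as (hypothetical usage):
#     yield scrapy.Request(callback=self.parse, **request_kwargs_with_referer(url))
```

Scrapy normally fills in `Referer` on its own for requests yielded from a callback, so setting it explicitly matters mostly for the very first request of a crawl.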
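
One more thing worth checking, independent of the session and referrer checks above: in the question's URLs the category id comes after a `#`. That part is a URL fragment, which HTTP clients do not send to the server, so every category URL in the spider results in the same request for Home.aspx. The snippet below shows the split. That the site then loads the category grid through a separate JavaScript request is an assumption that would need to be confirmed in the browser's network tab.

```python
from urllib.parse import urldefrag

# First category URL from the question's spider.
url = "https://www.veadigital.com.ar/Comprar/Home.aspx#_atCategory=false&_atGrilla=true&_id=141446"

# Split the URL into the part sent to the server and the client-side fragment.
sent_to_server, fragment = urldefrag(url)

print(sent_to_server)  # https://www.veadigital.com.ar/Comprar/Home.aspx
print(fragment)        # _atCategory=false&_atGrilla=true&_id=141446
```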
