Parameters: storing weight updates for momentum (PyTorch)


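I am trying to add momentum to my own SGD implementation. As I understand it, the update looks like this:

parameters -= (lr * (p.grad*0.1 + p_delta_prev*0.9))

My question is how I should store the previous delta from each update.

Here is what I have in my update function: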

#we now want to do the update with momentum
#momentum takes derivative, multiplies it by 0.1, then takes the previous update,
#multiplies it by 0.9 and we add the two together
#alpha = 0.1, beta = 0.9;  p-=grad*0.1 + p*0.9
def update(x,y,lr):
    wd = 1e-5
    y_hat = model(x)
    # weight decay
    w2 = 0.
    for p in model.parameters(): w2 += (p**2).sum()
    # add to regular loss
    loss = loss_func(y_hat, y) + w2*wd
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            #p.grad is the slope of the line of that parameter
            #current_p-previous_p to get difference
            p_update = (lr * (p.grad*0.1 + p*0.9))
            p.sub_(p_update)
            p.grad.zero_()
    return loss.item()

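Here,

p*0.9

should be replaced with p_delta_prev. But how should I store these deltas for each parameter? If I save them to a tensor, I am effectively keeping a copy of the weight deltas in memory, which makes my model twice the size. What is a good way to do this? I do not want to use a built-in function that does it for me. I did look at PyTorch's sgd.py, and it looks like it stores the states.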

I have updated the code:

#we now want to do the update with momentum
#momentum takes derivative, multiplies it by 0.1, then takes the previous update,
#multiplies it by 0.9 and we add the two together
#alpha = 0.1, beta = 0.9;  p-=grad*0.1 + p*0.9
p_delta = {}
def update(x,y,lr):
    wd = 1e-5
    y_hat = model(x)
    # weight decay
    w2 = 0.
    for p in model.parameters(): w2 += (p**2).sum()
    # add to regular loss
    loss = loss_func(y_hat, y) + w2*wd
    loss.backward()
    with torch.no_grad():
        i = 0
        for p in model.parameters():
            #p.grad is the slope of the line of that parameter
            if i not in p_delta:#check if key exists
                p_delta[i] = torch.zeros_like(p)
            p_update = (lr *p.grad) + (p_delta[i]*0.9)
            p_delta[i] = p_update.clone()
            p.sub_(p_update)
            p.grad.zero_()
            print((p_delta[i]))
            i+=1
    return loss.item()

I think the code in the Excel spreadsheet is incorrect. Jeremy seems to show:

lr * ((p.grad*0.1) + (p_delta[i]*0.9))

but many tutorials seem to show:

(lr * p.grad) + (p_delta[i]*0.9)
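The two forms differ by more than notation. A rough sketch in plain Python, assuming a constant gradient and assuming the spreadsheet form is read as an exponential moving average of gradients that is then scaled by lr (the function names and the numbers are illustrative, not taken from the course or from PyTorch):

```python
def classical_updates(grad, lr, beta, steps):
    """u_t = lr * grad + beta * u_{t-1}: the form many tutorials show."""
    u, out = 0.0, []
    for _ in range(steps):
        u = lr * grad + beta * u
        out.append(u)
    return out


def averaged_updates(grad, lr, beta, steps):
    """v_t = (1 - beta) * grad + beta * v_{t-1}; the applied step is lr * v_t.

    Assumption: this is how the spreadsheet form is interpreted here.
    """
    v, out = 0.0, []
    for _ in range(steps):
        v = (1 - beta) * grad + beta * v
        out.append(lr * v)
    return out


# Illustrative values only: constant gradient 1.0, lr = 0.1, beta = 0.9.
classical = classical_updates(1.0, 0.1, 0.9, 200)
averaged = averaged_updates(1.0, 0.1, 0.9, 200)
ratio = classical[-1] / averaged[-1]
print(classical[-1], averaged[-1], ratio)
```

Under these assumptions the classical form takes steps that are 1 / (1 - beta) times larger (10 times for beta = 0.9) at the same lr, so the averaged form would need a larger learning rate to make comparable progress.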

If we implement Jeremy's code, the loss actually goes down more slowly than expected. The relevant part of the video is here:

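Yes, it does store the parameter momenta in a dictionary, indexed by their names. I don't know how to prove this rigorously, but I firmly believe it is impossible to apply momentum without an extra buffer the size of the model, which doubles the memory taken by the parameters.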
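A minimal sketch of keeping one momentum buffer per parameter in a dictionary. It assumes the names come from `model.named_parameters()`; the helper `sgd_momentum_step`, the `buffers` dict, and the toy `Linear` model are made up for illustration and are not PyTorch's own optimizer code:

```python
import torch


def sgd_momentum_step(model, buffers, lr=0.01, momentum=0.9):
    """Apply one SGD-with-momentum step in place.

    buffers maps parameter name -> previous update (same shape as the parameter).
    """
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            if name not in buffers:
                # One extra tensor per parameter: this is the unavoidable copy.
                buffers[name] = torch.zeros_like(p)
            # update = momentum * previous_update + lr * grad, computed in place
            buffers[name].mul_(momentum).add_(p.grad, alpha=lr)
            p.sub_(buffers[name])
            p.grad.zero_()


# Toy usage with a hypothetical model and random data.
torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
buffers = {}
x, y = torch.randn(8, 3), torch.randn(8, 1)
for _ in range(5):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    sgd_momentum_step(model, buffers)
```

Updating the buffer in place avoids allocating a fresh tensor on every step, so the extra memory stays at one buffer per parameter.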

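That said, I would not worry about it, because model size is rarely a big factor in the memory consumption of the whole algorithm; keeping the intermediate network activations for backpropagation is far more expensive. Take the VGG-16 network as an example: it has 138 million parameters, which is slightly over 0.5 GB when stored in single precision. Compare that with the 6 GB or more available on modern GPUs.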
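The arithmetic behind that figure, as a quick check. It assumes 4 bytes per single-precision value and decimal gigabytes (1 GB = 1e9 bytes); the helper name is illustrative:

```python
def tensor_bytes(num_values, bytes_per_value=4):
    """Memory needed to store num_values numbers (float32 by default)."""
    return num_values * bytes_per_value


vgg16_params = 138_000_000                      # parameter count quoted above
weights_gb = tensor_bytes(vgg16_params) / 1e9   # weights alone
with_momentum_gb = 2 * weights_gb               # weights plus one momentum buffer
print(weights_gb, with_momentum_gb)
```

Even with the momentum buffer included, the total stays a small fraction of a 6 GB card.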

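Could you elaborate on why the intermediate network activations for backprop are more expensive? Aren't the activations just a vector (one per node)?

I am not sure what you mean by a node. Most operations in PyTorch create a new tensor but also keep their inputs around for the backward pass. You can assume that, as the forward pass progresses, nearly all intermediate values are saved (unlike NumPy, which is free to garbage-collect variables that will see no further use). It is hard to give these numbers analytically, but I suggest you take an example use case (say, training VGG on ImageNet) and try the largest batch size your GPU can handle. Then compute

(gpu_memory - network_weights) / batch_size

and compare it with the 0.5 GB used for the weights.

Ah, OK, thanks!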
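The per-sample estimate suggested in the comments can be written as a small helper. The inputs below are placeholders rather than measurements: a 6 GB card, 0.552 GB of weights, and a hypothetical maximum batch size of 32.

```python
def activation_memory_per_sample(gpu_memory_gb, network_weights_gb, batch_size):
    """(gpu_memory - network_weights) / batch_size, in GB per sample."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return (gpu_memory_gb - network_weights_gb) / batch_size


# Placeholder inputs, not measurements: replace batch_size with the largest
# value that actually fits on your GPU.
per_sample_gb = activation_memory_per_sample(6.0, 0.552, 32)
print(per_sample_gb)
```

Multiplying the result back by the batch size shows how much of the card goes to activations rather than to weights.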