Understanding the backward mechanism of LSTMCell in PyTorch
I want to hook into the backward pass of `LSTMCell` in PyTorch, so during initialization I register a backward hook on the cells (num_layers=4, hidden_size=1000, input_size=1000). In the forward pass, I simply iterate `LSTMCell` over the sequence length and over num_layers, as follows:
for j in range(seqlen):
    input = ...  # placeholder: some tensor of size (batch_size, input_size)
    for i, rnn in enumerate(self.layers):
        # recurrent cell
        hidden, cell = rnn(input, (prev_hiddens[i], prev_cells[i]))
Here `input` has size (batch_size, input_size), `prev_hiddens[i]` has size (batch_size, hidden_size), and `prev_cells[i]` has size (batch_size, hidden_size).
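The loop above can be made concrete as a minimal, self-contained sketch. The sizes are shrunk for speed, and two details are assumptions that the question does not show: the cells are stored in an `nn.ModuleList`, and each layer's hidden state is fed as the input to the next layer.

```python
import torch
import torch.nn as nn

# Small, hypothetical sizes (the question uses 4 layers of size 1000).
num_layers, input_size, hidden_size = 2, 6, 6
batch_size, seqlen = 3, 5

layers = nn.ModuleList(
    [nn.LSTMCell(input_size if i == 0 else hidden_size, hidden_size)
     for i in range(num_layers)]
)

# One (h, c) pair per layer, each of size (batch_size, hidden_size).
prev_hiddens = [torch.zeros(batch_size, hidden_size) for _ in range(num_layers)]
prev_cells = [torch.zeros(batch_size, hidden_size) for _ in range(num_layers)]

x = torch.randn(seqlen, batch_size, input_size)
outputs = []
for j in range(seqlen):
    inp = x[j]  # (batch_size, input_size)
    for i, rnn in enumerate(layers):
        hidden, cell = rnn(inp, (prev_hiddens[i], prev_cells[i]))
        # Assumption: the hidden state of layer i is the input to layer i + 1.
        inp = hidden
        prev_hiddens[i], prev_cells[i] = hidden, cell
    outputs.append(hidden)

out = torch.stack(outputs)  # (seqlen, batch_size, hidden_size)
print(tuple(out.shape))
```

Each call to a cell returns a new `(hidden, cell)` pair of size (batch_size, hidden_size), which is why `grad_output` in a hook on the cell has two entries.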
In `backward_hook`, I print the sizes of the tensors passed to the function:
def backward_hook(module, grad_input, grad_output):
    for grad in grad_output:
        print("grad_output {}".format(grad))
    for grad in grad_input:
        print("grad_input.size() {}".format(grad.size()))
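As a result, the first time `backward_hook` is called I see, for example:
[A] For `grad_output` I get two tensors, the second of which is `None`. This is understandable, because in the backward phase we have the gradient of the internal state (c) and the gradient of the output (h). The last iteration in the time dimension has no future hidden state, so its gradient is `None`.
[B] For `grad_input` I get 5 tensors (with batch_size=9).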
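My questions are:
(1) Is my understanding of [A] correct?
(2) How should I interpret the 5 tensors in the `grad_input` tuple? I think there should be only 3, since `LSTMCell.forward()` takes only 3 inputs.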
Thanks.

Your understanding of `grad_input` and `grad_output` is wrong. I will try to explain it with a simpler example.
import torch
import torch.nn as nn

def backward_hook(module, grad_input, grad_output):
    for grad in grad_output:
        print("grad_output.size {}".format(grad.size()))
    for grad in grad_input:
        if grad is None:
            print('None')
        else:
            print("grad_input.size: {}".format(grad.size()))
    print()

model = nn.Linear(10, 20)
# register_backward_hook is the legacy module hook API used in this answer
model.register_backward_hook(backward_hook)

input = torch.randn(8, 3, 10)
Y = torch.randn(8, 3, 20)

# Apply the layer once per time step, so the hook fires three times
Y_pred = []
for i in range(input.size(1)):
    out = model(input[:, i])
    Y_pred.append(out)

loss = torch.norm(Y - torch.stack(Y_pred, dim=1), 2)
loss.backward()
The output is:
grad_output.size torch.Size([8, 20])
grad_input.size: torch.Size([8, 20])
None
grad_input.size: torch.Size([10, 20])

grad_output.size torch.Size([8, 20])
grad_input.size: torch.Size([8, 20])
None
grad_input.size: torch.Size([10, 20])

grad_output.size torch.Size([8, 20])
grad_input.size: torch.Size([8, 20])
None
grad_input.size: torch.Size([10, 20])
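Explanation
`grad_output`: the gradient of the loss w.r.t. the layer output, `Y_pred`.
`grad_input`: the gradient of the loss w.r.t. the layer inputs. For a `Linear` layer, the inputs are the `input` tensor, the `weight`, and the `bias`.
So in each block of the output, the three `grad_input` entries are: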
grad_input.size: torch.Size([8, 20]) # for the `bias`
None # for the `input`
grad_input.size: torch.Size([10, 20]) # for the `weight`
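The `grad_input` layout above (bias, input, weight) reflects the hook API used in this answer. As a hedged sketch, assuming PyTorch 1.8 or later, where `Module.register_full_backward_hook` exists, the same experiment can be rerun so that `grad_input` contains only the gradient with respect to the layer's tensor input. The `shapes` list and the `requires_grad=True` on `x` are additions made for illustration.

```python
import torch
import torch.nn as nn

shapes = []  # hypothetical container recording what the hook receives

def full_backward_hook(module, grad_input, grad_output):
    # grad_output: gradients w.r.t. the module's outputs.
    # grad_input: gradients w.r.t. the module's positional tensor inputs.
    shapes.append((
        [None if g is None else tuple(g.shape) for g in grad_input],
        [None if g is None else tuple(g.shape) for g in grad_output],
    ))

model = nn.Linear(10, 20)
model.register_full_backward_hook(full_backward_hook)

# requires_grad=True so that a gradient w.r.t. the input is computed.
x = torch.randn(8, 3, 10, requires_grad=True)
Y = torch.randn(8, 3, 20)
Y_pred = [model(x[:, i]) for i in range(x.size(1))]

loss = torch.norm(Y - torch.stack(Y_pred, dim=1), 2)
loss.backward()

for grad_in, grad_out in shapes:
    print("grad_input:", grad_in, "grad_output:", grad_out)
```

With this hook the weight and bias gradients do not appear in `grad_input`; after `backward()` they can be read from `model.weight.grad` and `model.bias.grad`.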
The `Linear` layer in PyTorch uses a `LinearFunction`, which looks like this:
from torch.autograd import Function

class LinearFunction(Function):
    # Note that both forward and backward are @staticmethods
    # bias is an optional argument
    @staticmethod
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0).squeeze(0)

        return grad_input, grad_weight, grad_bias
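The three formulas in `backward` can be checked against autograd directly. The following is a small sketch with made-up sizes; the variable names are chosen for illustration and are not part of PyTorch.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 10, requires_grad=True)   # input
w = torch.randn(20, 10, requires_grad=True)  # weight, (out_features, in_features)
b = torch.randn(20, requires_grad=True)      # bias

out = x.mm(w.t()) + b                # same computation as forward()
grad_output = torch.randn_like(out)  # stand-in for d(loss)/d(out)

# Gradients computed by autograd.
auto_x, auto_w, auto_b = torch.autograd.grad(out, (x, w, b), grad_output)

# Gradients computed with the formulas from backward().
manual_x = grad_output.mm(w)      # (8, 20) @ (20, 10) -> (8, 10)
manual_w = grad_output.t().mm(x)  # (20, 8) @ (8, 10)  -> (20, 10)
manual_b = grad_output.sum(0)     # (20,)

print(torch.allclose(auto_x, manual_x, atol=1e-5),
      torch.allclose(auto_w, manual_w, atol=1e-5),
      torch.allclose(auto_b, manual_b, atol=1e-5))
```

Note the shapes: the input gradient matches the input, (8, 10), and the weight gradient matches the weight, (20, 10).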
For LSTM, there are four sets of weight parameters:
weight_ih_l0
weight_hh_l0
bias_ih_l0
bias_hh_l0
So, in your case, `grad_input` would be a tuple of 5 tensors. And, as you mentioned, `grad_output` is two tensors.
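One way to see which gradients exist for a single cell, independent of any module hook, is to run one step and inspect `.grad` afterwards. This is a sketch with small hypothetical sizes; note that `nn.LSTMCell` names its parameters `weight_ih`, `weight_hh`, `bias_ih`, and `bias_hh` (the `_l0` suffix belongs to `nn.LSTM`), and that setting `requires_grad=True` on the inputs is done here only so their gradients are retained.

```python
import torch
import torch.nn as nn

batch_size, input_size, hidden_size = 9, 6, 8  # small, hypothetical sizes
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(batch_size, input_size, requires_grad=True)
h0 = torch.zeros(batch_size, hidden_size, requires_grad=True)
c0 = torch.zeros(batch_size, hidden_size, requires_grad=True)

h1, c1 = cell(x, (h0, c0))
(h1.sum() + c1.sum()).backward()

# Gradients w.r.t. the three forward() inputs.
for name, t in [("input", x), ("h0", h0), ("c0", c0)]:
    print(name, tuple(t.grad.shape))

# Gradients w.r.t. the four parameters. The four gates are stacked along
# dim 0, hence the leading dimension of 4 * hidden_size.
for name, p in cell.named_parameters():
    print(name, tuple(p.grad.shape))
```

By the same rule, input_size = hidden_size = 1000 gives weight gradients of shape (4000, 1000) and bias gradients of shape (4000,).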
Why, in the linear case (your example), is the gradient for the input `None` while the bias has two gradients? (Is there a typo?) I still don't understand why the tensors in `grad_input` have the sizes they do in my output. If the weights were included, they should be (1000x1000) or (1000x4000). The tensors I get do not have those sizes.