Neural network: Understanding the backward mechanism of LSTMCell in PyTorch

Tags: neural-network, lstm, pytorch, recurrent-neural-network

I want to hook into the backward pass of the LSTMCell function in PyTorch, so during initialization I do the following (num_layers=4, hidden_size=1000, input_size=1000):
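(The initialization code did not survive here. Below is a minimal sketch of what such a setup might look like, assuming a stack of LSTMCells with the legacy module backward hook registered on each; the class and variable names are my own, not the original code.)

import torch.nn as nn

# Hypothetical reconstruction of the setup described above.
# backward_hook is the function shown further down in the question.
class StackedLSTMCells(nn.Module):
    def __init__(self, input_size=1000, hidden_size=1000, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.LSTMCell(input_size if i == 0 else hidden_size, hidden_size)
            for i in range(num_layers)
        ])
        for rnn in self.layers:
            # hook into the backward pass of every cell
            rnn.register_backward_hook(backward_hook)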

During the forward pass, I simply iterate the LSTMCell over the sequence length and over num_layers, like this:

for j in range(seqlen):            
    input = #some tensor of size (batch_size, input_size)
    for i, rnn in enumerate(self.layers):
        # recurrent cell
        hidden, cell = rnn(input, (prev_hiddens[i], prev_cells[i]))
Here input has size (batch_size, input_size), prev_hiddens[i] has size (batch_size, hidden_size), and prev_cells[i] has size (batch_size, hidden_size).
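
(For reference, here is a self-contained sketch of this per-time-step, per-layer loop with small made-up sizes; it is not the original code.)

import torch
import torch.nn as nn

# Minimal illustration: two stacked LSTMCells unrolled over a short sequence.
batch_size, seqlen, input_size, hidden_size, num_layers = 3, 5, 8, 16, 2

layers = nn.ModuleList([
    nn.LSTMCell(input_size if i == 0 else hidden_size, hidden_size)
    for i in range(num_layers)
])

prev_hiddens = [torch.zeros(batch_size, hidden_size) for _ in range(num_layers)]
prev_cells = [torch.zeros(batch_size, hidden_size) for _ in range(num_layers)]

x = torch.randn(seqlen, batch_size, input_size)
for j in range(seqlen):
    input = x[j]                                   # (batch_size, input_size)
    for i, rnn in enumerate(layers):
        hidden, cell = rnn(input, (prev_hiddens[i], prev_cells[i]))
        prev_hiddens[i], prev_cells[i] = hidden, cell
        input = hidden                             # feed this layer's output upward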

In backward_hook I print the sizes of the tensors passed to that function:

def backward_hook(module, grad_input, grad_output):
    for grad in grad_output:
        print("grad_output {}".format(grad))

    for grad in grad_input:
        print("grad_input.size() {}".format(grad.size()))
As a result, the first time backward_hook is called I observe, for example:

[A] For grad_output I get two tensors, where the second one is None. This is understandable, because in the backward pass we have the gradient of the internal cell state (c) and the gradient of the output (h). The last iteration in the time dimension has no hidden state coming from the future, so its gradient is zero (hence the None).

[B] For grad_input, I get 5 tensors (batch_size = 9):

My questions are:

(1) Is my understanding correct?

(2) How should the 5 tensors in the grad_input tuple be interpreted? I thought there should be only 3, since LSTMCell's forward() takes only 3 inputs.

Thanks

Your understanding of grad_input and grad_output is wrong. I will try to explain it with a simple example.

import torch
import torch.nn as nn

def backward_hook(module, grad_input, grad_output):
    for grad in grad_output:
        print("grad_output.size {}".format(grad.size()))

    for grad in grad_input:
        if grad is None:
            print('None')
        else:
            print("grad_input.size: {}".format(grad.size()))
    print()

model = nn.Linear(10, 20)
model.register_backward_hook(backward_hook)

input = torch.randn(8, 3, 10)
Y = torch.randn(8, 3, 20)

Y_pred = []
for i in range(input.size(1)):
    out = model(input[:, i])
    Y_pred.append(out)

loss = torch.norm(Y - torch.stack(Y_pred, dim=1), 2)
loss.backward()
The output is:

grad_output.size torch.Size([8, 20])
grad_input.size: torch.Size([8, 20])
None
grad_input.size: torch.Size([10, 20])

grad_output.size torch.Size([8, 20])
grad_input.size: torch.Size([8, 20])
None
grad_input.size: torch.Size([10, 20])

grad_output.size torch.Size([8, 20])
grad_input.size: torch.Size([8, 20])
None
grad_input.size: torch.Size([10, 20])
Explanation

  • grad_output: the gradient of the loss w.r.t. the layer's output, Y_pred.

  • grad_input: the gradient of the loss w.r.t. the layer's inputs. For the Linear layer, the inputs are the input tensor together with the weight and bias.

The hook fires once per call of the module inside the loop, which is why the printout above repeats three times.

So, in each block of the output above you can see:

grad_input.size: torch.Size([8, 20])  # for the `bias`
None                                  # for the `input`
grad_input.size: torch.Size([10, 20]) # for the `weight`
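
(A hypothetical check I am adding here, not part of the original answer: printing the module's parameter shapes next to the hook output makes the mapping easier to see. nn.Linear stores weight with shape (out_features, in_features) = (20, 10); the legacy backward hook reports gradients of the tensors actually fed into the underlying matrix multiplication, which is presumably why the weight-related entry shows up transposed as [10, 20] and the bias-related entry still carries the batch dimension as [8, 20].)

# Hypothetical sanity check: compare the hook's shapes with the parameters.
print(model.weight.shape)       # torch.Size([20, 10])
print(model.bias.shape)         # torch.Size([20])
# After loss.backward(), the accumulated parameter gradients have the
# parameter shapes, unlike the per-call shapes seen inside the hook:
print(model.weight.grad.shape)  # torch.Size([20, 10])
print(model.bias.grad.shape)    # torch.Size([20])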

The Linear layer in PyTorch uses LinearFunction, which looks like this:

from torch.autograd import Function

class LinearFunction(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    # bias is an optional argument
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0).squeeze(0)

        return grad_input, grad_weight, grad_bias
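
A custom autograd Function like this is invoked through .apply. The following usage sketch is my own illustration (not part of the original answer); it just confirms that backward() returns one gradient per forward() input:

import torch

# Illustrative use of the LinearFunction defined above.
x = torch.randn(8, 10, requires_grad=True)
w = torch.randn(20, 10, requires_grad=True)
b = torch.randn(20, requires_grad=True)

out = LinearFunction.apply(x, w, b)   # shape (8, 20)
out.sum().backward()

print(x.grad.shape)  # torch.Size([8, 10])  -> grad_input in backward()
print(w.grad.shape)  # torch.Size([20, 10]) -> grad_weight
print(b.grad.shape)  # torch.Size([20])     -> grad_bias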

For an LSTM, there are four sets of weight parameters:

weight_ih_l0
weight_hh_l0
bias_ih_l0
bias_hh_l0
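
(An aside of mine, not from the original answer: the names above are the per-layer parameters of nn.LSTM; a single nn.LSTMCell, as used in the question, carries the same four tensors without the _l0 suffix. You can list them directly:)

import torch.nn as nn

# Hypothetical illustration: the four parameter tensors of one LSTMCell.
cell = nn.LSTMCell(input_size=1000, hidden_size=1000)
for name, p in cell.named_parameters():
    print(name, tuple(p.shape))
# weight_ih (4000, 1000)
# weight_hh (4000, 1000)
# bias_ih   (4000,)
# bias_hh   (4000,)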

So, in your case, grad_input will be a tuple of 5 tensors. And, as you mention, grad_output is two tensors.
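
To inspect this directly on an nn.LSTMCell, here is a small end-to-end sketch of my own (tiny made-up sizes). Note that the exact number and meaning of the entries reported by the legacy register_backward_hook can depend on the PyTorch version, so treat the printout as something to verify on your installation:

import torch
import torch.nn as nn

def hook(module, grad_input, grad_output):
    # Report how many gradient slots the hook exposes and their shapes.
    print("grad_output:", [None if g is None else tuple(g.shape) for g in grad_output])
    print("grad_input: ", [None if g is None else tuple(g.shape) for g in grad_input])

cell = nn.LSTMCell(input_size=8, hidden_size=16)
cell.register_backward_hook(hook)

x = torch.randn(4, 8)
h0 = torch.zeros(4, 16)
c0 = torch.zeros(4, 16)

h1, c1 = cell(x, (h0, c0))
(h1.sum() + c1.sum()).backward()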


Why, in the Linear case (your example), is the gradient for the input None while the bias gets two gradients? (Is there a typo?) I still don't understand why the tensors in grad_input have the sizes they do for my output. If the weight were included, it should be (1000x1000) or (1000x4000). The tensors I get do not have sizes like that.