Python TensorFlow: _variable_with_weight_decay(...) explained


python tensorflow neural-network


I am currently looking through a file in which I noticed the function _variable_with_weight_decay(...). The code is as follows:

def _variable_with_weight_decay(name, shape, stddev, wd):
  """Helper to create an initialized Variable with weight decay.
  Note that the Variable is initialized with a truncated normal distribution.
  A weight decay is added only if one is specified.
  Args:
    name: name of the variable
    shape: list of ints
    stddev: standard deviation of a truncated Gaussian
    wd: add L2Loss weight decay multiplied by this float. If None, weight
        decay is not added for this Variable.
  Returns:
    Variable Tensor
  """
  dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
  var = _variable_on_cpu(
      name,
      shape,
      tf.truncated_normal_initializer(stddev=stddev, dtype=dtype))
  if wd is not None:
    weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
    tf.add_to_collection('losses', weight_decay)
  return var

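I am wondering whether this function does what it says. It is clear that when a weight decay factor is given (wd is not None), the decay value (weight_decay) is computed. But is it ever applied? In the end the unmodified variable (var) is returned, or am I missing something?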

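My second question is how to fix this. As far as I understand, the value of the scalar weight_decay has to be subtracted from each element of the weight matrix, but I could not find a TensorFlow op that does this (adding or subtracting a single value to or from every element of a tensor). Is there such an op? As a workaround, I thought it might be possible to create a new tensor initialized with the value of weight_decay and use tf.subtract(...) to get the same result. Or is that the right way to go anyway?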

Thanks in advance.

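The code does what it says. You are supposed to sum everything in the 'losses' collection (the weight decay term is added to it in the second-to-last line of the function) to compute the loss that you pass to the optimizer. In the loss() function in that example: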

tf.add_to_collection('losses', cross_entropy_mean)
[...]
return tf.add_n(tf.get_collection('losses'), name='total_loss')

So what the loss() function returns is the classification loss plus everything that was previously in the 'losses' collection.
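The collection pattern described above can be illustrated without TensorFlow. The following is only a minimal sketch in plain Python: `add_to_collection`, `get_collection`, `variable_with_weight_decay`, and `total_loss` are hypothetical stand-ins written for this illustration, not the TensorFlow functions themselves, and the numeric values are made up.

```python
from collections import defaultdict

# Hypothetical stand-in for a graph collection: name -> list of values.
_collections = defaultdict(list)

def add_to_collection(name, value):
    _collections[name].append(value)

def get_collection(name):
    return list(_collections[name])

def l2_loss(values):
    """Same formula as the one quoted for l2_loss: sum(t_i ** 2) / 2."""
    return sum(v * v for v in values) / 2.0

def variable_with_weight_decay(values, wd):
    """Return the weights unchanged; registering the penalty is a side effect."""
    if wd is not None:
        add_to_collection('losses', wd * l2_loss(values))
    return values

def total_loss(data_loss):
    """Add the data loss to the collection, then sum everything in it."""
    add_to_collection('losses', data_loss)
    return sum(get_collection('losses'))

weights = variable_with_weight_decay([1.0, -2.0, 3.0], wd=0.5)  # penalty 0.5 * 7.0
biases = variable_with_weight_decay([0.5, 0.5], wd=None)        # no penalty registered
loss = total_loss(data_loss=2.0)
print(loss)
```

The weights come back unmodified, which is what puzzled the asker; the decay only enters through the summed loss that an optimizer would later differentiate.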

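As a side note, weight decay does not mean that wd is subtracted from every value in the tensor as part of the update step; instead, the update multiplies each value by (1 - learning_rate * wd) (in plain SGD). To see why, recall that l2_loss computes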

output = sum(t_i ** 2) / 2

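with t_i being the elements of the tensor. This means that the derivative of l2_loss with respect to each tensor element is the value of that element itself, and since you scaled l2_loss by wd, the derivative is scaled by wd as well.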
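The derivative claim can be checked numerically. The sketch below is plain Python and uses a central finite difference; the helper names and the values of `t` and `wd` are assumptions chosen only for illustration.

```python
def l2_loss(t):
    """sum(t_i ** 2) / 2, the formula quoted above."""
    return sum(x * x for x in t) / 2.0

def weight_loss(t, wd):
    """The weight decay term: wd * l2_loss(t)."""
    return wd * l2_loss(t)

def numerical_grad(f, t, eps=1e-6):
    """Central finite-difference estimate of df/dt_i for every element."""
    grad = []
    for i in range(len(t)):
        up = list(t)
        up[i] += eps
        down = list(t)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

t = [1.0, -2.0, 3.0]   # illustrative tensor elements
wd = 0.004             # illustrative weight decay factor

grad = numerical_grad(lambda v: weight_loss(v, wd), t)
expected = [wd * x for x in t]  # analytic result: wd * t_i

for g, e in zip(grad, expected):
    print(g, e)
```

Each estimated derivative should agree with wd * t_i up to floating-point rounding.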

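Since the update step (again, in plain SGD) is (forgive me for omitting the time step indices)

w := w - learning_rate * dL/dw

you get, if you only had the weight decay term,

w := w - learning_rate * wd * w

or

w := w * (1 - learning_rate * wd)

Thanks for the quick answer. You are right. I was confused by the complex structure of the code and forgot that weight decay does not affect the structure of the graph; it is only used during the weight update.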
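To make the two forms of the weight-decay-only update from the answer concrete, here is a small sketch in plain Python. It assumes plain SGD with no other loss terms; `sgd_step` and the values of `learning_rate`, `wd`, and `w` are hypothetical and chosen only for illustration.

```python
def sgd_step(w, grad, learning_rate):
    """Plain SGD: w := w - learning_rate * grad, element by element."""
    return [wi - learning_rate * gi for wi, gi in zip(w, grad)]

learning_rate = 0.1   # illustrative value
wd = 0.5              # illustrative value
w = [1.0, -2.0, 3.0]

# Gradient of wd * l2_loss(w) with respect to each w_i is wd * w_i.
grad = [wd * wi for wi in w]

subtractive = sgd_step(w, grad, learning_rate)
multiplicative = [wi * (1 - learning_rate * wd) for wi in w]

# Repeating the step shrinks every weight geometrically toward zero.
steps = 10
decayed = list(w)
for _ in range(steps):
    decayed = sgd_step(decayed, [wd * wi for wi in decayed], learning_rate)
closed_form = [wi * (1 - learning_rate * wd) ** steps for wi in w]

print(subtractive, multiplicative)
print(decayed, closed_form)
```

Both forms give the same weights up to rounding, and the weights are rescaled rather than shifted by a constant, which is why no op that subtracts a scalar from every element is needed.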