Python: how to update model parameters using accumulated gradients?


I'm building a deep learning model with TensorFlow, and I'm new to TensorFlow.

For certain reasons, my model is limited to a small batch size, and this limited batch size gives the model high variance.

So I want to use some trick to enlarge the effective batch size. My idea is to store the gradients of each mini-batch, for example for 64 mini-batches, then sum the gradients together and use the average gradient over these 64 mini-batches of training data to update the model's parameters.

That means that for the first 63 mini-batches the parameters are not updated, and after the 64th mini-batch the model parameters are updated only once.

But since TensorFlow is graph-based, does anyone know how to implement this desired feature?

Thanks very much.

I found a solution here:

In the training loop:

while True:
    sess.run(zero_ops)
    for i in range(n_minibatches):
        sess.run(accum_ops, feed_dict={X: Xs[i], y: ys[i]})
    sess.run(train_step)
But this code doesn't look very clean and pretty. Does anyone know how to optimize it?
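
For reference, here is a minimal sketch of how the zero_ops, accum_ops and train_step used above can be constructed (this follows the common pattern from the linked answer as I understand it; opt, the optimizer, and loss are assumed to be defined elsewhere):

tvs = tf.trainable_variables()
# non-trainable variables that hold the running gradient sums
accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False)
              for tv in tvs]
zero_ops = [av.assign(tf.zeros_like(av)) for av in accum_vars]

# (gradient, variable) pairs for the current mini-batch
gvs = opt.compute_gradients(loss, tvs)
# add each mini-batch's gradients to the accumulators
accum_ops = [accum_vars[i].assign_add(gv[0]) for i, gv in enumerate(gvs)]
# apply the accumulated gradients to the corresponding variables
train_step = opt.apply_gradients([(accum_vars[i], gv[1]) for i, gv in enumerate(gvs)])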

I ran into the same problem and just solved it.

First get the symbolic gradients, then define the accumulated gradients as tf.Variables. (It seems that tf.global_variables_initializer() has to be run before defining grads_accum; otherwise I got errors, not sure why.)

tvars = tf.trainable_variables()
optimizer = tf.train.GradientDescentOptimizer(lr)
gradients = tf.gradients(cost, tvars)
# initialization
tf.local_variables_initializer().run()
tf.global_variables_initializer().run()
grads_accum = [tf.Variable(tf.zeros_like(v)) for v in gradients]
update_op = optimizer.apply_gradients(zip(grads_accum, tvars))
During training, you accumulate the gradients at each batch (saved in gradients_accum) and update the model after running the 64th batch:

feed_dict = dict()
for i, _grads in enumerate(gradients_accum):
    feed_dict[grads_accum[i]] = _grads
sess.run(fetches=[update_op], feed_dict=feed_dict) 
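
The answer doesn't show the accumulation step itself; here is a minimal sketch of how gradients_accum could be gathered before feeding update_op above (the numpy-side summation and the placeholder names X, y, Xs, ys are my assumptions):

n_minibatches = 64
gradients_accum = None
for i in range(n_minibatches):
    # evaluate the symbolic gradients on one mini-batch
    batch_grads = sess.run(gradients, feed_dict={X: Xs[i], y: ys[i]})
    if gradients_accum is None:
        gradients_accum = batch_grads
    else:
        gradients_accum = [a + g for a, g in zip(gradients_accum, batch_grads)]
# average over the mini-batches before feeding update_op
gradients_accum = [g / n_minibatches for g in gradients_accum]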
For usage you can refer to the linked example, especially this function: testGradientsAsVariables().


Hope it helps.

The previous solutions do not compute the average of the accumulated gradients, which can lead to instability in training. I've modified the above code, which should solve this problem.

# Fetch a list of our network's trainable parameters.
trainable_vars = tf.trainable_variables()

# Create variables to store accumulated gradients
accumulators = [
    tf.Variable(
        tf.zeros_like(tv.initialized_value()),
        trainable=False
    ) for tv in trainable_vars
]

# Create a variable for counting the number of accumulations
accumulation_counter = tf.Variable(0.0, trainable=False)

# Compute gradients; grad_pairs contains (gradient, variable) pairs
grad_pairs = optimizer.compute_gradients(loss, trainable_vars)

# Create operations which add a variable's gradient to its accumulator.
accumulate_ops = [
    accumulator.assign_add(
        grad
    ) for (accumulator, (grad, var)) in zip(accumulators, grad_pairs)
]

# The final accumulation operation is to increment the counter
accumulate_ops.append(accumulation_counter.assign_add(1.0))

# Update trainable variables by applying the accumulated gradients
# divided by the counter. Note: apply_gradients takes in a list of 
# (grad, var) pairs
train_step = optimizer.apply_gradients(
    [(accumulator / accumulation_counter, var) \
        for (accumulator, (grad, var)) in zip(accumulators, grad_pairs)]
)

# Accumulators must be zeroed once the accumulated gradient is applied.
zero_ops = [
    accumulator.assign(
        tf.zeros_like(tv)
    ) for (accumulator, tv) in zip(accumulators, trainable_vars)
]

# Add one last op for zeroing the counter
zero_ops.append(accumulation_counter.assign(0.0))

This code is used in the same way as the one provided by @weixsong.
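
For completeness, a minimal sketch of that usage (the placeholder names X, y and the data Xs, ys are assumptions):

n_minibatches = 64
while True:
    sess.run(zero_ops)  # reset the accumulators and the counter
    for i in range(n_minibatches):
        sess.run(accumulate_ops, feed_dict={X: Xs[i], y: ys[i]})
    sess.run(train_step)  # apply the averaged gradients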

The method you posted seems to fail if I don't provide the feed_dict again in sess.run(train_step). I don't know why the feed_dict is needed, but it's possible that all the accumulators run again, repeating the last example. In my case I had to do this:

self.session.run(zero_ops)
for i in range(0, mini_batch):
    self.session.run(accum_ops, feed_dict={self.ph_X: imgs_feed[np.newaxis, i, :, :, :],
                                           self.ph_Y: flow_labels[np.newaxis, i, :, :, :],
                                           self.keep_prob: self.dropout})

self.session.run(norm_accums, feed_dict={self.ph_X: imgs_feed[np.newaxis, i, :, :, :],
                                         self.ph_Y: flow_labels[np.newaxis, i, :, :, :],
                                         self.keep_prob: self.dropout})
self.session.run(train_op, feed_dict={self.ph_X: imgs_feed[np.newaxis, i, :, :, :],
                                      self.ph_Y: flow_labels[np.newaxis, i, :, :, :],
                                      self.keep_prob: self.dropout})
As for normalizing the gradients, I understood that this is just dividing the accumulated gradients by the batch size, so I only added a new op:

norm_accums = [accum_op/float(batchsize) for accum_op in accum_ops]
Has anybody had the same issue?

*Update: As I said, this is wrong: it runs the whole graph again with the last example in the batch. This small test code shows it:

import numpy as np
import tensorflow as tf

ph = tf.placeholder(dtype=tf.float32, shape=[])
var_accum = tf.get_variable("acum", shape=[],
                            initializer=tf.zeros_initializer())
acum = tf.assign_add(var_accum, ph)
divide = acum / 5.0
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(5):
        sess.run(acum, feed_dict={ph: 2.0})

    c = sess.run([divide], feed_dict={ph: 2.0})
    # 10 / 5 = 2
    print(c)
    # but it gives 2.4, i.e. 12 / 5, so it sums one more time
I figured out a way to solve this. TensorFlow has conditional operations, so I put the accumulation in one branch and the final normalization-and-update in the other branch. My code is a mess, but for a quick check, here is a small usage example:

import numpy as np
import tensorflow as tf

ph = tf.placeholder(dtype=tf.float32, shape=[])
# placeholder for conditional branching in the graph
condph = tf.placeholder(dtype=tf.bool, shape=[])

var_accum = tf.get_variable("acum", shape=[], initializer=tf.zeros_initializer())

accum_op = tf.assign_add(var_accum, ph)

# function when the condition of condph is True
def truefn():
    return accum_op

# function when the condition of condph is False
def falsefn():
    div = accum_op / 5.0
    return div

# return the conditional operation
cond = tf.cond(condph, truefn, falsefn)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(4):
        # run only the accumulation
        sess.run(cond, feed_dict={ph: 2.0, condph: True})
    # run accumulation and division
    c = sess.run(cond, feed_dict={ph: 2.0, condph: False})

print(c)
# now gives 2
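
A note on why this prints 2: accum_op is created outside the tf.cond branch functions, so it is executed on every sess.run(cond) regardless of condph (a known tf.cond gotcha). The final run therefore still performs a fifth accumulation (reaching 10) before dividing by 5.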

*Important note: forget all of the above, it doesn't work. The optimizers fail.

You could use PyTorch instead of TensorFlow, as it allows the user to accumulate gradients during training.
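
For illustration, a minimal sketch of gradient accumulation in PyTorch (model, criterion, loader and optimizer are assumed to already exist); .backward() adds into each parameter's .grad by default, so you simply delay the optimizer step:

accumulation_steps = 64

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    # scale the loss so the accumulated gradient is an average, not a sum
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()  # gradients accumulate in param.grad
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # update with the averaged gradients
        optimizer.zero_grad()  # reset for the next accumulation window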

Tensorflow 2.0 compatible answer: In line with weixsong's answer mentioned above and the explanation provided there, below is the code for accumulating gradients in Tensorflow version 2.0:

def train(epochs):
  # Create the accumulators once, outside the training loop
  tvs = mnist_model.trainable_variables
  accum_vars = [tf.Variable(tf.zeros_like(tv), trainable=False) for tv in tvs]

  for epoch in range(epochs):
    # Zero the accumulators at the start of each accumulation window
    for av in accum_vars:
      av.assign(tf.zeros_like(av))

    for (batch, (images, labels)) in enumerate(dataset):
      with tf.GradientTape() as tape:
        logits = mnist_model(images, training=True)
        loss_value = loss_object(labels, logits)

      loss_history.append(loss_value.numpy().mean())
      grads = tape.gradient(loss_value, tvs)
      # Add this batch's gradients to the accumulators
      for i, grad in enumerate(grads):
        accum_vars[i].assign_add(grad)

    # Apply the accumulated gradients (divide by the number of
    # batches here if you want the average instead of the sum)
    optimizer.apply_gradients(zip(accum_vars, tvs))
    print('Epoch {} finished'.format(epoch))

# call the above function
train(epochs = 3)
The complete code can be found here.