Batch normalization causes a huge gap between training and inference loss (Python, TensorFlow, batch-normalization)


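I followed the instructions on the TensorFlow website and set training=True during training and training=False during inference (validation and test).
However, batch normalization always gives me a huge gap between training and validation loss, for example: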
2018-09-11 09:22:34: step 993, loss 1.23001, acc 0.488638
2018-09-11 09:22:35: step 994, loss 0.969551, acc 0.567364
2018-09-11 09:22:35: step 995, loss 1.31113, acc 0.5291
2018-09-11 09:22:35: step 996, loss 1.03135, acc 0.607861
2018-09-11 09:22:35: step 997, loss 1.16031, acc 0.549255
2018-09-11 09:22:36: step 998, loss 1.42303, acc 0.454694
2018-09-11 09:22:36: step 999, loss 1.33105, acc 0.496234
2018-09-11 09:22:36: step 1000, loss 1.14326, acc 0.527387
Round 4: valid
Loading from valid, 1383 samples available
2018-09-11 09:22:36: step 1000, loss 44.3765, acc 0.000743037
2018-09-11 09:22:36: step 1000, loss 36.9143, acc 0.0100708
2018-09-11 09:22:37: step 1000, loss 35.2007, acc 0.0304909
2018-09-11 09:22:37: step 1000, loss 39.9036, acc 0.00510307
2018-09-11 09:22:37: step 1000, loss 42.2612, acc 0.000225067
2018-09-11 09:22:37: step 1000, loss 29.9964, acc 0.0230831
2018-09-11 09:22:37: step 1000, loss 28.1444, acc 0.00278473

Sometimes it is even worse (for another model).
The batch normalization code I use:
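import tensorflow as tf  # TensorFlow 1.x API (tf.layers, tf.contrib)
# Assumption: the post does not show where xavier_initializer and relu come from;
# these are the usual TensorFlow 1.x sources for those names.
from tensorflow.contrib.layers import xavier_initializer
relu = tf.nn.relu

def bn(inp, train_flag, name=None):
    # training=True normalizes with batch statistics and registers the
    # moving-average updates in tf.GraphKeys.UPDATE_OPS;
    # training=False normalizes with the stored moving averages.
    return tf.layers.batch_normalization(inp, training=train_flag, name=name)

def gn(inp, groups=32):
    return tf.contrib.layers.group_norm(inp, groups=groups)

def conv(*args, padding='same', with_relu=True, with_bn=False,
         train_flag=None, with_gn=False, name=None, **kwargs):
    # Positional args: inp, filters, kernel_size, strides
    # The conv bias is redundant when batch norm follows, so it is dropped.
    use_bias = not with_bn
    x = tf.layers.conv2d(*args, **kwargs, padding=padding,
                         kernel_initializer=xavier_initializer(),
                         use_bias=use_bias, name=name)
    if with_bn:
        bn_name = name + '/batchnorm' if name is not None else None
        x = bn(x, train_flag, name=bn_name)
    if with_gn:
        x = gn(x)
    if with_relu:
        x = relu(x)
    return x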

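After I remove the batch normalization layers, the huge gap between training and validation loss disappears.
The following code is used for the optimization: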
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):

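The model is trained from scratch, without transfer learning.
I followed the advice in a related question and tried lowering the momentum, but that did not help either.
I would like to know why this happens. I would appreciate any advice.
Added: train_flag is a placeholder used throughout the model.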
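On the momentum attempt mentioned above: tf.layers.batch_normalization updates its moving statistics as moving = momentum * moving + (1 - momentum) * batch_value, with a default momentum of 0.99. The following is a minimal sketch in plain Python, not TensorFlow. The helper ema_after is hypothetical, and the batch statistic is assumed to stay constant at target. It shows how close the moving value gets to that target after a given number of updates:

```python
def ema_after(steps, momentum, target=1.0, init=0.0):
    """Apply moving = momentum * moving + (1 - momentum) * target `steps` times.

    Hypothetical helper: `target` stands in for a batch statistic that is
    assumed constant, `init` for the initial moving value.
    """
    moving = init
    for _ in range(steps):
        moving = momentum * moving + (1.0 - momentum) * target
    return moving


# Compare the default momentum (0.99) with a lowered one (0.9).
for momentum in (0.99, 0.9):
    for steps in (10, 100, 1000):
        print(momentum, steps, round(ema_after(steps, momentum), 5))
```

Under these assumptions, 1000 updates at momentum 0.99 already bring the moving value to well within 0.01% of the target. Slow warm-up alone would therefore not explain a gap like the one in the log at step 1000, provided the update ops actually run.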
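Since you have not provided the complete code or a link to it, I need to ask the following:
How do you feed train_flag?
The correct way is to make train_flag a tf.placeholder. There are other ways, but this is the simplest. You can then feed it a plain Python bool.
If you manually set train_flag=True during training and train_flag=False during validation, that may be the source of the problem. I do not see reuse=tf.AUTO_REUSE in your code. This means that during validation, when you set train_flag=False, a separate layer is created that does not share weights with the layer used during training.
The reason the problem disappears when you do not use batch normalization is that, in that case, the convolution layers do not need train_flag, so everything works fine.
This is speculation based on what I can observe.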
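In my case, I mistakenly called update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) only once.
With multiple GPUs, tf.get_collection(tf.GraphKeys.UPDATE_OPS) needs to be called for each GPU, before and after defining each sub-network. In addition, after merging all the sub-network towers, it needs to be called once more before applying the gradients.
Another approach is to call update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) after the whole network (including all sub-networks) has been defined, so that it returns the current update_ops. In that case we need two for loops: one to define the network and one to compute the gradients.
An example is shown below: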
# Multiple GPUs
tmp, l = [], 0
for i in range(opt.gpu_num):
    r = min(l + opt.batch_split, opt.batchsize)
    with tf.device('/gpu:%d' % i), \
         tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):

        print("Setting up networks on GPU", i)
        inp_ = tf.identity(inps[l:r])
        label_ = tf.identity(labels[l:r])
        for j, val in enumerate(setup_network(inp_, label_)): # loss, pred, accuracy
            if i == 0: tmp += [[]] # [[], [], []]
            tmp[j] += [val]
    l = r

tmp += [[]]
# Calculate update_ops after the network has been defined
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) # possible batch normalization
for i in range(opt.gpu_num):
    with tf.device('/gpu:%d' % i), \
         tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):

        print("Setting up gradients on GPU", i)
        tmp[-1] += [setup_grad(optim, tmp[0][i])]
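
Why the position of the tf.get_collection call matters can be sketched in plain Python. tf.get_collection returns a copy of the collection as it exists at the moment of the call, so a list fetched before the batch normalization layers are built never contains their moving-average updates. The toy model below is not TensorFlow: UPDATE_OPS, get_collection, ToyBatchNorm and train_and_eval are hypothetical stand-ins, and the layer is a scalar batch norm without gamma and beta.

```python
import math

UPDATE_OPS = []  # hypothetical stand-in for the graph's UPDATE_OPS collection


def get_collection():
    """Return a snapshot (copy) of the update ops registered so far."""
    return list(UPDATE_OPS)


class ToyBatchNorm:
    """Scalar batch norm; registers its moving-average update on creation."""

    def __init__(self, momentum=0.99, eps=1e-3):
        self.momentum, self.eps = momentum, eps
        self.moving_mean, self.moving_var = 0.0, 1.0  # initial moving stats
        self._batch_stats = None
        UPDATE_OPS.append(self.update_moving_stats)

    def __call__(self, batch, training):
        if training:
            # Training mode: normalize with the statistics of this batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            self._batch_stats = (mean, var)
        else:
            # Inference mode: normalize with the stored moving averages.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]

    def update_moving_stats(self):
        mean, var = self._batch_stats
        m = self.momentum
        self.moving_mean = m * self.moving_mean + (1 - m) * mean
        self.moving_var = m * self.moving_var + (1 - m) * var


def train_and_eval(snapshot_before_build):
    """Run 1000 training steps, then return the largest |output| at inference."""
    UPDATE_OPS.clear()
    early = get_collection()  # taken before any layer exists, so it is empty
    bn = ToyBatchNorm()
    late = get_collection()   # taken after the layer exists
    ops = early if snapshot_before_build else late
    batch = [48.0, 50.0, 52.0, 50.0]  # activations far from mean 0, variance 1
    for _ in range(1000):
        bn(batch, training=True)
        for op in ops:  # the training step only runs the ops it was given
            op()
    return max(abs(v) for v in bn(batch, training=False))


print("snapshot before build:", train_and_eval(True))
print("snapshot after build: ", train_and_eval(False))
```

In this toy setup the early snapshot leaves the moving statistics at their initial values, so inference outputs are roughly 52 in magnitude, while training outputs and the inference outputs of the correctly updated layer stay around 1.4. That is the same kind of train/validation mismatch as in the question, although the real model may of course differ in detail.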

Added:
I have also included the setup_grad function:
def setup_grad(optim, loss):
    # `compute_gradients` will only run after update_ops have executed
    with tf.control_dependencies(update_ops):
        update_vars = None
        if opt.to_train is not None:
            update_vars = [tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=s)
                           for s in opt.to_train]
        total_loss = loss[0] + opt.seg_weight * loss[1]
        return optim.compute_gradients(total_loss, var_list=update_vars)

And then apply_gradients, for reference:
# `apply_gradients` will only run after update_ops have executed
with tf.control_dependencies(update_ops):
    if opt.clip_grad: grads = [(tf.clip_by_value(grad[0], -opt.clip_grad, opt.clip_grad), grad[1]) \
                                if grad[0] is not None else grad for grad in grads]
    train_op = optim.apply_gradients(grads, global_step=global_step)

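If the batch size on each GPU is small, batch normalization may not help performance, because TensorFlow currently does not support synchronizing batch normalization statistics across GPUs.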
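
To illustrate the small-batch remark: without synchronization, each tower normalizes with the statistics of its own slice of the batch, and the noise in a batch mean grows as the slice shrinks (roughly as 1 / sqrt(batch size) for independent samples). The sketch below uses plain Python with samples assumed to come from a standard normal distribution. batch_mean_spread is a hypothetical helper, not part of TensorFlow.

```python
import random
import statistics


def batch_mean_spread(batch_size, num_batches=2000, seed=0):
    """Standard deviation of per-batch means for samples drawn from N(0, 1)."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_batches):
        samples = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(samples) / batch_size)
    return statistics.pstdev(means)


# Smaller per-GPU batches give noisier batch statistics.
for size in (2, 8, 32, 128):
    print(size, round(batch_mean_spread(size), 3))
```

Splitting a batch of 128 across many GPUs therefore makes each tower's normalization statistics noticeably noisier than normalizing over the full batch would.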
train_flag = tf.placeholder(tf.bool, []) is what I already use.