TensorFlow L2 regularization loss keeps increasing during training

I am fine-tuning InceptionResnetV2 on TensorFlow. During training, the regularization loss keeps growing roughly linearly, and in the later stages of training it is even far larger than the cross-entropy loss. I have checked the training procedure and made sure I am optimizing the combination of the cross-entropy loss and the L2 loss.

Can anyone explain this strange behavior? Any feedback would be greatly appreciated.

Below are the code and some TensorBoard plots.

import tensorflow as tf
from tensorflow.python.platform import tf_logging as logging
from inception_resnet_v2 import inception_resnet_v2, inception_resnet_v2_arg_scope
import os
import time
from preprocessing import aug_parallel_v2
import numpy as np

slim = tf.contrib.slim

# total number of training examples
sample_num = 625020

data_path = 'iNaturalist_train.tfrecords'

# State where your log file is at. If it doesn't exist, create it.
log_dir = './log_v5'
# tensorboard visualization path
filewriter_path = './filewriter_v5_Logits'

# State where your checkpoint file is
checkpoint_file = './inception_resnet_v2_2016_08_30.ckpt'
checkpoint_save_addr = './log_v5/fine-tuning_v5.ckpt'
# State the image size you're resizing your images to. We will use the default inception size of 299.
image_size = 299

# State the number of classes to predict:
num_classes = 8142

# ================= TRAINING INFORMATION ==================
# State the number of epochs to train
num_epochs = 5

# State your batch size
batch_size = 60

# Learning rate information and configuration
initial_learning_rate = 0.0005
learning_rate_decay_factor = 0.8
num_epochs_before_decay = 2

# weight the classes inversely proportional
# to the total number of their image samples
label_count = np.loadtxt('label_count.txt', dtype=int)
multiplier = 1.0 / label_count
multiplier /= np.mean(multiplier)

def run():

    if not os.path.exists(log_dir):
        os.mkdir(log_dir)

    feature = {'train/height': tf.FixedLenFeature([], tf.int64),
               'train/width': tf.FixedLenFeature([], tf.int64),
               'train/image': tf.FixedLenFeature([], tf.string),
               'train/label': tf.FixedLenFeature([], tf.int64),
               'train/sup_label': tf.FixedLenFeature([], tf.int64),
               'train/aug_level': tf.FixedLenFeature([], tf.int64)}

    # create a list of file names
    filename_queue = tf.train.string_input_producer([data_path], num_epochs=None)
    print(filename_queue)

    reader = tf.TFRecordReader()
    _, tfrecord_serialized = reader.read(filename_queue)

    features = tf.parse_single_example(tfrecord_serialized, features=feature)

    # Convert the image data from string back to the numbers
    height = tf.cast(features['train/height'], tf.int64)
    width = tf.cast(features['train/width'], tf.int64)

    # change this line for your TFrecord version
    tf_image = tf.image.decode_jpeg(features['train/image'])

    tf_label = tf.cast(features['train/label'], tf.int32)
    aug_level = tf.cast(features['train/aug_level'], tf.int32)
    # tf_sup_label = tf.cast(features['train/sup_label'], tf.int64)

    tf_image = tf.reshape(tf_image, tf.stack([height, width, 3]))
    tf_label = tf.reshape(tf_label, [1])
    aug_level = tf.reshape(aug_level, [1])

    resized_image = tf.image.resize_images(images=tf_image, size=tf.constant([400, 400]), method=2)
    resized_image = tf.cast(resized_image, tf.uint8)
    tf_images, tf_labels, tf_aug = tf.train.shuffle_batch([resized_image, tf_label, aug_level], batch_size=batch_size,
                                                      capacity=2048, num_threads=16, allow_smaller_final_batch=False,
                                                      min_after_dequeue=256)


    tf.logging.set_verbosity(tf.logging.INFO)  # Set the verbosity to INFO level

    IMAGE_HEIGHT = 299
    IMAGE_WIDTH = 299

    images = tf.placeholder(dtype=tf.float32, shape=[None, IMAGE_HEIGHT, IMAGE_WIDTH, 3])
    labels = tf.placeholder(dtype=tf.int32, shape=[None, 1])
    weighted_level = tf.placeholder(dtype=tf.float32, shape=[None, 1])

    # Compute the number of batches per epoch and the number of steps before decaying the learning rate
    num_batches_per_epoch = int(sample_num / batch_size)
    num_steps_per_epoch = num_batches_per_epoch  # Because one step is one batch processed
    decay_steps = int(num_epochs_before_decay * num_steps_per_epoch)

    # Create the model inference
    with slim.arg_scope(inception_resnet_v2_arg_scope()):
        logits, end_points = inception_resnet_v2(images, num_classes=num_classes, is_training=True)

    # Define the scopes that you want to exclude for restoration
    exclude = ['InceptionResnetV2/Logits', 'InceptionResnetV2/AuxLogits']
    variables_to_restore = slim.get_variables_to_restore(exclude=exclude)

    print("label test")
    print(labels)
    print(logits)

    # Perform one-hot encoding of the labels
    one_hot_labels = tf.squeeze(tf.one_hot(labels, num_classes), [1])

    print(one_hot_labels)
    print(logits)

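    # Scale each one-hot row by its inverse-frequency class weight so rare classes count more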
    weighted_onehot = tf.multiply(one_hot_labels, weighted_level)

    # Equivalent to tf.nn.softmax_cross_entropy_with_logits, but with extra checks
    digits_loss = tf.losses.softmax_cross_entropy(onehot_labels=weighted_onehot, logits=logits)

    reg_loss = tf.losses.get_regularization_loss()

    total_loss = digits_loss + reg_loss

    # Create the global step used by the optimizer and the learning rate decay
    global_step = tf.train.get_or_create_global_step()

    # Define your exponentially decaying learning rate
    lr = tf.train.exponential_decay(
        learning_rate=initial_learning_rate,
        global_step=global_step,
        decay_steps=decay_steps,
        decay_rate=learning_rate_decay_factor,
        staircase=True)

    # train_vars = []
    # Now we can define the optimizer that takes on the learning rate
    train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                          "InceptionResnetV2/Logits")

    # RMSProp or Adam

    optimizer = tf.train.AdamOptimizer(learning_rate=lr)

    # Create the train_op.
    train_op = slim.learning.create_train_op(total_loss, optimizer, variables_to_train=train_vars)

    predictions = tf.argmax(end_points['Predictions'], 1)
    probabilities = end_points['Predictions']
    # tf.metrics.accuracy expects (labels, predictions) with matching shapes
    accuracy, accuracy_update = tf.metrics.accuracy(labels=tf.squeeze(labels, axis=1),
                                                    predictions=predictions)
    metrics_op = tf.group(accuracy_update, probabilities)

    tf.summary.scalar('losses/Reg_Loss', reg_loss)
    tf.summary.scalar('losses/Digit_Loss', digits_loss)
    tf.summary.scalar('losses/Total_Loss', total_loss)
    tf.summary.scalar('accuracy', accuracy)
    tf.summary.scalar('learning_rate', lr)
    writer = tf.summary.FileWriter(filewriter_path)
    writer.add_graph(tf.get_default_graph())

    my_summary_op = tf.summary.merge_all()

    def train_step(sess, train_op, global_step, imgs, lbls, weight):
        '''
        Simply runs a session for the three arguments provided and gives a logging on the time elapsed
        for each global step
        '''
        # Check the time for each sess run
        start_time = time.time()

        total_loss, global_step_count, _ = sess.run([train_op, global_step, metrics_op],
                                                    feed_dict={images: imgs, labels: lbls, weighted_level: weight})

        time_elapsed = time.time() - start_time

        # Log the loss and the time elapsed for this step
        logging.info('global step %s: total_loss: %.4f (%.2f sec/step)',
                     global_step_count, total_loss, time_elapsed)

        return total_loss, global_step_count

    saver_pretrain = tf.train.Saver(variables_to_restore)
    saver_train = tf.train.Saver(train_vars)

    with tf.Session() as sess:

        init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
        sess.run(init_op)

        # Create a coordinator and run all QueueRunner objects
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)

        saver_pretrain.restore(sess, checkpoint_file)

        start_time = time.time()

        for step in range(int(num_steps_per_epoch * num_epochs)):

            imgs, lbls, augs = sess.run([tf_images, tf_labels, tf_aug])

            imgs, lbls = aug_parallel_v2(imgs, lbls, augs)

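            # take the central 299x299 crop of the 400x400 resized batch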
            imgs = imgs[:, 50:349, 50:349, :]

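            # map pixels into the [-1, 1] range Inception expects; note this assumes
            # aug_parallel_v2 returns values in [0, 1] -- if it returns [0, 255], divide by 255 first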
            imgs = 2*(imgs.astype(np.float32)) - 1

            lbls = lbls.astype(np.int32)

            weight = multiplier[lbls]

            weight = np.array(weight).reshape((batch_size, 1))

            # print(imgs[0, 0:10, 0:10, 0:2])

            if step % num_batches_per_epoch == 0:
                logging.info('Epoch %s/%s', step // num_batches_per_epoch + 1, num_epochs)

                learning_rate_value, accuracy_value = sess.run([lr, accuracy],
                                                feed_dict={images: imgs, labels: lbls, weighted_level: weight})

                logging.info('Current Learning Rate: %s', learning_rate_value)
                logging.info('Current Streaming Accuracy: %s', accuracy_value)

                # optionally, print your logits and predictions for a sanity check that things are going fine.
                logits_value, probabilities_value, predictions_value, labels_value = sess.run(
                    [logits, probabilities, predictions, labels],
                    feed_dict={images: imgs, labels: lbls, weighted_level: weight})

                print('logits: \n', logits_value)

                print('Probabilities: \n', probabilities_value)

                print('predictions: \n', predictions_value)

                print('Labels:\n:', labels_value)

            # Log the summaries every 20 steps.
            if step % 20 == 0:

                loss, global_step_count = train_step(sess, train_op, global_step, imgs, lbls, weight)

                summaries = sess.run(my_summary_op, feed_dict={images: imgs, labels: lbls, weighted_level: weight})

                writer.add_summary(summaries, global_step_count)
                # sess.summary_computed(sess, summaries)

            # If not, simply run the training step

            else:
                loss, _ = train_step(sess, train_op, global_step, imgs, lbls, weight)

            if step % 2000 == 0:

                logging.info('Saving model to disk now.')
                saver_train.save(sess, checkpoint_save_addr, global_step=global_step)

            print('one batch time: ', time.time() - start_time)

            start_time = time.time()

        # We log the final training loss and accuracy
        logging.info('Final Loss: %s', loss)
        logging.info('Final Accuracy: %s', sess.run(accuracy))

        # Once all the training has been done, save the log files and checkpoint model
        logging.info('Finished training! Saving model to disk now.')
        saver_train.save(sess, checkpoint_save_addr, global_step=global_step)

        # Stop the threads
        coord.request_stop()

        # Wait for threads to stop
        coord.join(threads)
        sess.close()

if __name__ == '__main__':
    run()
I am new here and don't have enough reputation to post images. Here are two links to the accuracy and loss plots; you can easily see that the regularization loss is dominating.


This is a difficult question to answer definitively, but I can give you some suggestions.

Generally speaking, when you minimize digits_loss, that is, when you fit your model to your data, you slowly change the weights in the layers. To guard against potential overfitting, an L2 regularization loss (the sum of the squares of all weights, reg_loss in your code) is generally added to the overall loss (total_loss). These two forces generally act against each other, and if the balance is right, you can train a good model.
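To make that balance concrete, here is a minimal TF 1.x sketch with a single hypothetical weight variable (not the author's network) showing how a slim-style L2 regularizer ends up in the loss:

import tensorflow as tf

# One toy variable with an L2 regularizer attached -- the same mechanism
# slim's arg_scope wires up for every conv weight in the network.
weight_decay = 0.00004
w = tf.get_variable('w', shape=[10, 5],
                    regularizer=tf.contrib.layers.l2_regularizer(weight_decay))

# get_regularization_loss() sums everything in the REGULARIZATION_LOSSES
# collection; for this variable that is weight_decay * tf.nn.l2_loss(w).
reg_loss = tf.losses.get_regularization_loss()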

In your case, you are taking a network developed for 1001 classes (inception_resnet_v2) and trying to predict 8142 classes. There is no problem with that per se, but you are upsetting the balance. So I believe you need to override the default weight decay of 0.00004 of inception_resnet_v2 with some higher value, on this line (note that for a 10x increase there are only 3 zeros in the decimal):

with slim.arg_scope(inception_resnet_v2_arg_scope(weight_decay=0.0004)):

A higher weight_decay parameter will force the L2 loss to decrease faster. The problem is that this number is just a guess; I have no idea what the ideal value would be. You need to experiment with multiple values and figure it out.
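If you want to see where the regularization loss is coming from, one option is to split reg_loss by scope in TensorBoard, which would show whether the growth is driven by the freshly initialized Logits weights or by the frozen backbone. A diagnostic sketch, assuming the regularization-loss ops inherit the variable scope names from the script above:

# Split the regularization loss by scope. The 'Logits' scope name is taken
# from the training script; this assumes slim's regularizer ops keep it.
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
logits_reg = tf.add_n([l for l in reg_losses if 'Logits' in l.name])
backbone_reg = tf.add_n([l for l in reg_losses if 'Logits' not in l.name])
tf.summary.scalar('losses/Reg_Loss_Logits', logits_reg)
tf.summary.scalar('losses/Reg_Loss_Backbone', backbone_reg)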

I have tried your approach with a larger weight decay. reg_loss shoots up quickly and then decreases slowly and gradually, but the weight decay may be too large: my model's performance (accuracy) stays very low. Do you know why reg_loss keeps increasing throughout training?

So basically, as your model takes shape during the first few hundred steps after initialization, reg_loss has to increase first. Some weights grow and some decay, and instead of the original normal distribution you now have a rather arbitrary distribution of weights, so it is perfectly normal for it to increase at the beginning. Unfortunately, you do have to experiment to find out which weight_decay value gives the best results; this is called hyperparameter exploration. If you have more hardware available, you can run several copies with different values at the same time.

I understand that the L2 loss would increase if we were training from scratch, but I am fine-tuning InceptionResnetV2, whose variables are already reasonable. I would not expect the L2 loss to keep increasing over thousands of iterations. The strangest thing is that digits_loss also increases at the beginning. Maybe there is a bug in my code...

You are training the last layer from scratch. Or your learning rate is too high. But yes, you may also have a bug of some other nature. Welcome to deep learning... :)

Hahaha, thank you, sir! I tried several weight_decay values and also tried fine-tuning all the variables together, but could not solve the problem. digits_loss first increases and then slowly decreases, while reg_loss first decreases and then increases linearly over time. The linearity is so weird and annoying; it completely blocks the optimization of digits_loss. After more than 20,000 iterations, reg_loss is three times digits_loss...
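For the hyperparameter exploration suggested above, a minimal sketch of a weight_decay sweep; the weight_decay argument to run() is an assumed refactor, since the script above hard-codes the arg_scope:

# Hypothetical sweep over candidate weight_decay values. This assumes run()
# has been refactored to accept a weight_decay argument and forward it to
# inception_resnet_v2_arg_scope().
for wd in [0.00004, 0.0004, 0.004]:
    tf.reset_default_graph()  # start each trial from a fresh graph
    run(weight_decay=wd)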