TensorFlow-Slim trains well, but always predicts the same class when evaluating


machine-learning tensorflow computer-vision deep-learning


I followed the tutorial at the link above to build an image classifier.

Training code:

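import os
import time

import tensorflow as tf  # TensorFlow 1.x (tf.contrib does not exist in 2.x)
from tensorflow.contrib.framework.python.ops.variables import get_or_create_global_step
from tensorflow.python.platform import tf_logging as logging

# Local modules from the TF-Slim models repository (assumed to be on the Python path)
import inception_preprocessing
from inception_resnet_v2 import inception_resnet_v2, inception_resnet_v2_arg_scope

slim = tf.contrib.slim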

dataset_dir = './data'
log_dir = './log'
checkpoint_file = './inception_resnet_v2_2016_08_30.ckpt'
image_size = 299
num_classes = 21
labels_file = './labels.txt'
labels = open(labels_file, 'r')
labels_to_name = {}
for line in labels:
    label, string_name = line.split(':')
    string_name = string_name[:-1]
    labels_to_name[int(label)] = string_name

file_pattern = 'test_%s_*.tfrecord'

items_to_descriptions = {
    'image': 'A 3-channel RGB coloured product image',
    'label': 'A label that from 20 labels'
}

num_epochs = 10
batch_size = 16
initial_learning_rate = 0.001
learning_rate_decay_factor = 0.7
num_epochs_before_decay = 4

def get_split(split_name, dataset_dir, file_pattern=file_pattern, file_pattern_for_counting='products'):
    if split_name not in ['train', 'validation']:
        raise ValueError(
            'The split_name %s is not recognized. Please input either train or validation as the split_name' % (
            split_name))

    file_pattern_path = os.path.join(dataset_dir, file_pattern % (split_name))

    num_samples = 0
    file_pattern_for_counting = file_pattern_for_counting + '_' + split_name
    tfrecords_to_count = [os.path.join(dataset_dir, file) for file in os.listdir(dataset_dir) if
                          file.startswith(file_pattern_for_counting)]
    for tfrecord_file in tfrecords_to_count:
        for record in tf.python_io.tf_record_iterator(tfrecord_file):
            num_samples += 1

    test = num_samples

    reader = tf.TFRecordReader

    keys_to_features = {
        'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
        'image/format': tf.FixedLenFeature((), tf.string, default_value='jpg'),
        'image/class/label': tf.FixedLenFeature(
            [], tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
    }

    items_to_handlers = {
        'image': slim.tfexample_decoder.Image(),
        'label': slim.tfexample_decoder.Tensor('image/class/label'),
    }

    decoder = slim.tfexample_decoder.TFExampleDecoder(keys_to_features, items_to_handlers)

    labels_to_name_dict = labels_to_name

    dataset = slim.dataset.Dataset(
        data_sources=file_pattern_path,
        decoder=decoder,
        reader=reader,
        num_readers=4,
        num_samples=num_samples,
        num_classes=num_classes,
        labels_to_name=labels_to_name_dict,
        items_to_descriptions=items_to_descriptions)

    return dataset

def load_batch(dataset, batch_size, height=image_size, width=image_size, is_training=True):
    '''
    Loads a batch for training.

    INPUTS:
    - dataset(Dataset): a Dataset class object that is created from the get_split function
    - batch_size(int): determines how big of a batch to train
    - height(int): the height of the image to resize to during preprocessing
    - width(int): the width of the image to resize to during preprocessing
    - is_training(bool): to determine whether to perform a training or evaluation preprocessing

    OUTPUTS:
    - images(Tensor): a Tensor of the shape (batch_size, height, width, channels) that contain one batch of images
    - labels(Tensor): the batch's labels with the shape (batch_size,) (requires one_hot_encoding).

    '''
    # First create the data_provider object
    data_provider = slim.dataset_data_provider.DatasetDataProvider(
        dataset,
        common_queue_capacity=24 + 3 * batch_size,
        common_queue_min=24)

    # Obtain the raw image using the get method
    raw_image, label = data_provider.get(['image', 'label'])

    # Perform the correct preprocessing for this image depending if it is training or evaluating
    image = inception_preprocessing.preprocess_image(raw_image, height, width, is_training)

    # As for the raw images, we just do a simple reshape to batch it up
    raw_image = tf.expand_dims(raw_image, 0)
    raw_image = tf.image.resize_nearest_neighbor(raw_image, [height, width])
    raw_image = tf.squeeze(raw_image)

    # Batch up the image by enqueing the tensors internally in a FIFO queue and dequeueing many elements with tf.train.batch.
    images, raw_images, labels = tf.train.batch(
        [image, raw_image, label],
        batch_size=batch_size,
        num_threads=4,
        capacity=4 * batch_size,
        allow_smaller_final_batch=True)

    return images, raw_images, labels


def run():
    # Create the log directory here. Must be done here otherwise import will activate this unneededly.
    if not os.path.exists(log_dir):
        os.mkdir(log_dir)

    # ======================= TRAINING PROCESS =========================
    # Now we start to construct the graph and build our model
    with tf.Graph().as_default() as graph:
        tf.logging.set_verbosity(tf.logging.INFO)  # Set the verbosity to INFO level

        # First create the dataset and load one batch
        dataset = get_split('train', dataset_dir, file_pattern=file_pattern)
        images, _, labels = load_batch(dataset, batch_size=batch_size)

        # Know the number steps to take before decaying the learning rate and batches per epoch
        num_batches_per_epoch = int(dataset.num_samples / batch_size)
        num_steps_per_epoch = num_batches_per_epoch  # Because one step is one batch processed
        decay_steps = int(num_epochs_before_decay * num_steps_per_epoch)

        # Create the model inference
        with slim.arg_scope(inception_resnet_v2_arg_scope()):
            logits, end_points = inception_resnet_v2(images, num_classes=dataset.num_classes, is_training=True)

        # Define the scopes that you want to exclude for restoration
        exclude = ['InceptionResnetV2/Logits', 'InceptionResnetV2/AuxLogits']
        variables_to_restore = slim.get_variables_to_restore(exclude=exclude)

        # Perform one-hot-encoding of the labels (Try one-hot-encoding within the load_batch function!)
        one_hot_labels = slim.one_hot_encoding(labels, dataset.num_classes)

        # Performs the equivalent to tf.nn.sparse_softmax_cross_entropy_with_logits but enhanced with checks
        loss = tf.losses.softmax_cross_entropy(onehot_labels=one_hot_labels, logits=logits)
        total_loss = tf.losses.get_total_loss()  # obtain the regularization losses as well

        # Create the global step for monitoring the learning_rate and training.
        global_step = get_or_create_global_step()

        # Define your exponentially decaying learning rate
        lr = tf.train.exponential_decay(
            learning_rate=initial_learning_rate,
            global_step=global_step,
            decay_steps=decay_steps,
            decay_rate=learning_rate_decay_factor,
            staircase=True)

        # Now we can define the optimizer that takes on the learning rate
        optimizer = tf.train.AdamOptimizer(learning_rate=lr)

        # Create the train_op.
        train_op = slim.learning.create_train_op(total_loss, optimizer)

        # State the metrics that you want to predict. We get a predictions that is not one_hot_encoded.
        predictions = tf.argmax(end_points['Predictions'], 1)
        probabilities = end_points['Predictions']
        accuracy, accuracy_update = tf.contrib.metrics.streaming_accuracy(predictions, labels)
        metrics_op = tf.group(accuracy_update, probabilities)

        # Now finally create all the summaries you need to monitor and group them into one summary op.
        tf.summary.scalar('losses/Total_Loss', total_loss)
        tf.summary.scalar('accuracy', accuracy)
        tf.summary.scalar('learning_rate', lr)
        my_summary_op = tf.summary.merge_all()

        # Now we need to create a training step function that runs both the train_op, metrics_op and updates the global_step concurrently.
        def train_step(sess, train_op, global_step):
            '''
            Simply runs a session for the three arguments provided and gives a logging on the time elapsed for each global step
            '''
            # Check the time for each sess run
            start_time = time.time()
            total_loss, global_step_count, _ = sess.run([train_op, global_step, metrics_op])
            time_elapsed = time.time() - start_time

            # Run the logging to print some results
            logging.info('global step %s: loss: %.4f (%.2f sec/step)', global_step_count, total_loss, time_elapsed)

            return total_loss, global_step_count

        # Now we create a saver function that actually restores the variables from a checkpoint file in a sess
        saver = tf.train.Saver(variables_to_restore)

        def restore_fn(sess):
            return saver.restore(sess, checkpoint_file)

        # Define your supervisor for running a managed session. Do not run the summary_op automatically or else it will consume too much memory
        sv = tf.train.Supervisor(logdir=log_dir, summary_op=None, init_fn=restore_fn)

        # Run the managed session
        with sv.managed_session() as sess:
            for step in xrange(num_steps_per_epoch * num_epochs):
                # At the start of every epoch, show the vital information:
                if step % num_batches_per_epoch == 0:
                    logging.info('Epoch %s/%s', step / num_batches_per_epoch + 1, num_epochs)
                    learning_rate_value, accuracy_value = sess.run([lr, accuracy])
                    logging.info('Current Learning Rate: %s', learning_rate_value)
                    logging.info('Current Streaming Accuracy: %s', accuracy_value)

                    # optionally, print your logits and predictions for a sanity check that things are going fine.
                    logits_value, probabilities_value, predictions_value, labels_value = sess.run(
                        [logits, probabilities, predictions, labels])
                    print 'logits: \n', logits_value
                    print 'Probabilities: \n', probabilities_value
                    print 'predictions: \n', predictions_value
                    print 'Labels:\n:', labels_value

                # Log the summaries every 10 step.
                if step % 10 == 0:
                    loss, _ = train_step(sess, train_op, sv.global_step)
                    summaries = sess.run(my_summary_op)
                    sv.summary_computed(sess, summaries)

                # If not, simply run the training step
                else:
                    loss, _ = train_step(sess, train_op, sv.global_step)

            # We log the final training loss and accuracy
            logging.info('Final Loss: %s', loss)
            logging.info('Final Accuracy: %s', sess.run(accuracy))

            # Once all the training has been done, save the log files and checkpoint model
            logging.info('Finished training! Saving model to disk now.')
            sv.saver.save(sess, sv.save_path, global_step=sv.global_step)

This code seems to work well: I trained it on some sample data and reached 94% accuracy.
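For reference, the staircase learning-rate schedule configured in the training code can be worked out by hand. The sketch below mirrors the formula used by `tf.train.exponential_decay` with `staircase=True`. The sample count of 420 (21 classes times 20 images) is an assumption taken from the numbers mentioned elsewhere in this thread, not a value read from the TFRecords.

```python
def staircase_lr(step, initial_lr, decay_rate, decay_steps):
    """Learning rate at a given global step for a staircase exponential decay."""
    # Integer division makes the rate drop in discrete jumps every decay_steps.
    return initial_lr * decay_rate ** (step // decay_steps)


# Values from the training script, except num_samples (assumed: 21 classes x 20 images).
num_samples = 420
batch_size = 16
num_epochs = 10
num_epochs_before_decay = 4

steps_per_epoch = num_samples // batch_size            # 26 batches per epoch
decay_steps = num_epochs_before_decay * steps_per_epoch  # 104 steps between decays
total_steps = steps_per_epoch * num_epochs               # 260 steps overall

for step in (0, decay_steps, 2 * decay_steps, total_steps - 1):
    print(step, staircase_lr(step, 0.001, 0.7, decay_steps))
```

Under these assumptions the whole run is only 260 optimizer steps, which is worth keeping in mind when reading the batch-normalization discussion further down.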

Evaluation code:

log_dir = './log'
log_eval = './log_eval_test'
dataset_dir = './data'
batch_size = 10
num_epochs = 1

checkpoint_file = tf.train.latest_checkpoint('./')


def run():
    if not os.path.exists(log_eval):
        os.mkdir(log_eval)
    with tf.Graph().as_default() as graph:
        tf.logging.set_verbosity(tf.logging.INFO)
        dataset = get_split('train', dataset_dir)
        images, raw_images, labels = load_batch(dataset, batch_size=batch_size, is_training=False)

        num_batches_per_epoch = dataset.num_samples / batch_size
        num_steps_per_epoch = num_batches_per_epoch

        with slim.arg_scope(inception_resnet_v2_arg_scope()):
            logits, end_points = inception_resnet_v2(images, num_classes=dataset.num_classes, is_training=False)

        variables_to_restore = slim.get_variables_to_restore()
        saver = tf.train.Saver(variables_to_restore)

        def restore_fn(sess):
            return saver.restore(sess, checkpoint_file)

        predictions = tf.argmax(end_points['Predictions'], 1)
        accuracy, accuracy_update = tf.contrib.metrics.streaming_accuracy(predictions, labels)
        metrics_op = tf.group(accuracy_update)

        global_step = get_or_create_global_step()
        global_step_op = tf.assign(global_step, global_step + 1)

        def eval_step(sess, metrics_op, global_step):
            '''
            Simply takes in a session, runs the metrics op and some logging information.
            '''
            start_time = time.time()
            _, global_step_count, accuracy_value = sess.run([metrics_op, global_step_op, accuracy])
            time_elapsed = time.time() - start_time

            logging.info('Global Step %s: Streaming Accuracy: %.4f (%.2f sec/step)', global_step_count, accuracy_value,
                         time_elapsed)

            return accuracy_value

        tf.summary.scalar('Validation_Accuracy', accuracy)
        my_summary_op = tf.summary.merge_all()

        sv = tf.train.Supervisor(logdir=log_eval, summary_op=None, saver=None, init_fn=restore_fn)

        with sv.managed_session() as sess:
            for step in xrange(num_steps_per_epoch * num_epochs):
                sess.run(sv.global_step)
                if step % num_batches_per_epoch == 0:
                    logging.info('Epoch: %s/%s', step / num_batches_per_epoch + 1, num_epochs)
                    logging.info('Current Streaming Accuracy: %.4f', sess.run(accuracy))

                if step % 10 == 0:
                    eval_step(sess, metrics_op=metrics_op, global_step=sv.global_step)
                    summaries = sess.run(my_summary_op)
                    sv.summary_computed(sess, summaries)


                else:
                    eval_step(sess, metrics_op=metrics_op, global_step=sv.global_step)

            logging.info('Final Streaming Accuracy: %.4f', sess.run(accuracy))

            raw_images, labels, predictions = sess.run([raw_images, labels, predictions])
            for i in range(10):
                image, label, prediction = raw_images[i], labels[i], predictions[i]
                prediction_name, label_name = dataset.labels_to_name[prediction], dataset.labels_to_name[label]
                text = 'Prediction: %s \n Ground Truth: %s' % (prediction_name, label_name)
                img_plot = plt.imshow(image)

                plt.title(text)
                img_plot.axes.get_yaxis().set_ticks([])
                img_plot.axes.get_xaxis().set_ticks([])
                plt.show()

            logging.info(
                'Model evaluation has completed! Visit TensorBoard for more information regarding your evaluation.')

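So after training the model and reaching 94% accuracy, I tried to evaluate it. During evaluation I consistently get 0-1% accuracy. I looked into this and found that the same class is predicted every time: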

labels: [7, 11, 5, 1, 20, 0, 18, 1, 0, 7]
predictions: [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
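A quick way to confirm this kind of collapse is to count how often each class is predicted. The following is a minimal sketch using the ten labels and predictions printed above; the helper name `summarize` is hypothetical and not part of the tutorial code.

```python
from collections import Counter


def summarize(labels, predictions):
    """Return the accuracy and the number of times each class was predicted."""
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return {
        'accuracy': correct / float(len(labels)),
        'predicted_class_counts': Counter(predictions),
    }


labels = [7, 11, 5, 1, 20, 0, 18, 1, 0, 7]
predictions = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]

report = summarize(labels, predictions)
print(report['accuracy'])
print(report['predicted_class_counts'])
```

A single key in `predicted_class_counts` means the network outputs one class regardless of input, which points at the restored model or its normalization statistics rather than at a few misclassified images.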

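Can anyone help me figure out what I might be doing wrong?

Edit:

TensorBoard accuracy and loss from training

TensorBoard accuracy from evaluation

Edit:

I still haven't been able to solve this. I thought there might be a problem with how I restore the graph in the eval script, so I tried restoring the model with this: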

saver = tf.train.import_meta_graph('/log/model.ckpt.meta')

def restore_fn(sess):
    return saver.restore(sess, checkpoint_file)

instead of

variables_to_restore = slim.get_variables_to_restore()
saver = tf.train.Saver(variables_to_restore)

def restore_fn(sess):
    return saver.restore(sess, checkpoint_file)

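It just took a very long time to start and eventually ended with an error. I then tried using V1 of the writer in the saver:

saver = tf.train.Saver(variables_to_restore, write_version=saver_pb2.SaverDef.V1)

and retrained, but this checkpoint could not be loaded at all, because it reported missing variables.

I also tried running my eval script on the same data the model was trained on, just to see whether that would give different results, but I got the same outcome.

Finally, I re-cloned the repo from the URL and ran a training with the same dataset used in the tutorial. Even though training reached 84%, I get 0-3% accuracy when evaluating. Also, my checkpoints must contain the right information, because when I restart training the accuracy continues from where it left off. It feels like I am not doing something correctly when I restore the model.

If I understand correctly, you want to build a 21-class classifier. Maybe your code is correct, but you did not split your data correctly. You should check whether all classes are represented in your training data.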

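If your training data comes from only one class (perhaps you took a very small sample of the data for experimenting and only picked up images of class 10), you will get similar results: high accuracy in training, but at prediction time the classifier will only predict class 10, giving a test accuracy close to zero.

I finally managed to solve my problem. It sounds strange, but the is_training parameter used when building the model needs to be set to False in both the training script and the eval script, or to True in both. This is because batch normalization behaves differently when is_training is False: it stops using the statistics of the current batch and uses the stored moving averages instead.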
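To see why the two modes can disagree so badly, here is a minimal pure-Python sketch of what a batch-norm layer computes in each mode (scale and shift omitted). The activation values and `eps` are made up for illustration; the moving mean of 0 and moving variance of 1 are the values these statistics start from before any update.

```python
import math


def batch_norm(batch, moving_mean, moving_var, is_training, eps=1e-3):
    """Normalize a list of floats the way a batch-norm layer would."""
    if is_training:
        # Training mode: use the statistics of the current batch.
        mean = sum(batch) / len(batch)
        var = sum((x - mean) ** 2 for x in batch) / len(batch)
    else:
        # Inference mode: use the stored moving averages.
        mean, var = moving_mean, moving_var
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


# Hypothetical activations whose true mean is far from the initial moving mean.
activations = [48.0, 50.0, 52.0, 50.0]

train_out = batch_norm(activations, 0.0, 1.0, is_training=True)
eval_out = batch_norm(activations, 0.0, 1.0, is_training=False)

print(train_out)  # centred around zero
print(eval_out)   # all large and positive, because the moving stats are stale
```

If the stored moving averages do not match the data, every layer receives inputs on a completely different scale at inference time, and a network that looked fine during training can collapse to a single output class.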

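The is_training behaviour can be verified in this thread on the tensorflow/tensorflow GitHub repository,

and also in this TF-Slim walkthrough Jupyter notebook.

If you scroll to the end of the page, to the section titled "Apply fine tuned model to some images", you will see a code block that shows how to reload a fine-tuned, pre-trained model. Where they load the model, you will see this line together with its explanatory comment: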

# Create the model, use the default arg scope to configure the batch norm parameters.
with slim.arg_scope(inception.inception_v1_arg_scope()):
    logits, _ = inception.inception_v1(images,
                                       num_classes=dataset.num_classes,
                                       is_training=True)

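Although this is Inception_v1, the principle is the same. It shows that setting is_training to False in both scripts, or to True in both, works, but you cannot set it to different values without editing the inception_resnet_v2.py code in slim.

Actually, this problem is caused by how the BN statistics are updated. By default, TF does not update the moving mean and variance parameters.

From the official API documentation:

Note: when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. Also, be sure to add any batch_normalization ops before getting the update_ops collection. Otherwise, update_ops will be empty, and training/inference will not work properly.

Solution: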

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
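The update ops mentioned above apply an exponential moving average to the stored mean and variance. The sketch below, in plain Python, shows how far the stored mean gets after a given number of updates. The decay of 0.9997 is an assumption about the default used by the slim Inception arg scopes, and 260 steps and a true mean of 50 are illustrative values, not measurements.

```python
def update_moving_average(moving, batch_value, decay):
    """One batch-norm style update: blend the stored value with the batch value."""
    return decay * moving + (1.0 - decay) * batch_value


def moving_mean_after(steps, true_mean, decay, initial=0.0):
    """Stored mean after `steps` updates when every batch has mean `true_mean`."""
    moving = initial
    for _ in range(steps):
        moving = update_moving_average(moving, true_mean, decay)
    return moving


true_mean = 50.0  # hypothetical mean of a layer's activations

# Update ops never run: the stored mean stays at its initial value.
print(moving_mean_after(0, true_mean, decay=0.9997))

# A short fine-tuning run with a slow decay (assumed 0.9997): still far from 50.
print(moving_mean_after(260, true_mean, decay=0.9997))

# The same number of steps with a faster decay converges almost fully.
print(moving_mean_after(260, true_mean, decay=0.9))
```

Either way (updates missing, or too few updates for a slow decay), the stored statistics stay close to their initial values, which matches the symptom of a model that only works with is_training=True.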

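Thanks for your reply. You are correct, I am creating a 21-class classifier. My dataset has the same number of images for each class. Also, when I run training, the code prints labels versus predictions, which shows that the model is fitting multiple classes. Example from training: predictions: [17 13 7 6 13 20 19 3 15 0 18 15 10 11 19 3] labels: [17 13 7 6 13 20 19 3 15 0 18 15 10 11 19 3]

How much data do you have? How much of it did you feed into the model? Can you post your results for each epoch as you reached 94% accuracy?

Are you sure you are loading the correct model? I would try setting checkpoint_file="…" to the full path used in the training code.

I am working with a small portion of the actual data, 20 images per class. Unfortunately I no longer have the terminal output showing the predictions, but I have added graphs from TensorBoard showing accuracy and loss. With the full path in checkpoint_file, the result is still the same :(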