Why do the metrics computed by model.evaluate() differ from the metrics tracked during training in Keras?
I am using Keras 2.0.4 (TensorFlow backend) for an image classification task (based on pretrained models). During training/fine-tuning, I use CSVLogger to track all metrics in use (e.g. categorical_accuracy, categorical_crossentropy), including the corresponding metrics for the validation set (i.e. val_categorical_accuracy, val_categorical_crossentropy).
With the ModelCheckpoint callback I keep track of the best configuration of weights (save_best_only=True). To evaluate the model on the validation set, I use model.evaluate().
My expectation: the metrics tracked by CSVLogger (for the "best" epoch) are equal to the metrics computed by model.evaluate().
Unfortunately, this is not the case. The metrics differ by about ±5%.
Is there a reason for this behavior?
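To make the comparison concrete, here is a minimal sketch (Python 3, standard library only) of how the "best" epoch could be pulled out of a CSVLogger file so that its values can be set next to the output of model.evaluate(). The helper name best_epoch_row and the log contents are made up for illustration; only the comma-separated layout with a header row is assumed to match what CSVLogger writes.

```python
import csv
import io


def best_epoch_row(csv_file, monitor="val_loss"):
    """Return the logged row (as a dict) with the lowest value of `monitor`."""
    rows = list(csv.DictReader(csv_file))
    return min(rows, key=lambda row: float(row[monitor]))


# Made-up log contents (hypothetical numbers, not results from the post).
example_log = io.StringIO(
    "epoch,loss,val_loss\n"
    "0,1.61,1.58\n"
    "1,1.40,1.31\n"
    "2,1.22,1.37\n"
)

best = best_epoch_row(example_log)
print(best["epoch"], best["val_loss"])
```

With a real log, the file would be opened with open(...) instead of io.StringIO, and the returned values compared against the list returned by model.evaluate(), whose order follows model.metrics_names.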
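EDIT:
After some testing, I gained some insights:
If I do not use generators for the training and validation data (and therefore do not use model.fit_generator()), the problem does not occur. --> Using ImageDataGenerator for the training and validation data is the source of the discrepancy. (Note that I do not use a generator when computing the evaluation, but I do use the same validation data, at least if ImageDataGenerator works as expected.)
I suspect that ImageDataGenerator is not working correctly.
If I do not use generators at all, the problem does not occur. That is, the metrics tracked by CSVLogger (for the "best" epoch) are equal to the metrics computed by model.evaluate().
Interestingly, there is another problem: if you use the same data for training and validation, the training metrics (e.g. loss) and the validation metrics (e.g. val_loss) still differ at the end of each epoch.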
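The assumption that the generator yields exactly the same validation samples as the array passed to model.evaluate() can be checked directly. The following is only a sketch with NumPy: the helper same_samples and the random toy arrays are illustrative and not part of the original code. It compares two batches while ignoring sample order, since the generators here are created with shuffle=True.

```python
import numpy as np


def same_samples(a, b, atol=1e-6):
    """Return True if batches `a` and `b` contain the same samples, in any order."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        return False
    # Flatten every sample to one row.
    a = a.reshape(a.shape[0], -1)
    b = b.reshape(b.shape[0], -1)
    # Sort rows lexicographically so that the sample order does not matter.
    # (Near-identical samples could still be ordered differently; this is a
    # quick diagnostic, not a rigorous set comparison.)
    a_sorted = a[np.lexsort(a.T[::-1])]
    b_sorted = b[np.lexsort(b.T[::-1])]
    return bool(np.allclose(a_sorted, b_sorted, atol=atol))


rng = np.random.RandomState(1337)
x_val = rng.rand(8, 4, 4, 3)              # stand-in for X_val
from_generator = x_val[rng.permutation(8)]  # stand-in for generator output

print(same_samples(x_val, from_generator))          # same data, other order
print(same_samples(x_val, from_generator / 255.0))  # extra rescaling applied
```

In the real setup, the second argument would be the image batches collected from validation_generator (for example via next(validation_generator)), concatenated until every validation image has been seen once. A mismatch would point to a difference in loading, resizing, or preprocessing rather than to the metric computation itself.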
Code used:
############################ import section ############################
from __future__ import print_function # perform like in python 3.x
from keras.datasets import mnist
from keras.utils import np_utils # numpy utils for to_categorical()
from keras.models import Model, load_model
from keras.layers import Dense, GlobalAveragePooling2D, Dropout, GaussianDropout, Conv2D, MaxPooling2D
from keras.optimizers import SGD, Adam
from keras import backend as K
from keras.preprocessing.image import ImageDataGenerator
from keras import metrics
import os
import sys
from scipy import misc
import numpy as np
from keras.applications.vgg16 import preprocess_input as vgg16_preprocess_input
from keras.applications import VGG16
from keras.callbacks import CSVLogger, ModelCheckpoint
############################ manual settings ###########################
# general settings
seed = 1337
loss_function = 'categorical_crossentropy'
learning_rate = 0.001
epochs = 10
batch_size = 20
nb_classes = 5
img_width, img_height = 400, 400 # >= 48 necessary, as VGG16 is used
chosen_optimizer = SGD(lr=learning_rate, momentum=0.0, decay=0.0, nesterov=False)
steps_per_epoch = 40 // batch_size # 40 train samples in 5 classes
validation_steps = 40 // batch_size # 40 validation samples in 5 classes
data_dir = None # TODO: set path where data is stored (folders: 'train', 'val', 'test'; within each folder are folders named by classes)
# callbacks: CSVLogger & ModelCheckpoint
filepath = None # TODO: set path, where you want to store files generated by the callbacks
file_best_checkpoint= 'best_epoch.hdf5'
file_csvlogger = 'logged_metrics.txt'
modelcheckpoint_best_epoch = ModelCheckpoint(filepath=os.path.join(filepath, file_best_checkpoint),
                                             monitor='val_loss', verbose=1,
                                             save_best_only=True,
                                             save_weights_only=False, mode='auto',
                                             period=1) # every epoch executed
csvlogger = CSVLogger(os.path.join(filepath, file_csvlogger) , separator=',', append=False)
############################ prepare data ##############################
# get validation data (for evaluation)
X_val, Y_val = None, None # TODO: load validation data (4D array: samples, img_width, img_height, nb_channels) IMPORTANT: 5 classes with 8 images each.
# preprocess data
my_preprocessing_function = mf.my_vgg16_preprocess_input # NOTE: 'mf' is the author's own helper module; it is neither imported nor shown in this post
# 'augmentation' configuration we will use for training
train_datagen = ImageDataGenerator(preprocessing_function = my_preprocessing_function) # only preprocessing; static data set
# 'augmentation' configuration we will use for validation
val_datagen = ImageDataGenerator(preprocessing_function = my_preprocessing_function) # only preprocessing; static data set
train_data_dir = os.path.join(data_dir, 'train')
validation_data_dir = os.path.join(data_dir, 'val')
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle=True,
    seed=seed, # random seed for shuffling and transformations
    class_mode='categorical') # label type (categorical = one-hot vector)
validation_generator = val_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle=True,
    seed=seed, # random seed for shuffling and transformations
    class_mode='categorical') # label type (categorical = one-hot vector)
############################## training ###############################
print("\n---------------------------------------------------------------")
print("------------------------ training model -----------------------")
print("---------------------------------------------------------------")
# create the base pre-trained model
base_model = VGG16(include_top=False, weights = None, input_shape=(img_width, img_height, 3), pooling = 'max', classes = nb_classes)
model_name = "VGG_modified"
# do not freeze any layers --> all layers trainable
for layer in base_model.layers:
    layer.trainable = True
# define topping of base_model
x = base_model.output # get the last layer of our base_model
x = Dense(1024, activation='relu', name='fc1')(x)
x = Dense(1024, activation='relu', name='fc2')(x)
predictions = Dense(nb_classes, activation='softmax', name='predictions')(x)
# finally, stack model together
model = Model(outputs=predictions, name= model_name, inputs=base_model.input) #Keras 1.x.x: model = Model(input=base_model.input, output=predictions)
print(model.summary())
# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer=chosen_optimizer, loss=loss_function,
              metrics=['categorical_accuracy', 'kullback_leibler_divergence'])
# train the model on your data
model.fit_generator(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=validation_steps,
    callbacks=[csvlogger, modelcheckpoint_best_epoch])
############################## evaluation ##############################
print("\n\n---------------------------------------------------------------")
print("------------------ Evaluation of Best Epoch -------------------")
print("---------------------------------------------------------------")
# load model (corresponding to best training epoch)
model = load_model(os.path.join(filepath, file_best_checkpoint))
# evaluate model on validation data (in test mode!)
list_of_metrics = model.evaluate(X_val, Y_val, batch_size=batch_size, verbose=1, sample_weight=None)
print('\nMetrics:')
# model.evaluate() returns the values in the same order as model.metrics_names
for metric_name, metric_value in zip(model.metrics_names, list_of_metrics):
    print(metric_name + ':', str(metric_value))
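EDIT 2:
Regarding the first finding above:
If I use the same generator for the validation data during training and during evaluation (by using evaluate_generator()), the problem still occurs.
So this is definitely a problem caused by the generators...

Only the metrics on the validation data set are actually evaluated.
During training, the metrics computed on the training data set do not reflect the true metrics of the model at the end of the epoch, because the model is updated (modified) after every batch.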
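The point about per-batch updates can be illustrated without Keras. Below is a toy sketch in plain Python (a one-parameter linear model trained with mini-batch SGD; all names and numbers are made up for illustration). It assumes, as Keras does for its training logs, that the logged training loss is the average of the batch losses seen during the epoch, each measured before the corresponding weight update.

```python
def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_one_epoch(w, xs, ys, batch_size, lr):
    """Run one epoch of mini-batch SGD.

    Returns the updated weight and the mean of the per-batch losses,
    which plays the role of the training loss logged for the epoch.
    """
    batch_losses = []
    for start in range(0, len(xs), batch_size):
        bx = xs[start:start + batch_size]
        by = ys[start:start + batch_size]
        batch_losses.append(mse(w, bx, by))  # measured BEFORE this update
        grad = sum(2 * (w * x - y) * x for x, y in zip(bx, by)) / len(bx)
        w -= lr * grad                       # weights change after every batch
    return w, sum(batch_losses) / len(batch_losses)


xs = [i / 20.0 - 1.0 for i in range(40)]  # 40 toy samples
ys = [2.0 * x for x in xs]                # data generated with w = 2

w, logged_loss = train_one_epoch(0.0, xs, ys, batch_size=20, lr=0.1)
end_of_epoch_loss = mse(w, xs, ys)        # same data, final weights

print(logged_loss, end_of_epoch_loss)
```

The two printed numbers differ even though both are computed on identical data, because the first mixes losses from several intermediate weight states. This explains a gap between loss and val_loss on identical data; it does not by itself explain a gap between val_loss and model.evaluate(), since both of those use fixed weights.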
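Does this help?
CSVLogger tracks the metrics on the validation set after every epoch. Let's assume that the last epoch results in the best configuration of weights. That means the last tracked metrics on the validation set are the metrics I should get when evaluating on the validation set. What am I missing?
Hmm, what is the metric used for save_best_only?
The monitored quantity is the validation loss (val_categorical_crossentropy).
Actually, it does not matter... Sorry, I am stuck on this case as well. Ideally, you should provide some code so that we can reproduce your problem and help solve it :-)
I tracked down the problem. Please see the edit above.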