TensorFlow: How to set the checkpoint for fine-tuning (TensorFlow, Object Detection API)


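When I start retraining a model from the model zoo (ssd_mobilenet_v2), the loss is very large, even though the model's accuracy on the validation set is good. The training log is shown below.

This log cannot have come from a trained model, so I suspect the checkpoint was not loaded for fine-tuning. How can I fine-tune an already trained model on the same dataset? I did not modify the network structure at all.

I set the checkpoint path in pipeline.config as follows: fine_tune_checkpoint: "//ssd_mobilenet_v2_coco_2018_03_29/model.ckpt". If I set model_dir to my download directory, training does not run because the global step is already larger than max_steps. If I then increase max_steps, I can see log messages about restoring parameters from the checkpoint, but it fails with an error saying some parameters cannot be restored. So I set model_dir to an empty directory instead. Training then runs normally, but the loss at step 0 is very large and the validation results are very bad.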

Training log:

INFO:tensorflow:loss = 356.25497, step = 0
INFO:tensorflow:global_step/sec: 1.89768
INFO:tensorflow:loss = 11.221423, step = 100 (52.700 sec)
INFO:tensorflow:global_step/sec: 2.21685
INFO:tensorflow:loss = 10.329516, step = 200 (45.109 sec)

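If the initial training loss is about 400 (the value reported in the comments for training without a checkpoint), then the model was most likely restored from the checkpoint successfully, just not completely, so it is not identical to the checkpointed model.

Below is the restore_map function of the SSD model. Note that even if you set fine_tune_checkpoint_type: detection and provide a checkpoint of exactly the same model, only the variables in the feature_extractor scope are restored. To restore as many variables from the checkpoint as possible, you must set load_all_detection_checkpoint_vars: true in your config file.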

def restore_map(self,
                fine_tune_checkpoint_type='detection',
                load_all_detection_checkpoint_vars=False):

  if fine_tune_checkpoint_type not in ['detection', 'classification']:
    raise ValueError('Not supported fine_tune_checkpoint_type: {}'.format(
        fine_tune_checkpoint_type))

  # Classification checkpoints only provide the feature extractor.
  if fine_tune_checkpoint_type == 'classification':
    return self._feature_extractor.restore_from_classification_checkpoint_fn(
        self._extract_features_scope)

  if fine_tune_checkpoint_type == 'detection':
    variables_to_restore = {}
    for variable in tf.global_variables():
      var_name = variable.op.name
      if load_all_detection_checkpoint_vars:
        # Flag set: every global variable is a restore candidate.
        variables_to_restore[var_name] = variable
      else:
        # Default: only variables under the feature extractor scope.
        if var_name.startswith(self._extract_features_scope):
          variables_to_restore[var_name] = variable

  return variables_to_restore
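To make the effect of the flag concrete, here is a minimal sketch that replays the 'detection' branch of the excerpt above on plain strings. The helper name `select_variables_to_restore` and the variable names are hypothetical and chosen only for illustration; real names depend on the model graph, and the framework may filter the map further before restoring.

```python
def select_variables_to_restore(var_names, feature_extractor_scope,
                                load_all_detection_checkpoint_vars=False):
    """Mimic the 'detection' branch of restore_map using names only."""
    selected = []
    for name in var_names:
        # Same condition as in the excerpt: everything when the flag is set,
        # otherwise only names under the feature extractor scope.
        if (load_all_detection_checkpoint_vars
                or name.startswith(feature_extractor_scope)):
            selected.append(name)
    return selected


# Hypothetical variable names, for illustration only.
names = [
    "FeatureExtractor/MobilenetV2/Conv/weights",
    "FeatureExtractor/MobilenetV2/Conv/BatchNorm/gamma",
    "BoxPredictor_0/BoxEncodingPredictor/weights",
    "BoxPredictor_0/ClassPredictor/weights",
    "global_step",
]

default_selection = select_variables_to_restore(names, "FeatureExtractor")
full_selection = select_variables_to_restore(names, "FeatureExtractor", True)

print(len(default_selection), "of", len(names))  # 2 of 5
print(len(full_selection), "of", len(names))     # 5 of 5
```

With the flag left at its default, the prediction-head variables in this toy list are not selected, so they would keep their fresh initialization. That would be consistent with a high loss at step 0 even though a checkpoint was found.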

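It would be best to paste your complete config file here. For example, if the fine-tune checkpoint type is set to classification, some parameters will not be restored from the checkpoint, so the loss may be high at the start.

Thanks for your reply. I have just added the pipeline.config contents. After downloading it from the model zoo, I did not modify the pipeline file except for the dataset and label paths. Is there anything else that needs to be changed? And how can I tell whether the model's parameters have been loaded?

What you can do is train from scratch without the checkpoint, then compare the results with a run that has the checkpoint set and see whether there is any difference. I would expect the training loss to be much larger this time (by several orders of magnitude).

Thanks for your suggestion. I removed the checkpoint and found that the first loss is greater than 400. There is also a warning that the checkpoint cannot be found. So the checkpoint was being loaded correctly before, but I am still confused about why the first loss is so large. I set val_step to 1 and made the learning rate small enough, so that after loading the checkpoint the model is validated at every iteration. But the APs are all zero, which suggests the pretrained model was not loaded correctly.

Thanks a lot. I get the correct loss now. By the way, can you tell me where the file you quoted is located? I want to modify the network with MorphNet. If you have any suggestions about that, please let me know. Thanks.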
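On the question of how to tell whether the parameters were loaded, one option is to compare the variable names and shapes the model expects with those stored in the checkpoint. The sketch below does this on plain dictionaries. The helper name `restore_coverage` and all names and shapes in the example are made up for illustration.

```python
def restore_coverage(model_vars, checkpoint_vars):
    """Split model variables into restorable, missing, and shape-mismatched.

    Both arguments map a variable name to its shape (tuple or list of ints).
    """
    restorable, missing, mismatched = [], [], []
    for name, shape in sorted(model_vars.items()):
        if name not in checkpoint_vars:
            missing.append(name)        # not in the checkpoint at all
        elif tuple(checkpoint_vars[name]) != tuple(shape):
            mismatched.append(name)     # present, but shape differs
        else:
            restorable.append(name)     # name and shape both match
    return restorable, missing, mismatched


# Made-up names and shapes, for illustration only.
model_vars = {
    "FeatureExtractor/MobilenetV2/Conv/weights": (3, 3, 3, 32),
    "BoxPredictor_0/ClassPredictor/weights": (1, 1, 576, 12),
    "BoxPredictor_0/ClassPredictor/biases": (12,),
}
checkpoint_vars = {
    "FeatureExtractor/MobilenetV2/Conv/weights": [3, 3, 3, 32],
    "BoxPredictor_0/ClassPredictor/weights": [1, 1, 576, 273],
}

restorable, missing, mismatched = restore_coverage(model_vars, checkpoint_vars)
print("restorable:", restorable)
print("missing:", missing)
print("mismatched:", mismatched)
```

In a real run, the checkpoint side could be filled with `dict(tf.train.list_variables(checkpoint_path))`, which yields name and shape pairs, and the model side from the graph's global variables. This only shows which variables could be restored by name and shape; it does not confirm which ones the training job actually restored.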
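Training script (from the question):

import tensorflow as tf
from object_detection import model_hparams
from object_detection import model_lib

model_dir = '/ssd_mobilenet_v2_coco_2018_03_29/retrain0524'
pipeline_config_path = '/ssd_mobilenet_v2_coco_2018_03_29/pipeline.config'

# Defined in the original script but not used below. The fine-tuning
# checkpoint is read from fine_tune_checkpoint in pipeline.config.
checkpoint_dir = '/ssd_mobilenet_v2_coco_2018_03_29/model.ckpt'
num_train_steps = 300000

# Not defined in the original snippet. These values are assumptions
# (believed to match the flag defaults in object_detection/model_main.py).
hparams_overrides = None
sample_1_of_n_eval_examples = 1
sample_1_of_n_eval_on_train_examples = 5

# New checkpoints and event files are written to model_dir.
config = tf.estimator.RunConfig(model_dir=model_dir)
train_and_eval_dict = model_lib.create_estimator_and_inputs(
    run_config=config,
    hparams=model_hparams.create_hparams(hparams_overrides),
    pipeline_config_path=pipeline_config_path,
    sample_1_of_n_eval_examples=sample_1_of_n_eval_examples,
    sample_1_of_n_eval_on_train_examples=(sample_1_of_n_eval_on_train_examples))
estimator = train_and_eval_dict['estimator']
train_input_fn = train_and_eval_dict['train_input_fn']
eval_input_fns = train_and_eval_dict['eval_input_fns']
eval_on_train_input_fn = train_and_eval_dict['eval_on_train_input_fn']
predict_input_fn = train_and_eval_dict['predict_input_fn']
train_steps = train_and_eval_dict['train_steps']

train_spec, eval_specs = model_lib.create_train_and_eval_specs(
    train_input_fn,
    eval_input_fns,
    eval_on_train_input_fn,
    predict_input_fn,
    train_steps,
    eval_on_train_data=False)

# Only the first eval spec is used.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])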
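The script reads its settings from pipeline.config, so the answer's advice comes down to adding one field to that file. Below is a minimal sketch of that change, done as plain text editing. The helper name `enable_full_restore` is hypothetical, the sample fragment is illustrative with a placeholder path, and the sketch assumes the field belongs in the train_config block.

```python
import re

FLAG = "load_all_detection_checkpoint_vars"


def enable_full_restore(config_text):
    """Return config_text with FLAG set to true inside train_config."""
    if re.search(rf"\b{FLAG}\s*:", config_text):
        # The field is already present: force its value to true.
        return re.sub(rf"(\b{FLAG}\s*:\s*)\w+",
                      lambda m: m.group(1) + "true", config_text)
    # Otherwise add the field right after the opening brace of train_config.
    # If there is no train_config block, the text is returned unchanged.
    return re.sub(r"(train_config\s*:?\s*\{)",
                  lambda m: m.group(1) + "\n  " + FLAG + ": true",
                  config_text, count=1)


# Illustrative fragment only; the checkpoint path is a placeholder.
SAMPLE = '''train_config {
  batch_size: 24
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  fine_tune_checkpoint_type: "detection"
}
'''

patched = enable_full_restore(SAMPLE)
print(patched)
```

Editing the file by hand gives the same result. The regular expressions here are only meant for simple, well-formed config text, so a protobuf-aware tool would be more robust for anything unusual.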
