Python: trying to resume training from a checkpoint (TensorFlow), because I'm using Colab and 12 hours isn't enough


This is part of the code I'm using:

import os
import time
import tensorflow as tf

checkpoint_dir = 'training_checkpoints1'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)
Now the training part:

EPOCHS = 900

for epoch in range(EPOCHS):
    start = time.time()

    hidden = encoder.initialize_hidden_state()
    total_loss = 0

    for (batch, (inp, targ)) in enumerate(dataset):
        loss = 0

        with tf.GradientTape() as tape:
            enc_output, enc_hidden = encoder(inp, hidden)

            dec_hidden = enc_hidden

            dec_input = tf.expand_dims([targ_lang.word2idx['<start>']] * batch_size, 1)

            # Teacher forcing - feeding the target as the next input
            for t in range(1, targ.shape[1]):
                # passing enc_output to the decoder
                predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

                loss += loss_function(targ[:, t], predictions)

                # using teacher forcing
                dec_input = tf.expand_dims(targ[:, t], 1)

        batch_loss = (loss / int(targ.shape[1]))

        total_loss += batch_loss

        variables = encoder.variables + decoder.variables

        gradients = tape.gradient(loss, variables)

        optimizer.apply_gradients(zip(gradients, variables))

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))

    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)

    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / num_batches))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
Result (the value returned by checkpoint.restore):

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f6653263048>
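That CheckpointLoadStatus object is the normal return value of checkpoint.restore, not an error; TensorFlow restores values lazily as the matching variables are created. A minimal sketch (assuming the checkpoint and checkpoint_dir defined above) of how to check that the restore actually matched something:

latest = tf.train.latest_checkpoint(checkpoint_dir)   # None if nothing saved yet
if latest is not None:
    status = checkpoint.restore(latest)
    # Raises if a variable that already exists in encoder/decoder/optimizer
    # has no matching value in the checkpoint file.
    status.assert_existing_objects_matched()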

You should create a checkpoint manager at the start, like this:

checkpoint_path = os.path.abspath('.') + "/checkpoints"   # Put your path here
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)
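With a manager in place, the periodic save inside the training loop can go through it instead of checkpoint.save (a sketch reusing the epoch loop above; the manager numbers the checkpoints itself and keeps only the last max_to_keep):

# Inside the epoch loop, replacing checkpoint.save(file_prefix=...):
if (epoch + 1) % 2 == 0:
    save_path = ckpt_manager.save()   # e.g. .../checkpoints/ckpt-3
    print('Saved checkpoint for epoch {} at {}'.format(epoch + 1, save_path))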
Now, after a few epochs have run, to resume from the most recent checkpoint you should get the latest checkpoint from the checkpoint manager:

start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
    # restoring the latest checkpoint in checkpoint_path
    ckpt.restore(ckpt_manager.latest_checkpoint)

This will restore your session from the latest epoch.
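To actually continue training rather than start over, feed start_epoch into the loop range. A minimal sketch, assuming the same EPOCHS, dataset, and per-batch training step as in the question:

# Resume from the restored epoch instead of epoch 0.
# Note: this code saves every 2 epochs, so checkpoint 'ckpt-N' was written
# at epoch 2*N; scale start_epoch if the exact epoch number matters.
for epoch in range(start_epoch, EPOCHS):
    start = time.time()

    hidden = encoder.initialize_hidden_state()
    total_loss = 0

    # ... same per-batch loop, loss, and gradient step as in the question ...

    if (epoch + 1) % 2 == 0:
        ckpt_manager.save()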


Welcome to SO! Could you try to be more specific about what your problem is? Consider making the code snippet more focused, to create a minimal, reproducible example.