Training step not executed in PyTorch Lightning


I am fine-tuning a T5 model to summarize Amazon reviews, following this tutorial:

I noticed that the training_step in my code is never executed: the training loss stays NaN for the entire run. The validation step, however, computes just fine.

I have confirmed that there are no empty strings in the data, and I have tried several batch sizes.

This is the error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-53-45d4afebefac> in <module>()
----> 1 trainer.fit(model)

8 frames
<ipython-input-46-00fddffa2209> in training_epoch_end(self, outputs)
    134         print("OUTPUTS")
    135         print(outputs)
--> 136         avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
    137         tensorboard_logs = {"avg_train_loss": avg_train_loss}
    138         return {"avg_train_loss": avg_train_loss, "log": tensorboard_logs, 'progress_bar': tensorboard_logs}

RuntimeError: stack expects a non-empty TensorList
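
For reference, here is a stripped-down sketch (not my actual model) of how training_step outputs feed training_epoch_end on the pre-2.0 PyTorch Lightning API the tutorial uses; if training_step never runs, the outputs list stays empty and torch.stack fails with exactly this error:

import torch
import pytorch_lightning as pl


class MinimalModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Every dict returned here is collected into the `outputs` list
        # that training_epoch_end receives once per epoch.
        return {"loss": loss}

    def training_epoch_end(self, outputs):
        # If training_step was never executed, `outputs` is an empty list and
        # torch.stack raises "stack expects a non-empty TensorList".
        if not outputs:
            return
        avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
        self.log("avg_train_loss", avg_train_loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
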
Here are my parameters:

args_dict = dict(
    output_dir="", # path to save the checkpoints
    model_name_or_path='t5-small',
    tokenizer_name_or_path='t5-small',
    max_input_length=512,
    max_output_length=150,
    freeze_encoder=False,
    freeze_embeds=False,
    learning_rate=3e-4,
    weight_decay=0.0,
    adam_epsilon=1e-8,
    warmup_steps=0,
    train_batch_size=20,
    eval_batch_size=20,
    num_train_epochs=2,
    gradient_accumulation_steps=8,
    n_gpu=1,
    resume_from_checkpoint=None, 
    val_check_interval = 0.05, 
    n_val=1000,
    n_train=-1,
    n_test=-1,
    early_stop_callback=False,
    fp_16=False, # if you want to enable 16-bit training then install apex and set this to true
    opt_level='O1', # you can find out more on optimisation levels here https://nvidia.github.io/apex/amp.html#opt-levels-and-properties
    max_grad_norm=1.0, # if you enable 16-bit training then set this to a sensible value, 0.5 is a good default
    seed=42,
)
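
For completeness, the arguments get wired up roughly like this before calling trainer.fit (a sketch: T5FineTuner stands in for the LightningModule defined in the tutorial notebook, and the Trainer flags assume a pre-2.0 Lightning release):

import argparse
import pytorch_lightning as pl

# Wrap the plain dict in a Namespace so it can be passed around as hparams.
args = argparse.Namespace(**args_dict)

model = T5FineTuner(args)
trainer = pl.Trainer(
    max_epochs=args.num_train_epochs,
    gpus=args.n_gpu,
    accumulate_grad_batches=args.gradient_accumulation_steps,
    gradient_clip_val=args.max_grad_norm,
    val_check_interval=args.val_check_interval,
)
trainer.fit(model)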

This code seems to be outdated; the conflict is caused by the optimizer_step method. I just commented out the override shown at the end of this post and it worked for me. If you want to run any custom logic in that function, it is best to check the latest PyTorch Lightning code first.


Hey, any update on this? I am running into the same problem.
# optimizer_step override from the tutorial (the part the answer above suggests
# commenting out); its keyword arguments no longer match newer PyTorch
# Lightning releases.
def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                   second_order_closure=None, using_native_amp=False,
                   on_tpu=None, using_lbfgs=None, optimizer_closure=None):
    if self.trainer.use_tpu:
        xm.optimizer_step(optimizer)  # xm is typically torch_xla.core.xla_model
    else:
        optimizer.step(closure=optimizer_closure)
    optimizer.zero_grad()
    self.lr_scheduler.step()
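
One caveat if you delete the override: it was also the only place self.lr_scheduler.step() was called. A minimal sketch of moving the optimizer and scheduler into configure_optimizers instead, so Lightning steps both automatically (assuming self.hparams carries the values from args_dict and using transformers' get_linear_schedule_with_warmup; the step count is a placeholder):

import torch
from transformers import get_linear_schedule_with_warmup

def configure_optimizers(self):
    # With no optimizer_step override, Lightning calls optimizer.step() and
    # scheduler.step() itself when both are returned from this hook.
    optimizer = torch.optim.AdamW(
        self.parameters(),
        lr=self.hparams.learning_rate,
        eps=self.hparams.adam_epsilon,
        weight_decay=self.hparams.weight_decay,
    )
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=self.hparams.warmup_steps,
        num_training_steps=1000,  # placeholder: derive from len(dataloader) and epochs
    )
    return [optimizer], [{"scheduler": scheduler, "interval": "step"}]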