Training step not executed in PyTorch Lightning
I am fine-tuning a T5 model to summarize Amazon reviews. I am following this tutorial:

I noticed that the training_step in my code is never executed, since the training loss remains NaN throughout the epoch. However, the validation step computes fine. I have confirmed there are no empty strings in the data and have tried multiple batch sizes.

This is the error:
RuntimeError Traceback (most recent call last)
<ipython-input-53-45d4afebefac> in <module>()
----> 1 trainer.fit(model)
8 frames
<ipython-input-46-00fddffa2209> in training_epoch_end(self, outputs)
134 print("OUTPUTS")
135 print(outputs)
--> 136 avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
137 tensorboard_logs = {"avg_train_loss": avg_train_loss}
138 return {"avg_train_loss": avg_train_loss, "log": tensorboard_logs, 'progress_bar': tensorboard_logs}
RuntimeError: stack expects a non-empty TensorList
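For what it's worth, a quick way to confirm the diagnosis is to guard training_epoch_end against an empty outputs list. This only avoids the crash rather than fixing the root cause (see the answer below), but it makes it explicit that training_step never produced a result. A minimal sketch based on the code shown in the traceback:

import torch

def training_epoch_end(self, outputs):
    # If training_step never ran (or returned nothing), outputs is empty and
    # torch.stack raises "stack expects a non-empty TensorList".
    if not outputs:
        print("training_step produced no outputs this epoch")
        return
    avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
    tensorboard_logs = {"avg_train_loss": avg_train_loss}
    return {"avg_train_loss": avg_train_loss, "log": tensorboard_logs,
            "progress_bar": tensorboard_logs}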
Here are my args:
args_dict = dict(
    output_dir="",  # path to save the checkpoints
    model_name_or_path='t5-small',
    tokenizer_name_or_path='t5-small',
    max_input_length=512,
    max_output_length=150,
    freeze_encoder=False,
    freeze_embeds=False,
    learning_rate=3e-4,
    weight_decay=0.0,
    adam_epsilon=1e-8,
    warmup_steps=0,
    train_batch_size=20,
    eval_batch_size=20,
    num_train_epochs=2,
    gradient_accumulation_steps=8,
    n_gpu=1,
    resume_from_checkpoint=None,
    val_check_interval=0.05,
    n_val=1000,
    n_train=-1,
    n_test=-1,
    early_stop_callback=False,
    fp_16=False,  # if you want to enable 16-bit training then install apex and set this to true
    opt_level='O1',  # you can find out more on optimisation levels here https://nvidia.github.io/apex/amp.html#opt-levels-and-properties
    max_grad_norm=1.0,  # if you enable 16-bit training then set this to a sensible value, 0.5 is a good default
    seed=42,
)
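For context, in this tutorial pattern the dict is usually wrapped in an argparse.Namespace and handed to the LightningModule and the Trainer. A minimal sketch, assuming the tutorial's T5FineTuner module name and an older Lightning version where gpus is a Trainer argument:

import argparse
import pytorch_lightning as pl

args = argparse.Namespace(**args_dict)
model = T5FineTuner(args)  # LightningModule from the tutorial (assumed name)

trainer = pl.Trainer(
    max_epochs=args.num_train_epochs,
    gpus=args.n_gpu,
    accumulate_grad_batches=args.gradient_accumulation_steps,
    gradient_clip_val=args.max_grad_norm,
    val_check_interval=args.val_check_interval,
)
trainer.fit(model)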
Hey, any update on this? I am running into the same issue.

This code seems to be out of date; the conflict is caused by the optimizer_step method. I just commented out the section below and it worked for me. If you need to run any custom logic in this function, it is best to refer to the latest upstream code.
def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                   second_order_closure=None, using_native_amp=False,
                   on_tpu=None, using_lbfgs=None, optimizer_closure=None):
    if self.trainer.use_tpu:
        xm.optimizer_step(optimizer)
    else:
        optimizer.step(closure=optimizer_closure)
    optimizer.zero_grad()
    self.lr_scheduler.step()
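For reference, recent PyTorch Lightning versions let you drop the optimizer_step override entirely and instead return the scheduler from configure_optimizers with a per-step interval, which reproduces the same stepping behaviour. A minimal sketch, assuming the hyperparameters above are stored on self.hparams and that a linear warmup schedule is wanted (self.total_steps is a hypothetical attribute you would compute from your dataloader):

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def configure_optimizers(self):
    optimizer = AdamW(
        self.parameters(),
        lr=self.hparams.learning_rate,
        eps=self.hparams.adam_epsilon,
        weight_decay=self.hparams.weight_decay,
    )
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=self.hparams.warmup_steps,
        num_training_steps=self.total_steps,  # hypothetical; compute from your data
    )
    # "interval": "step" makes Lightning call scheduler.step() after every
    # optimizer step, matching the commented-out override above.
    return {
        "optimizer": optimizer,
        "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
    }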