TensorFlow: ResourceExhaustedError when trying to train ResNet on Google Colab

Tags: tensorflow, keras, deep-learning, google-colaboratory

I am trying to train ResNet56 on Google Colab on a custom dataset where each image has dimensions 299x299x1. This is the error I get:

ResourceExhaustedError:  OOM when allocating tensor with shape[32,16,299,299] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node resnet/conv2d_21/Conv2D (defined at <ipython-input-15-3b824ba8fe2a>:3) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_function_21542]

Function call stack:
train_function

Any ideas?

If you are running out of memory, there is not much you can do.

What I can think of:

  • Reduce the batch size (see the sketch right after this list)
  • Reduce the image input size
  • If you do reduce the batch size, you may also need to lower the learning rate if you find that training no longer converges
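
For the first two points, here is a minimal sketch assuming a generic tf.data input pipeline; raw_dataset, the batch size of 16 and the 224x224 target size are placeholder values for illustration, not taken from the question:

    import tensorflow as tf

    BATCH_SIZE = 16            # e.g. half of the original 32, which roughly halves activation memory
    TARGET_SIZE = (224, 224)   # a smaller spatial size also shrinks every feature map

    def shrink(image, label):
        # Resize before the image reaches the network, so the large
        # [batch, filters, height, width] activations that caused the OOM get smaller.
        image = tf.image.resize(image, TARGET_SIZE)
        return image, label

    # dataset = raw_dataset.map(shrink).batch(BATCH_SIZE)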


P.S. SGD will do much better if you add momentum to it, e.g. SGD(lr=1e-1, momentum=0.9).
I also got the same error; it was caused by a large image size or a large batch size. I was using an image size of 512*512 with a batch size of 10. I reduced the batch size to 2 and it started working for me.

Reduce the batch size. Also, FYI, if you want to use SGD, add momentum to it as well; SGD is much better with momentum.

Reducing the batch size actually worked. Could you post this as an answer so I can accept it? Thanks for the suggestion! I have posted my full training setup below:

    import os
    import math
    import tensorflow as tf
    from tensorflow.keras.optimizers import SGD
    # EpochCheckpoint, TrainingMonitor, ResNet and get_batched_dataset are
    # project-specific helpers imported from elsewhere in this project.

    TRAINING_SIZE = 9287
    VALIDATION_SIZE = 1194

    AUTO = tf.data.experimental.AUTOTUNE  # used in the tf.data.Dataset API
    BATCH_SIZE = 32

    # Create the checkpoint directory on Google Drive if it does not exist yet.
    model_checkpoint_path = "/content/drive/My Drive/Patch Classifier/Data/patch_classifier_checkpoint"
    if not os.path.exists(model_checkpoint_path):
        os.mkdir(model_checkpoint_path)

    CALLBACKS = [
        EpochCheckpoint(model_checkpoint_path, every=2, startAt=0),
        TrainingMonitor("/content/drive/My Drive/Patch Classifier/Training/resnet56.png",
                        jsonPath="/content/drive/My Drive/Patch Classifier/Training/resnet56",
                        startAt=0)
    ]

    # Number of batches needed to cover each split once per epoch.
    compute_steps_per_epoch = lambda x: int(math.ceil(1. * x / BATCH_SIZE))
    steps_per_epoch = compute_steps_per_epoch(TRAINING_SIZE)
    val_steps = compute_steps_per_epoch(VALIDATION_SIZE)

    # Plain SGD without momentum (see the suggestion above).
    opt = SGD(lr=1e-1)
    # ResNet56 for 299x299x1 inputs and 5 classes, with L2 regularization.
    model = ResNet.build(299, 299, 1, 5, (9, 9, 9), (64, 64, 128, 256), reg=0.005)
    model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

    history = model.fit(get_batched_dataset("/content/drive/My Drive/Patch Classifier/Data/patch_classifier_train_0.tfrecords"),
                        steps_per_epoch=steps_per_epoch, epochs=10,
                        validation_data=get_batched_dataset("/content/drive/My Drive/Patch Classifier/Data/patch_classifier_val_0.tfrecords"),
                        validation_steps=val_steps,
                        callbacks=CALLBACKS)
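
For completeness, a hedged sketch of how the suggestions above could be applied to this script, reusing the names already defined in it; the batch size of 8 is only an example (the commenter above had to go down to 2 at 512*512), and the momentum value is the one from the answer's P.S.:

    BATCH_SIZE = 8  # was 32; shrinking the batch is what cleared the OOM for the commenter above

    # Re-derive the step counts so each epoch still covers the whole split.
    steps_per_epoch = compute_steps_per_epoch(TRAINING_SIZE)
    val_steps = compute_steps_per_epoch(VALIDATION_SIZE)

    # SGD with momentum, as recommended in the answer.
    opt = SGD(lr=1e-1, momentum=0.9)
    model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])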