How to catch any exception during model training in TensorFlow 2

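I am training a U-Net model with TensorFlow. If any image I pass to the model for training is faulty, an exception is thrown. Sometimes this happens an hour or two into training. Is it possible to catch any such exception so that my model can move on to the next image and resume training? I have tried adding a try/except block to the process_path function shown below, but it has no effect.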

def process_path(filePath):
    # catching exceptions here has no effect
    parts = tf.strings.split(filePath, '/')
    fileName = parts[-1]
    parts = tf.strings.split(fileName, '.')
    prefix = tf.convert_to_tensor(maskDir, dtype=tf.string)
    suffix = tf.convert_to_tensor("-mask.png", dtype=tf.string)
    maskFileName = tf.strings.join((parts[-2], suffix))
    maskPath = tf.strings.join((prefix, maskFileName), separator='/')

    # load the raw data from the file as a string
    img = tf.io.read_file(filePath)
    img = decode_img(img)
    mask = tf.io.read_file(maskPath)
    oneHot = decodeMask(mask)
    img.set_shape([256, 256, 3])
    oneHot.set_shape([256, 256, 10])
    return img, oneHot

trainSize = int(0.7 * DATASET_SIZE)
validSize = int(0.3 * DATASET_SIZE)
batchSize = 32

allDataSet = tf.data.Dataset.list_files(str(imageDir + "/*"))

trainDataSet = allDataSet.take(trainSize)
trainDataSet = trainDataSet.shuffle(1000).repeat()
trainDataSet = trainDataSet.map(process_path, num_parallel_calls=tf.data.experimental.AUTOTUNE)
trainDataSet = trainDataSet.batch(batchSize)
trainDataSet = trainDataSet.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

validDataSet = allDataSet.skip(trainSize)
validDataSet = validDataSet.shuffle(1000).repeat()
validDataSet = validDataSet.map(process_path)
validDataSet = validDataSet.batch(batchSize)

imageHeight = 256
imageWidth = 256
channels = 3

inputImage = Input((imageHeight, imageWidth, channels), name='img') 
model = baseUnet.get_unet(inputImage, n_filters=16, dropout=0.05, batchnorm=True)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

callbacks = [
    EarlyStopping(patience=5, verbose=1),
    ReduceLROnPlateau(factor=0.1, patience=5, min_lr=0.00001, verbose=1),
    ModelCheckpoint(outputModel, verbose=1, save_best_only=True, save_weights_only=False)
]

BATCH_SIZE = 32
BUFFER_SIZE = 1000
EPOCHS = 20

stepsPerEpoch = int(trainSize / BATCH_SIZE)
validationSteps = int(validSize / BATCH_SIZE)

model_history = model.fit(trainDataSet, epochs=EPOCHS,
                          steps_per_epoch=stepsPerEpoch,
                          validation_steps=validationSteps,
                          validation_data=validDataSet,
                          callbacks=callbacks)
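A linked post describes a similar case and explains that "the Python function is executed only once to build the function graph, and try and except statements have no effect in this respect." That post does show how to iterate over a dataset and catch errors: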

dataset = ...
iterator = iter(dataset)

while True:
  try:
    elem = next(iterator)
    ...
  except tf.errors.InvalidArgumentError:
    ...
  except StopIteration:
    break
...However, I am looking for a way to catch errors during training. Is this possible?
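
One commonly suggested way to get this behaviour is to let tf.data drop the failing elements itself instead of catching the exception in Python. The sketch below assumes the error is raised inside the map step (for example by a decode op) and uses tf.data.experimental.ignore_errors(). The in-memory PNG bytes and the decode helper are hypothetical stand-ins for the image files and process_path above, not part of the original code.

```python
import tensorflow as tf

# Hypothetical stand-ins for image files: two valid PNGs and one corrupt blob.
good_png = tf.io.encode_png(tf.zeros([4, 4, 3], dtype=tf.uint8)).numpy()
raw_files = [good_png, b"this is not a PNG", good_png]

def decode(raw):
    # Fails at iteration time (not at graph-build time) when the bytes
    # are not a valid PNG, which is why try/except around it has no effect.
    img = tf.io.decode_png(raw, channels=3)
    return tf.image.convert_image_dtype(img, tf.float32)

dataset = tf.data.Dataset.from_tensor_slices(raw_files)
dataset = dataset.map(decode)
# Drop elements whose map step raised an error; placed before batch()
# so that one bad image removes one element, not a whole batch.
dataset = dataset.apply(tf.data.experimental.ignore_errors())
dataset = dataset.batch(2)

# Only the two valid images survive, so a single batch is produced.
shapes = [tuple(batch.shape) for batch in dataset]
print(shapes)
```

In the pipeline from the question this would mean adding the ignore_errors line right after the map(process_path, ...) call and before batch(...), for both the training and validation datasets. Recent TensorFlow releases also offer a Dataset.ignore_errors() method, so check which form your version provides. Keep in mind that files dropped this way are skipped silently, so it is worth validating the images separately to find out which ones are broken.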