Python Keras with TensorFlow backend: MemoryError in model.fit() with checkpoint callbacks
I am trying to train an autoencoder. Keras keeps raising a MemoryError in model.fit(), and it always happens when I pass a validation-related argument to model.fit, such as validation_split.

The code that raises the error:
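import os

from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.utils.io_utils import HDF5Matrix

# video_root_path, dataset, time_length, job_folder, logger, batch_size,
# nb_epoch, model and the custom LossHistory callback are defined elsewhere.
data = HDF5Matrix(
    os.path.join(video_root_path, '{0}/{0}_train_t{1}.h5'.format(dataset, time_length)),
    'data')
# The file name contains {val_loss}, so this callback needs validation data.
snapshot = ModelCheckpoint(
    os.path.join(job_folder, 'model_snapshot_e{epoch:03d}_{val_loss:.6f}.h5'))
earlystop = EarlyStopping(patience=10)
history_log = LossHistory(job_folder=job_folder, logger=logger)
logger.info("Initializing training...")
history = model.fit(
    data,  # input
    data,  # target: the autoencoder reconstructs its own input
    batch_size=batch_size,
    epochs=nb_epoch,
    validation_split=0.15,
    shuffle='batch',
    callbacks=[snapshot, earlystop, history_log]
)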
When I remove validation_split=0.15 from model.fit and snapshot from the callbacks, the code runs correctly.
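A likely reason snapshot has to go together with validation_split: ModelCheckpoint fills in its file name template with the epoch number and the logged metrics, and val_loss is only logged when validation data is present. The sketch below imitates that formatting step with plain str.format; the logs dictionaries and their values are made up for illustration, and snapshot_name is a hypothetical helper, not a Keras function.

```python
# File name template from the question's ModelCheckpoint callback.
template = 'model_snapshot_e{epoch:03d}_{val_loss:.6f}.h5'

# Hypothetical logs, as they might look with and without validation data.
logs_with_val = {'loss': 0.5, 'val_loss': 0.25}
logs_without_val = {'loss': 0.5}


def snapshot_name(epoch, logs):
    """Format the checkpoint file name from an epoch number and a logs dict."""
    return template.format(epoch=epoch, **logs)


print(snapshot_name(1, logs_with_val))  # model_snapshot_e001_0.250000.h5

missing_key = None
try:
    snapshot_name(1, logs_without_val)
except KeyError as err:
    # The template asks for a metric that was never logged.
    missing_key = err.args[0]
    print('missing metric:', missing_key)  # missing metric: val_loss
```

So dropping validation_split while keeping this file name template would be expected to fail for a reason unrelated to memory.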
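The data variable contains all the processed images from the training dataset. Its shape is (15200, 8, 224, 224, 1) and its size is 6101401600.
The code runs on a machine with 64 GB of RAM and a Tesla P100, so memory should not be a concern, and my Python is 64-bit.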
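As a quick sanity check on those numbers, the element count and a rough memory footprint can be worked out directly. The sketch below assumes the array would be held as float32 or float64 if it were fully loaded; the question does not state the dtype stored in the HDF5 file, so the byte figures are estimates only.

```python
from math import prod

# samples, frames, height, width, channels
shape = (15200, 8, 224, 224, 1)
n_elements = prod(shape)
print(n_elements)  # 6101401600

GIB = 1024 ** 3
# Assumed dtypes; the real on-disk dtype is not given in the question.
for dtype_name, bytes_per_element in [('float32', 4), ('float64', 8)]:
    size_gib = n_elements * bytes_per_element / GIB
    print(f'{dtype_name}: {size_gib:.1f} GiB')

# Number of samples held out by validation_split=0.15.
n_val = shape[0] * 15 // 100
print(n_val)  # 2280
```

Under the float32 assumption the full array is roughly 22.7 GiB, so one in-memory copy fits in 64 GB but a few temporary copies would not.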
Model:
from keras.layers import (Activation, BatchNormalization, Conv2D,
                          Conv2DTranspose, ConvLSTM2D, Input, TimeDistributed)

# t is the number of frames per clip, defined elsewhere.
input_tensor = Input(shape=(t, 224, 224, 1))

# Encoder: two strided convolutions applied to every frame.
conv1 = TimeDistributed(
    Conv2D(128, kernel_size=(11, 11), padding='same', strides=(4, 4), name='conv1'),
    input_shape=(t, 224, 224, 1))(input_tensor)
conv1 = TimeDistributed(BatchNormalization())(conv1)
conv1 = TimeDistributed(Activation('relu'))(conv1)
conv2 = TimeDistributed(
    Conv2D(64, kernel_size=(5, 5), padding='same', strides=(2, 2), name='conv2'))(conv1)
conv2 = TimeDistributed(BatchNormalization())(conv2)
conv2 = TimeDistributed(Activation('relu'))(conv2)

# Temporal part: three ConvLSTM layers over the frame sequence.
convlstm1 = ConvLSTM2D(64, kernel_size=(3, 3), padding='same', return_sequences=True, name='convlstm1')(conv2)
convlstm2 = ConvLSTM2D(32, kernel_size=(3, 3), padding='same', return_sequences=True, name='convlstm2')(convlstm1)
convlstm3 = ConvLSTM2D(64, kernel_size=(3, 3), padding='same', return_sequences=True, name='convlstm3')(convlstm2)

# Decoder: two strided transposed convolutions applied to every frame.
deconv1 = TimeDistributed(
    Conv2DTranspose(128, kernel_size=(5, 5), padding='same', strides=(2, 2), name='deconv1'))(convlstm3)
deconv1 = TimeDistributed(BatchNormalization())(deconv1)
deconv1 = TimeDistributed(Activation('relu'))(deconv1)
decoded = TimeDistributed(
    Conv2DTranspose(1, kernel_size=(11, 11), padding='same', strides=(4, 4), name='deconv2'))(deconv1)
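To check that the decoder restores the input resolution, the spatial size can be traced layer by layer. This sketch assumes the usual rule for padding='same': a strided convolution yields ceil(size / stride), and a strided transposed convolution yields size * stride. The helper names conv_same and deconv_same are made up for illustration.

```python
import math


def conv_same(size, stride):
    """Output size of a strided convolution with padding='same'."""
    return math.ceil(size / stride)


def deconv_same(size, stride):
    """Output size of a strided transposed convolution with padding='same'."""
    return size * stride


size = 224
trace = [('input', size)]

size = conv_same(size, 4)
trace.append(('conv1', size))
size = conv_same(size, 2)
trace.append(('conv2', size))

# The three ConvLSTM2D layers use stride 1 and padding='same',
# so the spatial size is unchanged.
trace.append(('convlstm1-3', size))

size = deconv_same(size, 2)
trace.append(('deconv1', size))
size = deconv_same(size, 4)
trace.append(('deconv2', size))

for name, s in trace:
    print(f'{name}: {s} x {s}')
```

The trace goes 224, 56, 28, 28, 56, 224, so the output frames have the same height and width as the input frames.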
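I ran into the same problem. In my case the explanation was that there were too many data points before the Flatten layer, which made RAM overflow. Adding extra convolutional layers solved it.

Please provide information about your model. Why do you pass data to fit twice?

You should try doing the train/val split yourself instead of letting Keras do it. It seems you are using too much RAM, and the split itself needs more memory than is available.

Because the output of the autoencoder is also a reconstruction of the input images. It tends to reconstruct the training dataset with a small loss.

After doing the split myself it works fine, thanks for your comment! If I don't add validation-related arguments to model.fit, it works, which means the RAM can handle the data. Is my understanding correct?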
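The manual split suggested above could look roughly like this. The sketch assumes a Keras 2.x install where HDF5Matrix lives in keras.utils.io_utils and accepts start and end arguments; split_index and open_train_val are hypothetical helper names, and the thread does not show how the split was actually done.

```python
def split_index(n_samples, validation_fraction):
    """Index of the first validation sample; everything before it is training data."""
    n_val = int(round(n_samples * validation_fraction))
    return n_samples - n_val


def open_train_val(h5_path, n_samples, validation_fraction=0.15):
    """Open two non-overlapping windows on the same HDF5 dataset.

    Keras is imported lazily so the index arithmetic can be used without it.
    The dataset name 'data' is the one used in the question.
    """
    from keras.utils.io_utils import HDF5Matrix

    cut = split_index(n_samples, validation_fraction)
    train = HDF5Matrix(h5_path, 'data', start=0, end=cut)
    val = HDF5Matrix(h5_path, 'data', start=cut, end=n_samples)
    return train, val


cut = split_index(15200, 0.15)
print(cut)  # 12920
```

The two objects would then be passed as model.fit(train, train, validation_data=(val, val), shuffle='batch', ...) in place of validation_split=0.15, so that each window is read from disk in batches.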