Python out of memory - Keras TensorFlow GPU - gradient accumulation

I am running a simple autoencoder on a very large time-series dataset. Specifically:

  • My training set consists of 500000 time series
  • My validation set consists of 100000 time series
  • My test set consists of 100000 time series

Each time series has 5994 time components.

I am running the training of the autoencoder on a cluster that I access via ssh. The cluster is equipped with GPUs, so I installed tensorflow-gpu to take advantage of GPU acceleration. The cluster has 2 Nvidia Tesla K80 cards.

The problem I am facing is that my dataset seems to be too large for the cluster to handle: even when I request a large amount of memory in the job script submitted to the SLURM scheduler, I keep getting MemoryErrors in the output.

I started looking into gradient accumulation as a way to train my network despite this problem. However, I have no experience with it, and I am struggling to understand what modifications my code would need.

Below is the code of my AE in Keras (with the tensorflow-gpu backend). What changes are needed to implement gradient accumulation?

from __future__ import division
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Flatten, Lambda, Activation, Conv1D, MaxPooling1D, UpSampling1D, Reshape, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras import optimizers
from tensorflow.keras import backend as K  # use the tf.keras backend; mixing standalone keras with tensorflow.keras causes errors
from tensorflow.keras.layers import Add

import tensorflow as tf

import scipy.io
import sys
import matplotlib.pyplot as plt
import numpy as np
import copy
import h5py
import random


####################
### DATA READING ###
####################


# reading training
train = np.load('./training_0.npy')
for i in range(1, 10):
    train_2 = np.load('./training_{}.npy'.format(i))    
    train = np.vstack((train, train_2))

print('training shape', np.shape(train))   # (500000, 5994)

# reading validation
val = np.load('./training_10.npy')  
val_2 = np.load('./training_11.npy')
val = np.vstack((val, val_2))

print('validation shape', np.shape(val))  # (100000, 5994)

#reading testing
test = np.load('./training_12.npy') 
test_2 = np.load('./training_13.npy')
test = np.vstack((test, test_2))

print('testing shape', np.shape(test))   # (100000, 5994)


# n. time components
number_of_time_components = np.shape(train)[1]   # 5994


###################
### AUTOENCODER ###
###################

# network parameters
Dense_1 = 2048
Dense_2 = 1024
Dense_3 = 512
Dense_4 = 256

# training parameters
epochs = 20
filedescriptor = 'AE'   # tag used in output file names; undefined in the original script, value assumed here


def Encoder():
    encoder_input = Input(batch_shape=(None, number_of_time_components))
    encoded = Dense(Dense_1,activation = 'tanh')(encoder_input)
    encoded = Dense(Dense_2,activation = 'tanh')(encoded)
    encoded = Dense(Dense_3,activation = 'tanh')(encoded)
    encoded = Dense(Dense_4,activation = 'tanh')(encoded)
    return Model(encoder_input, encoded)


def DecoderAE(encoder_input, encoded_input):
    decoded_3 = Dense(Dense_3,activation = 'tanh', name='dec_3')(encoded_input)
    decoded_2 = Dense(Dense_2,activation = 'tanh', name='dec_2')(decoded_3)
    decoded_1 = Dense(Dense_1,activation = 'tanh', name='dec_1')(decoded_2)
    decoded = Dense(number_of_time_components,activation = 'tanh', name='dec_out')(decoded_1)
    return Model(encoder_input, decoded)


encoder = Encoder()
AE = DecoderAE(encoder.input, encoder.output)
AE.compile(optimizer='adam', loss='mse')


# train the model
history = AE.fit(x = train, y = train,
                    epochs=epochs,
                    batch_size=50000,
                    validation_data = (val, val))


# plot loss
loss = history.history['loss']
val_loss = history.history['val_loss']
range_epochs = range(epochs)
plt.figure()
plt.plot(range_epochs, loss, 'bo', label='Training loss')
plt.plot(range_epochs, val_loss, 'b', label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and validation loss')
plt.legend()
plt.savefig('IMAGES/loss_{}.pdf'.format(filedescriptor))
plt.close()


# predictions for testing
prediction = AE.predict(test)


# plot testing predictions vs original
fig, ax = plt.subplots(5, figsize=(15,30))
for i in range(5):
        ax[i].plot(test[10000*i], color='blue', label='Original')  # 'test_original' was undefined; plot the test set loaded above
        ax[i].plot(prediction[10000*i], color='red', label='AE encoded-decoded')
        ax[i].set_xlabel('Time components', fontsize='x-large')
        ax[i].set_ylabel('Amplitude', fontsize='x-large')
        ax[i].set_title('Time series n. {:}'.format(500000+10000*i+1), fontsize='x-large')
        ax[i].legend(fontsize='large')
plt.subplots_adjust(hspace=1)
fig.savefig('IMAGES/COMPRESSION_test_{}.pdf'.format(filedescriptor))
plt.close()
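For completeness, here is a minimal sketch of what gradient accumulation can look like with a TF 2.x custom training loop. This is not part of the original script: accum_steps and micro_batch are assumed values, AE is the model built above, and the from_tensor_slices pipeline still keeps the whole array in memory (it is only meant to illustrate the accumulation mechanics):

import tensorflow as tf

accum_steps = 8       # number of micro-batches to accumulate (assumed value)
micro_batch = 256     # per-step batch size; effective batch = accum_steps * micro_batch

optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

# one persistent gradient buffer per trainable variable, initialised to zero
accum_grads = [tf.Variable(tf.zeros_like(v), trainable=False)
               for v in AE.trainable_variables]

@tf.function
def train_step(x, apply_now):
    with tf.GradientTape() as tape:
        recon = AE(x, training=True)
        # divide by accum_steps so the accumulated sum equals the mean loss
        loss = loss_fn(x, recon) / accum_steps
    grads = tape.gradient(loss, AE.trainable_variables)
    for buf, g in zip(accum_grads, grads):
        buf.assign_add(g)
    if apply_now:
        # apply the accumulated gradients, then reset the buffers
        optimizer.apply_gradients(zip(accum_grads, AE.trainable_variables))
        for buf in accum_grads:
            buf.assign(tf.zeros_like(buf))
    return loss

dataset = tf.data.Dataset.from_tensor_slices(train.astype('float32')).batch(micro_batch)
for epoch in range(epochs):
    for step, x in enumerate(dataset):
        train_step(x, apply_now=(step + 1) % accum_steps == 0)

The point is that only one micro-batch plus one set of gradient buffers has to fit on the GPU at a time, while the optimizer still sees updates computed from the effective batch of accum_steps * micro_batch samples.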
EDIT

To answer a question from the comments, here is the most recent error I ran into, a segmentation fault:

Skipping registering GPU devices...
2020-03-17 11:59:38.147624: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-17 11:59:38.155618: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300180000 Hz
2020-03-17 11:59:38.155743: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5569424a0c00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-17 11:59:38.155761: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-03-17 11:59:38.203600: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5569424a34e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-03-17 11:59:38.203631: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-03-17 11:59:38.203714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-17 11:59:38.203724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      
2020-03-17 11:59:39.235137: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 23976000000 exceeds 10% of system memory.
2020-03-17 11:59:55.898563: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 23976000000 exceeds 10% of system memory.
containerscript.sh: line 4:  3654 Segmentation fault      python Autoencoder.py
srun: error: compute-gpu-0-0: task 0: Exited with exit code 139
And here is the job script used to launch the job on SLURM:

#!/bin/bash
#SBATCH -p GPU
#SBATCH -N1
#SBATCH -n1
#SBATCH --mem=100000
#SBATCH --gres=gpu:k80:1

./containerscript.sh
where ./containerscript.sh simply contains:

python Autoencoder.py
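As a side note, the 23976000000-byte allocation in the log above is exactly 500000 × 5994 × 8 bytes, i.e. the entire training array materialised in host RAM as float64. Below is a hedged sketch (not from the original code; the NpyBatches class and its parameters are illustrative) of serving the same training_*.npy files through a keras.utils.Sequence backed by memory-mapped arrays, so that only one float32 batch is in memory at a time:

import numpy as np
from tensorflow.keras.utils import Sequence

class NpyBatches(Sequence):
    """Yields (x, x) mini-batches from memory-mapped .npy files."""
    def __init__(self, paths, batch_size=256):
        # mmap_mode='r' keeps the data on disk instead of loading it into RAM
        self.arrays = [np.load(p, mmap_mode='r') for p in paths]
        self.offsets = np.cumsum([0] + [len(a) for a in self.arrays])
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(self.offsets[-1] / self.batch_size))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        stop = min(start + self.batch_size, self.offsets[-1])
        parts = []
        for a, off in zip(self.arrays, self.offsets[:-1]):
            lo, hi = max(start - off, 0), min(stop - off, len(a))
            if lo < hi:
                # slicing the memmap reads only this batch; the float32 cast halves its size
                parts.append(np.asarray(a[lo:hi], dtype=np.float32))
        x = np.concatenate(parts, axis=0)
        return x, x   # for an autoencoder the input is also the target

train_gen = NpyBatches(['./training_{}.npy'.format(i) for i in range(10)])
val_gen = NpyBatches(['./training_{}.npy'.format(i) for i in range(10, 12)])
# AE.fit(train_gen, epochs=epochs, validation_data=val_gen)

Keras accepts a Sequence directly in fit, so the full ~24 GB array never has to exist at once; this can also be combined with the gradient accumulation sketch above.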

Would you mind sharing the error? @DanielMöller I shared my most recent error, which is a segmentation fault. I have also uploaded the job script used to launch the job (it specifies the requested memory, etc.).