Python 2.7: TensorFlow underrun and resource exhausted errors


I'm in high school and I'm trying to do a project involving neural networks. I'm using Ubuntu and trying to do reinforcement learning with TensorFlow, but whenever I train a neural network I consistently get lots of underrun warnings. They take the form ALSA lib pcm.c:7963:(snd_pcm_recover) underrun occurred. This message is printed to the screen more and more frequently as training proceeds. Eventually I get a ResourceExhaustedError and the program terminates. Here is the full error message:

W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[320000,512]
Traceback (most recent call last):
  File "./train.py", line 121, in <module>
    loss, _ = model.train(minibatch, gamma, sess) # Train the model based on the batch, the discount factor, and the tensorflow session.
  File "/home/perrin/neural/dqn.py", line 174, in train
    return sess.run([self.loss, self.optimize], feed_dict=self.feed_dict) # Runs the training.  This is where the underrun errors happen
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[320000,512]
     [[Node: gradients/fully_connected/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](dropout/mul, gradients/fully_connected/BiasAdd_grad/tuple/control_dependency)]]

Caused by op u'gradients/fully_connected/MatMul_grad/MatMul_1', defined at:
  File "./train.py", line 72, in <module>
    model = AC_Net([None, 201, 201, 3], 5, trainer) # This creates the neural network using the imported AC_Net class.
  File "/home/perrin/neural/dqn.py", line 128, in __init__
    self.optimize = trainer.minimize(self.loss) # This tells the trainer to adjust the weights in such a way as to minimize the loss.  This is what actually
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 269, in minimize
    grad_loss=grad_loss)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 335, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 482, in gradients
    in_grads = grad_fn(op, *out_grads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_grad.py", line 731, in _MatMulGrad
    math_ops.matmul(op.inputs[0], grad, transpose_a=True))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1729, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1442, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

...which was originally created as op u'fully_connected/MatMul', defined at:
  File "./train.py", line 72, in <module>
    model = AC_Net([None, 201, 201, 3], 5, trainer) # This creates the neural network using the imported AC_Net class.
  File "/home/perrin/neural/dqn.py", line 63, in __init__
    net = slim.fully_connected(net, 512, activation_fn=tf.nn.elu, scope='fully_connected') # Feeds the input through a fully connected layer
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1350, in fully_connected
    outputs = standard_ops.matmul(inputs, weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1729, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1442, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[320000,512]
     [[Node: gradients/fully_connected/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](dropout/mul, gradients/fully_connected/BiasAdd_grad/tuple/control_dependency)]]
Here is the code I use to train the neural network:

#!/usr/bin/python
import game_env, move_right, move_right_with_obs, random, inspect, os
import tensorflow as tf
import numpy as np
from dqn import AC_Net

def process_outputs(x):
    a = [int(x > 2), int(x % 2 == 0 and x > 0)*2 - int(x > 0)]
    return a

environment = game_env # The environment to use
env_name = str(inspect.getmodule(environment).__name__) # The name of the environment
ep_length = 2000
num_episodes = 20
total_steps = ep_length*num_episodes # The total number of steps
model_path = '/home/perrin/neural/nn/' + env_name
learning_rate = 1e-4 # The learning rate
trainer = tf.train.AdamOptimizer(learning_rate=learning_rate) # The gradient descent optimizer to use
first_epsilon = 0.6 # The initial chance of random action
final_epsilon = 0.01 # The final chance of random action
gamma = 0.9
anneal_steps = 35000 # How many steps it takes to go from the initial to the final chance of random action
count = 0 # Keeps track of how many steps we've run
experience_buffer = [] # Stores the agent's experiences in a list
buffer_size = 10000 # How big the experience buffer can be
train_step = 256 # How often to train the model
batches_per_train = 10
save_step = 500 # How often to save the trained model
batch_size = 256 # How many experiences to train on at once
env_size = 500 # How many pixels tall and wide the environment should be
load_model = True # Whether or not to load a pretrained model
train = True # Whether or not to train the model
test = False # Whether or not to test the model

tf.reset_default_graph()
sess = tf.InteractiveSession()
model = AC_Net([None, 201, 201, 3], 5, trainer)
env = environment.env(env_size)
action = [0, 0]
state = env.step(True, action)
saver = tf.train.Saver() # This saves the model
epsilon = first_epsilon
tf.global_variables_initializer().run()
if load_model:
    ckpt = tf.train.get_checkpoint_state(model_path)
    saver.restore(sess, ckpt.model_checkpoint_path)
    print 'Model loaded'
prev_out = None
while count < total_steps:
    if len(experience_buffer) > buffer_size:
        experience_buffer.pop(0)
    if count % train_step == 0 and count > 0:
        print 'Training model'
        for i in range(batches_per_train):
            # Get a random sample of experiences and train the model on it.
            x = random.randint(0, len(experience_buffer) - batch_size)
            minibatch = np.array(experience_buffer[x:x + batch_size])
            loss, _ = model.train(minibatch, gamma, sess)
            print 'Loss for batch', str(i + 1) + ':', loss
    if count % save_step == 0 and count > 0:
        saver.save(sess, model_path + '/model-' + str(count) + '.ckpt')
        print 'Model saved'
    if count % ep_length == 0 and count > 0:
        print 'Starting new episode'
        env = environment.env(env_size)
    if epsilon > final_epsilon:
        epsilon -= (first_epsilon - final_epsilon)/anneal_steps
    count += 1

You are running out of memory. Your network probably requires more memory than you have available to run, so the first step in tracking down the excessive memory usage is to figure out what is using so much memory.

Here is one way to do that using timelines and StatsSummarizer:
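The first step of that approach is collecting per-step RunMetadata in TF 1.x. A minimal sketch, assuming the AC_Net instance exposes loss, optimize, and feed_dict attributes (the traceback suggests it does); the collected step_stats feed both StatsSummarizer and the timeline tools below:

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
# Run one training step with full tracing enabled; step_stats on the
# metadata will then contain per-node timing and memory information.
loss, _ = sess.run([model.loss, model.optimize],
                   feed_dict=model.feed_dict,
                   options=run_options,
                   run_metadata=run_metadata)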

The StatsSummarizer output includes several tables; one of them lists tensors sorted by top memory usage. You should check that nothing in it is unexpectedly large.

You can also look at the memory timeline using the Chrome visualizer.
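A minimal sketch of that, assuming run_metadata was collected as above; generate_chrome_trace_format(show_memory=True) is part of the TF 1.x timeline module:

from tensorflow.python.client import timeline

# Convert the collected step stats into a Chrome trace file, including
# the memory timeline.
tl = timeline.Timeline(run_metadata.step_stats)
trace = tl.generate_chrome_trace_format(show_memory=True)
with open('timeline.json', 'w') as f:
    f.write(trace)
# Open chrome://tracing in Chrome and load timeline.json to view it.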

A more advanced technique is to plot the timeline of memory allocations and deallocations, as described in this post.
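As a rough sketch of the underlying idea, the raw allocator numbers can be pulled straight out of the step_stats collected above and sorted. The field names here follow the TF 1.x StepStats proto and may differ between versions:

# Summarize peak memory per node from the collected step_stats.
node_mem = []
for dev in run_metadata.step_stats.dev_stats:
    for node in dev.node_stats:
        for mem in node.memory:  # AllocatorMemoryUsed entries
            node_mem.append((mem.peak_bytes, mem.allocator_name, node.node_name))

# Print the 20 nodes with the largest peak allocations (Python 2.7 print).
for peak, allocator, name in sorted(node_mem, reverse=True)[:20]:
    print peak, allocator, name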

In theory, memory usage shouldn't grow between steps as long as you aren't creating new stateful ops (Variables), but I've found that global memory allocation can grow if the sizes of the tensors change between steps.
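One way to act on that observation (a sketch, not from the original answer) is to force every minibatch to an identical shape before feeding it, so each run call can reuse the allocations from the previous one. It uses np, experience_buffer, x, and model from the training script above; pad_batch is a hypothetical helper:

FIXED_BATCH = 256  # always feed exactly this many experiences

def pad_batch(batch, size):
    # Hypothetical helper: repeat entries until the batch reaches a fixed
    # length, so the fed tensors have identical shapes on every run call.
    while len(batch) < size:
        batch = np.concatenate([batch, batch[:size - len(batch)]])
    return batch[:size]

minibatch = pad_batch(np.array(experience_buffer[x:x + FIXED_BATCH]), FIXED_BATCH)
loss, _ = model.train(minibatch, gamma, sess)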


A workaround is to periodically save your parameters to a checkpoint and restart your script.
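A minimal sketch of that workaround, spliced into the training loop above; the restart mechanism (os.execv) and the RESTART_EVERY interval are assumptions, not part of the original answer:

import os, sys

RESTART_EVERY = 10000  # hypothetical restart interval, in steps

if count % RESTART_EVERY == 0 and count > 0:
    saver.save(sess, model_path + '/model-' + str(count) + '.ckpt')
    # Re-exec the script in a fresh process; since load_model is True,
    # startup restores the checkpoint and training resumes with a clean
    # TensorFlow allocator.
    os.execv(sys.executable, [sys.executable] + sys.argv)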

You're running out of memory. Can you try a smaller batch size?

@YaroslavBulatov Thanks for the suggestion. I tried a batch size of 10, but I still got all the same errors.

What about a batch size of 1? If you're still running out of memory, you need to make the network smaller or use a machine with more memory.

@YaroslavBulatov The same thing happens with a batch size of 1. Since it doesn't run out of memory right away, I think it is somehow filling up memory as it trains. Is there any way to deal with something like this other than using a smaller network or getting more memory?

In theory, memory shouldn't grow between run calls. In practice, I've found that it does grow if you vary the tensor sizes; that is, if the tensors all have the same sizes, it just reuses the memory pre-allocated for those sizes during the previous run call. Also, I've run A3C with a batch size of 2000 and it fit in TitanX memory. If you provide a reproducible example, I can profile it and see where the RAM is going.
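To check whether memory really does grow across run calls, as the comments debate, one can log the process's peak resident set size every few steps. A minimal sketch using Python 2.7's standard resource module (not part of the original thread), dropped into the training loop:

import resource

if count % 100 == 0:
    # ru_maxrss is reported in kilobytes on Linux and only ever increases,
    # so a steady climb across training steps points to a leak.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print 'step', count, 'peak RSS:', peak_kb, 'kB'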