Python — Resource exhausted: OOM when allocating tensor, only on GPU

I tried to run several different ML architectures, all vanilla and without any modifications (git clone -> python train.py). The result is always the same: a segmentation fault, or Resource exhausted: OOM when allocating tensor.

When running on my CPU only, the program finishes successfully.
I am running the session with:

    config.gpu_options.per_process_gpu_memory_fraction = 0.33
    config.gpu_options.allow_growth = True
    config.allow_soft_placement = True
    config.log_device_placement = True
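For context, here is a minimal sketch of how such options are typically attached to a TF 1.x session; the ConfigProto construction, the session itself, and the training-loop placeholder are not from the question and only illustrate the standard pattern:

    import tensorflow as tf  # TF 1.x API

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.33  # use at most 33% of GPU memory
    config.gpu_options.allow_growth = True                     # grow allocations on demand
    config.allow_soft_placement = True                         # fall back to CPU if no GPU kernel exists
    config.log_device_placement = True                         # log which device each op lands on

    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        # ... the cloned train.py would run its training loop here ...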
However, the result is:

2019-03-11 20:23:26.845851: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ***************************************************************x**********____**********____**_____*
2019-03-11 20:23:26.845885: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[32,128,1024,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):

2019-03-11 20:23:16.841149: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-11 20:23:16.841191: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-11 20:23:26.841486: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 640.00MiB.  Current allocation summary follows.
2019-03-11 20:23:26.841566: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256):   Total Chunks: 195, Chunks in use: 195. 48.8KiB allocated for chunks. 48.8KiB in use in bin. 23.3KiB client-requested in use in bin.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[32,128,1024,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node transform_net1/tconv2/bn/moments/SquaredDifference (defined at /home/dvir/CLionProjects/gml/Dvir/FlexKernels/utils/tf_util.py:504)  = SquaredDifference[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transform_net1/tconv2/BiasAdd, transform_net1/tconv2/bn/moments/mean)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
     [[{{node div/_113}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1730_div", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
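The hint at the end of the log refers to TensorFlow's RunOptions; a minimal sketch of enabling it on a session call is below, where sess and train_op stand in for whatever the cloned train.py actually uses:

    import tensorflow as tf  # TF 1.x API

    # Ask TF to list the live tensor allocations if an OOM occurs, which shows
    # which ops are holding GPU memory at the time of the failure.
    run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

    # sess.run(train_op, options=run_options)  # `sess` / `train_op` are placeholders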
I am running with:

tensorflow-gpu 1.12
tensorflow 1.13
The GPU is:

GeForce RTX 2080TI

The model is , and it has already been tested successfully on another machine with a 1080 Ti.

As stated, the line config.gpu_options.per_process_gpu_memory_fraction = 0.33 determines the fraction of the visible GPU's total memory that may be allocated (33% in your case). Increasing this value, or removing the line entirely (100%), will make more of the required memory available.
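For reference, the failing allocation in the log is a float32 tensor of shape [32, 128, 1024, 40], which works out to exactly the 640 MiB the allocator reports, so a 33% cap can plausibly be too tight. A hedged sketch of the two options suggested above (the 0.9 value is illustrative, not from the question):

    import tensorflow as tf  # TF 1.x API

    # Size of the failing tensor: 32 * 128 * 1024 * 40 float32 values, 4 bytes each.
    print(32 * 128 * 1024 * 40 * 4 / 2**20, "MiB")  # -> 640.0, matching the bfc_allocator log

    # Option 1: raise the per-process cap.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.9

    # Option 2: omit the fraction entirely so the whole visible GPU may be used,
    # keeping allow_growth so memory is still claimed on demand.
    config_full = tf.ConfigProto()
    config_full.gpu_options.allow_growth = True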

For TensorFlow 2.2.0, this script works:

import tensorflow as tf

if tf.config.list_physical_devices('GPU'):
    physical_devices = tf.config.list_physical_devices('GPU')
    # Allocate GPU memory on demand instead of reserving it all at startup.
    tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)
    # Cap the first GPU at ~4 GB (memory_limit is given in MB) via a single virtual device.
    tf.config.experimental.set_virtual_device_configuration(physical_devices[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4000)])
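Note that memory_limit in VirtualDeviceConfiguration is given in MB, so 4000 caps the virtual device at roughly 4 GB, while set_memory_growth makes TensorFlow claim memory on demand instead of reserving the whole GPU at startup. Some TensorFlow builds reject combining memory growth with a virtual-device memory limit, so if this snippet raises an error, keep only the call you actually need.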

It looks like your GPU is running out of memory. What does your model look like? It could also be that your GPU resources are already reserved by another process (e.g. a TensorFlow session that was never closed). What do you get if you type nvidia-smi in a shell?

@DsCpp I ran into something similar. I'm currently on CUDA 10.0 and cuDNN 7.5. You mentioned that upgrading the drivers solved this. Which CUDA/cuDNN versions are you using?

@Jed I'm currently on CUDA 10 and cuDNN 7.6, but in the end I learned that my model was simply too big for my hardware; after reducing the batch size and implementing it as a multi-GPU model, the OOMs stopped.

@DsCpp Thanks for following up. I'm currently on 10 and 7.5, recently upgraded from 7.2; 7.5 works better, but I still occasionally get OOMs regardless of batch size and model size. I'm running TF 2.0 beta1, though, so there may be a memory issue that hasn't been fixed yet.

Unfortunately, the problem was probably in my CUDA drivers. After reinstalling Ubuntu and the drivers, everything works. (Setting the _fraction to 1 did not help, since it then allocated 100% of the available memory.)

Glad it works, but I would expect setting the fraction to have an effect. Maybe you can try setting it to different values now that the problem is solved and see the effect.

Since the model is relatively small and the memory is 12 GB (2080 Ti), even per_process_gpu_memory_fraction = 0.1 should have been enough. I think a nasty bug in the CUDA driver, together with a bad NVIDIA driver, caused it to allocate all the memory immediately regardless of which job it tried to run.
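To follow up on the first comment's suggestion, one way to check from Python whether another process is already holding GPU memory is to shell out to nvidia-smi; the helper below is a hypothetical convenience wrapper, not part of the question's code:

    import subprocess

    def gpu_memory_mib():
        """Hypothetical helper: query total/used/free GPU memory (in MiB) via nvidia-smi."""
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=memory.total,memory.used,memory.free",
             "--format=csv,noheader,nounits"],
            encoding="utf-8",
        )
        # One CSV line per GPU, e.g. "11019, 345, 10674"
        return [tuple(int(v) for v in line.split(",")) for line in out.strip().splitlines()]

    for i, (total, used, free) in enumerate(gpu_memory_mib()):
        print(f"GPU {i}: total={total} MiB, used={used} MiB, free={free} MiB")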