Python — Resource exhausted: OOM when allocating tensor, only on GPU

I tried to run several different ML architectures, all vanilla and without any modifications (git clone -> python train.py). The result is always the same: a segmentation fault, or Resource exhausted: OOM when allocating tensor.

When running on my CPU only, the program finishes successfully.
I am running the session with:

    config.gpu_options.per_process_gpu_memory_fraction = 0.33
    config.gpu_options.allow_growth = True
    config.allow_soft_placement = True
    config.log_device_placement = True
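For context, here is a minimal sketch of how such options are typically attached to a TF 1.x session; the ConfigProto construction, the session itself, and the training-loop placeholder are not from the question and only illustrate the standard pattern:

    import tensorflow as tf  # TF 1.x API

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.33  # use at most 33% of GPU memory
    config.gpu_options.allow_growth = True                     # grow allocations on demand
    config.allow_soft_placement = True                         # fall back to CPU if no GPU kernel exists
    config.log_device_placement = True                         # log which device each op lands on

    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        # ... the cloned train.py would run its training loop here ...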
However, the result is:

2019-03-11 20:23:26.845851: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ***************************************************************x**********____**********____**_____*
2019-03-11 20:23:26.845885: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[32,128,1024,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):

2019-03-11 20:23:16.841149: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-11 20:23:16.841191: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-11 20:23:26.841486: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 640.00MiB.  Current allocation summary follows.
2019-03-11 20:23:26.841566: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256):   Total Chunks: 195, Chunks in use: 195. 48.8KiB allocated for chunks. 48.8KiB in use in bin. 23.3KiB client-requested in use in bin.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[32,128,1024,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node transform_net1/tconv2/bn/moments/SquaredDifference (defined at /home/dvir/CLionProjects/gml/Dvir/FlexKernels/utils/tf_util.py:504)  = SquaredDifference[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transform_net1/tconv2/BiasAdd, transform_net1/tconv2/bn/moments/mean)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
     [[{{node div/_113}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1730_div", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
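The hint at the end of the log refers to TensorFlow's RunOptions; a minimal sketch of enabling it on a session call is below, where sess and train_op stand in for whatever the cloned train.py actually uses:

    import tensorflow as tf  # TF 1.x API

    # Ask TF to list the live tensor allocations if an OOM occurs, which shows
    # which ops are holding GPU memory at the time of the failure.
    run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

    # sess.run(train_op, options=run_options)  # `sess` / `train_op` are placeholders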
I am running with:

tensorflow-gpu 1.12
tensorflow 1.13
The GPU is:

GeForce RTX 2080TI

The model is , and it has already been tested successfully on another machine with a 1080 Ti.

As stated, the line config.gpu_options.per_process_gpu_memory_fraction = 0.33 determines the fraction of the visible GPU's total memory that may be allocated (33% in your case). Increasing this value, or removing the line entirely (100%), will make more of the required memory available.
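For reference, the failing allocation in the log is a float32 tensor of shape [32, 128, 1024, 40], which works out to exactly the 640 MiB the allocator reports, so a 33% cap can plausibly be too tight. A hedged sketch of the two options suggested above (the 0.9 value is illustrative, not from the question):

    import tensorflow as tf  # TF 1.x API

    # Size of the failing tensor: 32 * 128 * 1024 * 40 float32 values, 4 bytes each.
    print(32 * 128 * 1024 * 40 * 4 / 2**20, "MiB")  # -> 640.0, matching the bfc_allocator log

    # Option 1: raise the per-process cap.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.9

    # Option 2: omit the fraction entirely so the whole visible GPU may be used,
    # keeping allow_growth so memory is still claimed on demand.
    config_full = tf.ConfigProto()
    config_full.gpu_options.allow_growth = True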

For TensorFlow 2.2.0, this script works:

import tensorflow as tf

if tf.config.list_physical_devices('GPU'):
    physical_devices = tf.config.list_physical_devices('GPU')
    # Allocate GPU memory on demand instead of reserving it all at startup.
    tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)
    # Cap the first GPU at ~4 GB (memory_limit is given in MB) via a single virtual device.
    tf.config.experimental.set_virtual_device_configuration(physical_devices[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4000)])
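Note that memory_limit in VirtualDeviceConfiguration is given in MB, so 4000 caps the virtual device at roughly 4 GB, while set_memory_growth makes TensorFlow claim memory on demand instead of reserving the whole GPU at startup. Some TensorFlow builds reject combining memory growth with a virtual-device memory limit, so if this snippet raises an error, keep only the call you actually need.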

It looks like your GPU is running out of memory. What does your model look like? It could also be that your GPU resources are already reserved by another process (e.g. a TensorFlow session that was never closed). What do you get if you type nvidia-smi in a shell?

@DsCpp I ran into something similar. I'm currently on CUDA 10.0 and cuDNN 7.5. You mentioned that upgrading the drivers solved this. Which CUDA/cuDNN versions are you using?

@Jed I'm currently on CUDA 10 and cuDNN 7.6, but in the end I learned that my model was simply too big for my hardware; after reducing the batch size and implementing it as a multi-GPU model, the OOMs stopped.

@DsCpp Thanks for following up. I'm currently on 10 and 7.5, recently upgraded from 7.2; 7.5 works better, but I still occasionally get OOMs regardless of batch size and model size. I'm running TF 2.0 beta1, though, so there may be a memory issue that hasn't been fixed yet.

Unfortunately, the problem was probably in my CUDA drivers. After reinstalling Ubuntu and the drivers, everything works. (Setting the _fraction to 1 did not help, since it then allocated 100% of the available memory.)

Glad it works, but I would expect setting the fraction to have an effect. Maybe you can try setting it to different values now that the problem is solved and see the effect.

Since the model is relatively small and the memory is 12 GB (2080 Ti), even per_process_gpu_memory_fraction = 0.1 should have been enough. I think a nasty bug in the CUDA driver, together with a bad NVIDIA driver, caused it to allocate all the memory immediately regardless of which job it tried to run.
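To follow up on the first comment's suggestion, one way to check from Python whether another process is already holding GPU memory is to shell out to nvidia-smi; the helper below is a hypothetical convenience wrapper, not part of the question's code:

    import subprocess

    def gpu_memory_mib():
        """Hypothetical helper: query total/used/free GPU memory (in MiB) via nvidia-smi."""
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=memory.total,memory.used,memory.free",
             "--format=csv,noheader,nounits"],
            encoding="utf-8",
        )
        # One CSV line per GPU, e.g. "11019, 345, 10674"
        return [tuple(int(v) for v in line.split(",")) for line in out.strip().splitlines()]

    for i, (total, used, free) in enumerate(gpu_memory_mib()):
        print(f"GPU {i}: total={total} MiB, used={used} MiB, free={free} MiB")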