
Python: TensorFlow keeps raising its allocator pool size while GPU memory is barely used

Tags: python, tensorflow, tensorflow-gpu

I am running a TensorFlow job that always gets stuck raising the pool size and never makes progress past that point.

Here is the output:

2017-11-13 19:01:12.841317: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-13 19:01:12.841715: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-13 19:01:12.841729: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-13 19:01:17.941982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: Tesla K20m
major: 3 minor: 5 memoryClockRate (GHz) 0.7055
pciBusID 0000:09:00.0
Total memory: 4.63GiB
Free memory: 4.56GiB
2017-11-13 19:01:18.135538: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x6e48240 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-13 19:01:18.136394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties: 
name: Tesla K20m
major: 3 minor: 5 memoryClockRate (GHz) 0.7055
pciBusID 0000:0a:00.0
Total memory: 4.63GiB
Free memory: 4.56GiB
2017-11-13 19:01:18.324134: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x6e4a680 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-13 19:01:18.325028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 2 with properties: 
name: Tesla K20m
major: 3 minor: 5 memoryClockRate (GHz) 0.7055
pciBusID 0000:0d:00.0
Total memory: 4.63GiB
Free memory: 4.56GiB
2017-11-13 19:01:18.519043: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x60ae510 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-13 19:01:18.519928: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 3 with properties: 
name: Tesla K20m
major: 3 minor: 5 memoryClockRate (GHz) 0.7055
pciBusID 0000:0e:00.0
Total memory: 4.63GiB
Free memory: 4.56GiB
2017-11-13 19:01:18.521497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 2 3 
2017-11-13 19:01:18.521514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y Y Y Y 
2017-11-13 19:01:18.521523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1:   Y Y Y Y 
2017-11-13 19:01:18.521530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 2:   Y Y Y Y 
2017-11-13 19:01:18.521538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 3:   Y Y Y Y 
2017-11-13 19:01:18.521556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K20m, pci bus id: 0000:09:00.0)
2017-11-13 19:01:18.521566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K20m, pci bus id: 0000:0a:00.0)
2017-11-13 19:01:18.521580: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K20m, pci bus id: 0000:0d:00.0)
2017-11-13 19:01:18.521589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K20m, pci bus id: 0000:0e:00.0)
2017-11-13 19:01:24.197527: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2731 get requests, put_count=2675 evicted_count=1000 eviction_rate=0.373832 and unsatisfied allocation rate=0.423288
2017-11-13 19:01:24.197943: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
When I check GPU utilization, the memory is almost unused, as shown below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          Off  | 00000000:09:00.0 Off |                    0 |
| N/A   42C    P0    93W / 225W |    646MiB /  4742MiB |     30%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   33C    P0    43W / 225W |     72MiB /  4742MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m          Off  | 00000000:0D:00.0 Off |                    0 |
| N/A   35C    P0    45W / 225W |     72MiB /  4742MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m          Off  | 00000000:0E:00.0 Off |                    0 |
| N/A   33C    P0    43W / 225W |     72MiB /  4742MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K20m          Off  | 00000000:28:00.0 Off |                    0 |
| N/A   35C    P0    45W / 225W |      0MiB /  4742MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K20m          Off  | 00000000:2B:00.0 Off |                    0 |
| N/A   37C    P0    45W / 225W |      0MiB /  4742MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K20m          Off  | 00000000:30:00.0 Off |                    0 |
| N/A   38C    P0    45W / 225W |      0MiB /  4742MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K20m          Off  | 00000000:33:00.0 Off |                    0 |
| N/A   32C    P0    46W / 225W |      0MiB /  4742MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1554      C   python                                       635MiB |
|    1      1554      C   python                                        61MiB |
|    2      1554      C   python                                        61MiB |
|    3      1554      C   python                                        61MiB |
+-----------------------------------------------------------------------------+
Here is my session configuration:

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.9)
config = tf.ConfigProto(allow_soft_placement=True, gpu_options=gpu_options)
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
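One thing worth noting from the nvidia-smi output above: the same process (PID 1554) holds memory on GPUs 0 through 3, even though only GPU 0 is doing work. A common way to rule out multi-GPU placement as a factor is to hide the unused devices before TensorFlow initializes. This is a minimal sketch, not taken from the original job; the environment variable must be set before the TensorFlow import or session creation:

```python
import os

# Restrict the process to a single GPU by hiding the others from CUDA.
# This must happen before TensorFlow initializes its GPU devices;
# otherwise TF 1.x claims memory on every visible GPU, matching the
# ~61 MiB footprints seen on GPUs 1-3 in the nvidia-smi output above.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0
```

With only one device visible, `/gpu:0` is the sole TensorFlow GPU device, and the other boards stay free for other jobs.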
Is there something I am missing? The job's CPU usage is also very large:

8 CPUs, roughly 30 GB each

I cannot see or understand why this happens.

Comments:

- Can you share some of the TensorFlow job's code? If you want to state explicitly which GPU to use, try `with tf.device('/device:GPU:0'):`.
- This sounds like a bug in the `allow_growth=True` path. Does it still get stuck if you do not set `allow_growth` to `True`? If so, please file an issue with a reproduction.
- @DiegoAgher I haven't tried that yet; I will.
- @mrry Yes, I tried it without setting it to `True`, and it still doesn't work.
- In that case, it sounds like a more serious bug. Please open an issue with a reproduction!