Memory leaks: TensorFlow CIFAR-10 example leaks memory


I am new to TensorFlow. I tried to run the CIFAR-10 example from here:

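I made no changes to the code; I am only trying to run it on multiple GPUs. I am using 6 GPUs and allocating 10 GB of RAM to the job, but after a few minutes the job fails because it exceeds the memory limit. Allocating more memory does not help; it only delays the error. I have tried up to 40 GB.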
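One way to check whether host memory really grows steadily, rather than spiking once at startup, would be to sample the process's resident set size while training runs. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` exposes a `VmRSS:` line in kB; the helper names `parse_vmrss_kb` and `current_rss_kb` are hypothetical and not part of the CIFAR-10 example.

```python
def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status.

    Returns None if no VmRSS line is present.
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like: "VmRSS:     123456 kB"
            return int(line.split()[1])
    return None


def current_rss_kb(pid="self"):
    """Read the current resident set size of a process in kB (Linux only)."""
    with open("/proc/{}/status".format(pid)) as f:
        return parse_vmrss_kb(f.read())
```

Calling `current_rss_kb()` every few hundred training steps and printing the result would show whether the value keeps climbing over time, which is the pattern a leak would produce.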

Here is more information about my system:

=======================================================
Linux mmmdgx01 4.4.0-45-generic 66~14.04.1-Ubuntu SMP Wed Oct 19 15:05:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
DGX_OTA_VERSION=2.0.5
VERSION="14.04.5 LTS, Trusty Tahr"
VERSION_ID="14.04"

== are we in docker ===================================
No

== compiler ===========================================
c++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

=======================================================
Linux mmmdgx01 4.4.0-45-generic 66~14.04.1-Ubuntu SMP Wed Oct 19 15:05:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

== check pips =========================================
numpy (1.11.1)
protobuf (3.2.0)
tensorflow (1.1.0rc1)

== check for virtualenv ===============================
False

== tensorflow import ==================================
tf.VERSION = 1.1.0-rc1
tf.GIT_VERSION = v1.1.0-rc1-272-gf77f19b
tf.COMPILER_VERSION = v1.1.0-rc1-272-gf77f19b
Sanity check: array(, dtype=int32)

=======================================================
LD_LIBRARY_PATH /opt/sw/cuda/8.0/lib64/:/project/DGX/cuda/lib64/:/opt/sw/cuda/8.0/extras/CUPTI/lib64/:/project/DGX/lib
DYLD_LIBRARY_PATH /project/DGX/torch/install/lib:/project/torch7new/install/lib:

== nvidia-smi =========================================
Fri May 12 15:46:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.20                 Driver Version: 375.20                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 0000:06:00.0     Off |                    0 |
| N/A   34C    P0    42W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 0000:07:00.0     Off |                    0 |
| N/A   32C    P0    32W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 0000:0A:00.0     Off |                    0 |
| N/A   34C    P0    33W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 0000:0B:00.0     Off |                    0 |
| N/A   33C    P0    32W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2...  On   | 0000:85:00.0     Off |                    0 |
| N/A   33C    P0    30W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2...  On   | 0000:86:00.0     Off |                    0 |
| N/A   33C    P0    33W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2...  On   | 0000:89:00.0     Off |                    0 |
| N/A   31C    P0    32W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2...  On   | 0000:8A:00.0     Off |                    0 |
| N/A   35C    P0    32W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

==cuda libs===================================================

Here is my job submission script:

#! /bin/bash
#SBATCH --account=AI
#SBATCH --time=167:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH -J TFImgNet
#SBATCH -e tf.err
#SBATCH -o tf.log
#SBATCH --mem=10960
#SBATCH --gres=gpu:6
cpath=$(pwd)
cd ~
source .bashrc
cd $cpath
which python
python cifar10_multi_gpu_train.py --num_gpus 6
Here is the error:

2017-05-12 15:14:07.162709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2 3 4 5
2017-05-12 15:14:07.162718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y Y Y Y Y N
2017-05-12 15:14:07.162721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1:   Y Y Y Y N Y
2017-05-12 15:14:07.162724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2:   Y Y Y Y N N
2017-05-12 15:14:07.162727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 3:   Y Y Y Y N N
2017-05-12 15:14:07.162729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 4:   Y N N N Y Y
2017-05-12 15:14:07.162732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 5:   N Y N N Y Y
2017-05-12 15:14:07.162743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)
2017-05-12 15:14:07.162747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:07:00.0)
2017-05-12 15:14:07.162751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla P100-SXM2-16GB, pci bus id: 0000:0a:00.0)
2017-05-12 15:14:07.162754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla P100-SXM2-16GB, pci bus id: 0000:0b:00.0)
2017-05-12 15:14:07.162756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
2017-05-12 15:14:07.162759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla P100-SXM2-16GB, pci bus id: 0000:86:00.0)
slurmstepd: error: Job 1313520 exceeded memory limit (11240536 > 11223040), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 1313520 ON mmmdgx01 CANCELLED AT 2017-05-12T15:28:58 ***
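
The two numbers in the `slurmstepd` message line up with the `--mem=10960` request if one assumes that Slurm interprets `--mem` in megabytes (its default unit) and reports usage in kilobytes. The short sketch below checks that arithmetic; the function name `slurm_mem_limit_kb` is hypothetical and only illustrates the unit conversion.

```python
def slurm_mem_limit_kb(mem_mb):
    """Convert a --mem request in MB to KB (assumes 1 MB = 1024 KB)."""
    return mem_mb * 1024


limit_kb = slurm_mem_limit_kb(10960)   # value from the #SBATCH --mem line
used_kb = 11240536                     # value from the slurmstepd error message

# How far past the limit the job was when it was killed, in MB
overshoot_mb = (used_kb - limit_kb) / 1024.0

print("limit: {} KB, used: {} KB, overshoot: {:.1f} MB".format(
    limit_kb, used_kb, overshoot_mb))
```

Under that assumption the computed limit matches the 11223040 reported in the log, so the job was killed shortly after crossing the requested host-memory limit, about 15 minutes after the devices were created.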