Python/TensorFlow: execution and memory allocation end up on different GPUs
TensorFlow behaves as expected when I run it on a single GPU, but in a multi-GPU setup the code sometimes executes on one GPU while allocating its memory on another. This obviously causes a massive slowdown.

As an example, see the nvidia-smi output below. Here, a colleague of mine is using GPUs 0 and 1 (processes 32918 and 33112), and I start TensorFlow with the following command (before importing TensorFlow), where gpu_id is 2, 3, and 4 for my three processes. As you can see, memory is allocated correctly on GPUs 2, 3, and 4, but the code executes elsewhere! In this case, on GPUs 0, 1, and 7.
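The launch command itself did not survive in the question, but the usual pattern it describes (a sketch assuming the standard `CUDA_VISIBLE_DEVICES` mechanism; `gpu_id` is the variable the questioner names) is to set the environment variable before TensorFlow is imported:

```python
import os

# This must happen before "import tensorflow": once TensorFlow has
# initialized its CUDA context, changing the variable has no effect
# on the running process.
gpu_id = 2  # 2, 3, or 4 for the three processes described above
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

# import tensorflow as tf   # import TensorFlow only after this point
```

Inside the process, the selected device then appears as `/gpu:0`, because `CUDA_VISIBLE_DEVICES` renumbers the visible devices from zero.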
Wed May 17 17:04:01 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:04:00.0     Off |                    0 |
| N/A   41C    P0    75W / 149W |    278MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   36C    P0    89W / 149W |    278MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:08:00.0     Off |                    0 |
| N/A   61C    P0    58W / 149W |   6265MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:09:00.0     Off |                    0 |
| N/A   42C    P0    70W / 149W |   8313MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   51C    P0    55W / 149W |   8311MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:85:00.0     Off |                    0 |
| N/A   29C    P0    68W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   31C    P0    54W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:89:00.0     Off |                    0 |
| N/A   27C    P0    68W / 149W |      0MiB / 11439MiB |     33%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     32918    C   python                                         274MiB |
|    1     33112    C   python                                         274MiB |
|    2     34891    C   ...sadl/anaconda3/envs/tensorflow/bin/python  6259MiB |
|    3     34989    C   ...sadl/anaconda3/envs/tensorflow/bin/python  8309MiB |
|    4     35075    C   ...sadl/anaconda3/envs/tensorflow/bin/python  8307MiB |
+-----------------------------------------------------------------------------+
For some reason, TensorFlow appears to partially ignore the CUDA_VISIBLE_DEVICES setting.
I am not using any device placement directives in my code.
This is with TensorFlow 1.1 on Ubuntu 16.04, and I have seen it happen in a number of different scenarios.
Are there known scenarios in which this can happen? And if so, is there anything I can do about it?

One possible cause is that the nvidia-smi ordering differs from the GPU IDs that CUDA uses. From the CUDA documentation: "Users wishing for consistency should use UUID or PCI bus ID, since device enumeration ordering is not guaranteed to be consistent", and "FASTEST_FIRST causes CUDA to guess which device is fastest using a simple heuristic, and make that device 0, leaving the order of the rest of the devices unspecified. PCI_BUS_ID orders devices by ascending PCI bus ID." See here: It is also discussed here:

I solved the problem. The issue seems to lie with nvidia-smi rather than with TensorFlow. If you run
sudo nvidia-smi -pm 1
to enable persistence mode on the GPUs, the correct status is displayed, e.g.:
Fri May 19 15:28:06 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:04:00.0     Off |                    0 |
| N/A   60C    P0   143W / 149W |   6263MiB / 11439MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:05:00.0     Off |                    0 |
| N/A   46C    P0   136W / 149W |   8311MiB / 11439MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:08:00.0     Off |                    0 |
| N/A   64C    P0   110W / 149W |   8311MiB / 11439MiB |     67%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:09:00.0     Off |                    0 |
| N/A   48C    P0   142W / 149W |   8311MiB / 11439MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 0000:84:00.0     Off |                    0 |
| N/A   32C    P8    27W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 0000:85:00.0     Off |                    0 |
| N/A   26C    P8    28W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 0000:88:00.0     Off |                    0 |
| N/A   28C    P8    26W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 0000:89:00.0     Off |                    0 |
| N/A   25C    P8    28W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     42840    C   ...sadl/anaconda3/envs/tensorflow/bin/python  6259MiB |
|    1     42878    C   ...sadl/anaconda3/envs/tensorflow/bin/python  8307MiB |
|    2     43264    C   ...sadl/anaconda3/envs/tensorflow/bin/python  8307MiB |
|    3      4721    C   python                                        8307MiB |
+-----------------------------------------------------------------------------+
Thanks for the help in figuring this out.

Thanks for the information, but how does that explain the fact that the allocation seems correct while the execution does not?

Yes, I am a bit puzzled by that as well. The readout tells you that no process has allocated memory on GPU 7, yet GPU 7 is at 33% utilization. Whose work is it doing?

This is a server run from the command line, so there is not even a display attached to the machine. It might be some driver bug, but I am not quite sure how that would come about.
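The device-ordering workaround suggested in the first answer can be sketched as follows. `CUDA_DEVICE_ORDER` and its `PCI_BUS_ID` value are the documented CUDA environment variables; combining it with `CUDA_VISIBLE_DEVICES` as shown is an assumption about the questioner's setup:

```python
import os

# Force CUDA to enumerate devices in PCI bus order, so that the IDs
# used by CUDA_VISIBLE_DEVICES match the IDs reported by nvidia-smi
# (the default, FASTEST_FIRST, can order devices differently).
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # now refers to nvidia-smi's GPU 2

# import tensorflow as tf  # import TensorFlow only after both are set
```

Both variables must be set before the `import tensorflow` line, for the same reason as above: the process binds to its CUDA devices at initialization.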