
Python: Why does TensorFlow always use GPU 0?


I ran into a problem while running TensorFlow inference on a multi-GPU setup.

Environment: Python 3.6.4; TensorFlow 1.8.0; CentOS 7.3; 2x NVIDIA Tesla P4

Here is the nvidia-smi output while the system is idle:

Tue Aug 28 10:47:42 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   38C    P0    22W /  75W |      0MiB /  7606MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            Off  | 00000000:00:0D.0 Off |                    0 |
| N/A   39C    P0    23W /  75W |      0MiB /  7606MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The key statements related to my problem:

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

def get_sess_and_tensor(ckpt_path):
    assert os.path.exists(ckpt_path), "file: {} not exist.".format(ckpt_path)
    graph = tf.Graph()
    with graph.as_default():
        od_graph_def = tf.GraphDef()
        with tf.gfile.GFile(ckpt_path, "rb") as fid1:
            od_graph_def.ParseFromString(fid1.read())
            tf.import_graph_def(od_graph_def, name="")
        sess = tf.Session(graph=graph)
    with tf.device('/gpu:1'):
        tensor = graph.get_tensor_by_name("image_tensor:0")
        boxes = graph.get_tensor_by_name("detection_boxes:0")
        scores = graph.get_tensor_by_name("detection_scores:0")
        classes = graph.get_tensor_by_name('detection_classes:0')

    return sess, tensor, boxes, scores, classes
So the problem is: when I set the visible devices to "0,1", even though I set tf.device to GPU 1, when running inference I can see from nvidia-smi that only GPU 0 is used (GPU 0's GPU-Util is high, almost 100%, while GPU 1's is 0%). Why doesn't it use GPU 1?

I want to use both GPUs in parallel, but even with the following code it still only uses GPU 0:

with tf.device('/gpu:0'):
    tensor = graph.get_tensor_by_name("image_tensor:0")
    boxes = graph.get_tensor_by_name("detection_boxes:0")
with tf.device('/gpu:1'):
    scores = graph.get_tensor_by_name("detection_scores:0")
    classes = graph.get_tensor_by_name('detection_classes:0')
Any suggestions would be appreciated.

Thanks,


Wesley

Depending on your setup, the device names may differ.

Run:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
and try using the device name for your second GPU exactly as it appears in that list.

You can use the GPUtil package to pick an unused GPU and filter the CUDA_VISIBLE_DEVICES environment variable accordingly.

This lets you run parallel experiments across all your GPUs.

# Import os to set the environment variable CUDA_VISIBLE_DEVICES
import os
import tensorflow as tf
import GPUtil

# Set CUDA_DEVICE_ORDER so the IDs assigned by CUDA match those from nvidia-smi
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Get the first available GPU
DEVICE_ID_LIST = GPUtil.getFirstAvailable()
DEVICE_ID = DEVICE_ID_LIST[0] # grab first element from list

# Set CUDA_VISIBLE_DEVICES to mask out all other GPUs than the first available device id
os.environ["CUDA_VISIBLE_DEVICES"] = str(DEVICE_ID)

# Since all other GPUs are masked out, the first available GPU will now be identified as GPU:0
device = '/gpu:0'
print('Device ID (unmasked): ' + str(DEVICE_ID))
print('Device ID (masked): ' + str(0))

# Run a minimum working example on the selected GPU
# Start a session
with tf.Session() as sess:
    # Select the device
    with tf.device(device):
        # Declare two numbers and add them together in TensorFlow
        a = tf.constant(12)
        b = tf.constant(30)
        result = sess.run(a+b)
        print('a+b=' + str(result))


Comments:

agtoever: At a minimum, your device name doesn't follow the expected format. It should be /device:&lt;type&gt;:&lt;id&gt;. What happens if you use /device:gpu:{0,1}?

Wesley: @agtoever I've actually seen the /gpu:0 format in many posts, but unfortunately I tried your suggestion too and hit the same problem.

xdurch0: I think you're applying tf.device too late. You need to wrap the code that defines the ops. I'm not sure exactly what happens when importing a graph, but you probably want to move the tf.device wrapper so that it wraps the GraphDef part.

Wesley: @xdurch0 Do you mean controlling the device during training? For training I didn't specify visible devices or tf.device at all, but I think that's unrelated to this problem. If it is related, does that also mean that if I specify GPU 0 during training, inference can only use GPU 0?

xdurch0: What I mean is: you should try something like with tf.device('/gpu:0'): tf.import_graph_def(...), i.e. move the device scope to where the graph is built. Right now you only have a block where you fetch tensors from an already-constructed graph. That's too late; by then the ops have already been placed on devices (GPU 0 by default).