TensorFlow: two GPUs slower than a single GPU


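I have two GPUs (a TitanX (Pascal) and a GTX 1080). I am trying to run a single-threaded graph computation. The graph consists of two independent matrix multiplication chains, each assigned to its own GPU.

Here is the code I am using: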

import tensorflow as tf
import numpy as np
import random
import time
import logging

from tensorflow.python.ops import init_ops
from tensorflow.python.client import timeline


def test():
    n = 5000

    with tf.Graph().as_default():
        A1 = tf.placeholder(tf.float32, shape=[n, n], name='A')
        A2 = tf.placeholder(tf.float32, shape=[n, n], name='A')
        with tf.device('/gpu:0'):
            B1 = A1
            for l in range(10):
                B1 = tf.matmul(B1, A1)

        with tf.device('/gpu:1'):
            B2 = A2
            for l in range(10):
                B2 = tf.matmul(B2, A2)
            C = tf.matmul(B1, B2)

        run_metadata = tf.RunMetadata()
        with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
            start = time.time()
            logging.info('started')
            A1_ = np.random.rand(n, n)
            A2_ = np.random.rand(n, n)
            sess.run([C],
                     feed_dict={A1: A1_, A2: A2_},
                     options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
                     run_metadata=run_metadata)
            logging.info('writing trace')
            trace = timeline.Timeline(step_stats=run_metadata.step_stats)
            with open('timeline.ctf.json', 'w') as trace_file:
                trace_file.write(trace.generate_chrome_trace_format())
            logging.info('trace written')
            end = time.time()
            logging.info('computed')
            logging.info(end - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
    test()

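It takes 20.4 seconds to complete.
If I place all operations on gpu0 (TitanX), it takes 14 seconds to complete.
If I place all operations on gpu1 (GTX 1080), it takes 19.8 seconds to complete.

I can see that TensorFlow finds both GPUs and places all operations on the intended devices. Why is there no speedup when using two GPUs instead of one? Could the problem be that the GPUs are different models (AFAIK CUDA allows this)?

Thanks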

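Edit: I updated the code to use different initial matrices for the two chains, because otherwise TensorFlow seems to apply some optimization.

Here is a link to the timeline profile JSON file:

This timeline raises more questions than it answers: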

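Why does pid 7 (gpu0) have two rows of execution?

What are the long MatMuls in pid 3 and pid 5? (input0 "_recv_A_0/_3", input1 "_recv_A_0/_3", name "MatMul", op "MatMul")

It seems that every operation is executed on gpu0 except for pid 5.

After the long MatMul operations in pid 3 and pid 5 there are many small MatMul operations (not visible in the screenshot). What are they?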

Isn't this because data has to be transferred between the GPUs when computing C? Can you try putting C on the CPU?

with tf.device('/cpu:0'):
  C = tf.matmul(B1, B2)
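
To gauge whether the inter-GPU copy alone could account for several seconds, a rough size estimate helps. The sketch below computes the size of one n x n float32 matrix, which is what has to move before the final product can run. The bandwidth value is an assumed round number chosen for illustration, not a measurement of this machine.

```python
def tensor_bytes(rows, cols, bytes_per_element=4):
    """Size in bytes of a dense rows x cols matrix (float32 by default)."""
    return rows * cols * bytes_per_element


n = 5000
size = tensor_bytes(n, n)  # one chain result has to be copied to the other device

# ASSUMPTION: 10 GB/s is an illustrative round number, not a measured bandwidth.
assumed_bandwidth_bytes_per_s = 10e9
transfer_seconds = size / assumed_bandwidth_bytes_per_s

print("matrix size: %.0f MB" % (size / 1e6))
print("estimated copy time: %.3f s" % transfer_seconds)
```

Under that assumption a single copy costs on the order of milliseconds, so the transfer by itself would not explain a gap of several seconds. The timeline remains the reliable way to check.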

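There is a significant delay the first time a kernel is launched on a GPU, possibly caused by PTXAS compilation. This delay can be on the order of seconds, and it accumulates when you use more than one GPU, so in your case the run is slower because the time is dominated by the extra "initial kernel launch". One way to measure pure computation time is to "pre-warm" by executing each CUDA op at least once on each GPU. Running your benchmark on 2 TitanX cards, I observed the same slowness, but the delay went away once I pre-warmed the kernels.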
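
A back-of-the-envelope estimate is one way to check whether the matmuls themselves could plausibly take 14 to 20 seconds. The sketch below counts the floating-point operations in the graph from the question (two chains of 10 matmuls plus the final product, n = 5000). The throughput figure is an assumed round number for illustration only, not a spec or a measurement of either card.

```python
def matmul_flops(n):
    """Approximate FLOPs for one (n x n) @ (n x n) matmul: about n multiplies
    and n adds for each of the n * n output elements."""
    return 2 * n ** 3


n = 5000
num_matmuls = 10 + 10 + 1  # two chains of 10, plus the final product C
total_flops = num_matmuls * matmul_flops(n)

# ASSUMPTION: 5e12 FLOP/s is an illustrative sustained rate, not a measured one.
assumed_flops_per_s = 5e12
compute_seconds = total_flops / assumed_flops_per_s

print("total work: %.2e FLOPs" % total_flops)
print("estimated compute time: %.2f s" % compute_seconds)
```

Under that assumption the arithmetic accounts for roughly a second, which is consistent with one-time overheads dominating the measurement. Note also that the timed region in the question includes generating the random inputs and writing the trace file.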

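Here is what it looks like before pre-warming:

Here is what it looks like after pre-warming:

Below is your code, modified to do the pre-warming and to remove any TensorFlow->Python transfers.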

import tensorflow as tf

from tensorflow.python.ops import init_ops
from tensorflow.python.client import timeline
import logging, time
import numpy as np

def test():
    n = 5000

    with tf.device('/gpu:0'):
        A1 = tf.Variable(tf.ones([n, n]), name='A1')
        B1 = A1
        for l in range(10):
            B1 = tf.matmul(A1, B1, name="chain1")

    with tf.device('/gpu:1'):
        A2 = tf.Variable(tf.ones([n, n]), name='A2')
        B2 = A2
        for l in range(10):
            B2 = tf.matmul(A2, B2, name="chain2")
        C = tf.matmul(B1, B2)

    run_metadata = tf.RunMetadata()
    start = time.time()
    logging.info('started')
    sess = tf.InteractiveSession(config=tf.ConfigProto(allow_soft_placement=False, log_device_placement=True))
    sess.run(tf.initialize_all_variables())
    # do warm-run
    sess.run([C.op],
             options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
             run_metadata=run_metadata)
    run_metadata = tf.RunMetadata()
    sess.run([C.op],
             options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
             run_metadata=run_metadata)
    logging.info('writing trace')
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open('timeline.ctf.json', 'w') as trace_file:
        trace_file.write(trace.generate_chrome_trace_format(show_memory=True))
    logging.info('trace written')
    end = time.time()
    logging.info('computed')
    logging.info(end - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
    test()
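
The pre-warming pattern can also be separated from TensorFlow entirely. The following is a minimal sketch, not part of the original answer: `time_after_warmup` and `fake_step` are hypothetical names, and `fake_step` is a stand-in workload whose first call is artificially slow to mimic a one-time start-up cost.

```python
import time


def time_after_warmup(fn, warmup_runs=1, timed_runs=3):
    """Call fn() warmup_runs times untimed, then return a list with the
    wall-clock duration in seconds of each of the timed_runs calls."""
    for _ in range(warmup_runs):
        fn()  # absorbs one-time costs (e.g. kernel compilation, lazy init)
    durations = []
    for _ in range(timed_runs):
        start = time.time()
        fn()
        durations.append(time.time() - start)
    return durations


# Hypothetical stand-in workload: slow on the first call, fast afterwards.
state = {"first_call": True}


def fake_step():
    if state["first_call"]:
        state["first_call"] = False
        time.sleep(0.2)  # simulated one-time start-up cost
    time.sleep(0.01)     # simulated steady-state work


durations = time_after_warmup(fake_step)
print("fastest timed run: %.3f s" % min(durations))
```

In the benchmark above, `fn` would correspond to something like `lambda: sess.run(C.op)`, so that only the runs after the first one are measured.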

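You can look at the timeline to see what the bottleneck is. Also, you can run sess.run(C.op) instead of sess.run(C) to exclude the TensorFlow->Python transfer from the timing.

I get the error "TypeError: __init__() got an unexpected keyword argument 'run_metadata'" in the Session constructor. I installed TensorFlow from source in September 2016 and tried reinstalling it from pip (still getting the same error).

@YaroslavBulatov Thanks for the timeline profiler suggestion. I updated the post with more questions.

GPUs have multiple streams, so some things are duplicated in the display: the same computation can be shown in the GPU's "compute" channel as well as in a dedicated "stream" channel. The long runs are probably due to initial kernel launch overhead.

The posted benchmark and pre-warming didn't help. Also, I don't think it should matter, because each GPU performs 500 matrix multiplications before the final one.

By the way, the labels in the timeline are misleading: "gpu:0/stream:22" is actually on gpu:1, as can be seen from log_device_placement.

Thanks for the clarification. But these timeline traces still look strange to me. Why, if I assign all operations to gpu0, are they always computed sequentially (I tried different matrix sizes from 5000 to 20000, and different chain lengths from 100 to 100)? It seems the two chains could be computed in two parallel streams even on a single GPU.

That's correct: TensorFlow does not schedule ops on parallel streams. Instead, each op can use all of the GPU's streams.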