Python TensorFlow: one network, two GPUs?

python machine-learning neural-network tensorflow

I have a convolutional neural network with two different output streams:

                         input
                           |
                         (...) <-- several convolutional layers
                           |
                       _________
    (several layers)   |       |    (several layers)
    fully-connected    |       |    fully-connected
    output stream 1 -> |       | <- output stream 2

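With the two streams placed on separate GPUs via with tf.device(...) blocks, training runs very slowly (slower than training on a single GPU) and sometimes produces NaN values in the output. I thought this might be because the with statements were not synchronized properly, so I added control_dependencies and explicitly placed the conv layers on /gpu:0: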

...placeholders...  # x -> input, y -> labels

with tf.device("/gpu:0"):
    with tf.control_dependencies([x, y]):
        ...conv layers...
        h_conv_flat = tf.reshape(h_conv_last, ...)

with tf.device("/gpu:0"):
    with tf.control_dependencies([h_conv_flat]):
        ...stream 1 layers...
        nn_out_1 = tf.matmul(...)

with tf.device("/gpu:1"):
    with tf.control_dependencies([h_conv_flat]):
        ...stream 2 layers...
        nn_out_2 = tf.matmul(...)

…but with this approach the network does not even run. Whatever I try, it complains that the input was not initialized:

tensorflow.python.framework.errors.InvalidArgumentError:
    You must feed a value for placeholder tensor 'x'
    with dtype float
    [[Node: x = Placeholder[dtype=DT_FLOAT, shape=[],
    _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Without the with statements, the network trains on /gpu:0 only and runs fine: it learns something reasonable and produces no errors.

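What am I doing wrong? Is TensorFlow unable to split different streams of layers within one network across different GPUs? Do I always have to split the complete network into different towers?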

There is an example of how to use multiple GPUs with one network; perhaps you can copy that code. You can also do something like this:

# TensorFlow 1.x API (tf.Session, tf.ConfigProto).
import tensorflow as tf

# Creates a graph.
c = []
for d in ['/gpu:2', '/gpu:3']:
    # Pin one copy of the matmul to each listed GPU.
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))
# Sum the per-GPU results on the CPU.
with tf.device('/cpu:0'):
    total = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(total))
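The same device-placement pattern can be applied to a network with a shared trunk and two output streams. The following is only a minimal sketch under several assumptions, not the asker's actual model: a single dense layer stands in for the conv layers, the `dense` helper and all layer sizes are hypothetical, and `tensorflow.compat.v1` is used so the TF 1.x graph API also runs on newer installs. It uses no `tf.control_dependencies`, since an op already waits for the tensors it consumes, and it sets `allow_soft_placement=True` so that ops fall back to another device when a requested GPU is missing.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()


def dense(inputs, n_in, n_out, name):
    """Hypothetical helper: one fully-connected layer (weights plus bias)."""
    w = tf.Variable(tf.random.normal([n_in, n_out], stddev=0.1), name=name + "_w")
    b = tf.Variable(tf.zeros([n_out]), name=name + "_b")
    return tf.matmul(inputs, w) + b


graph = tf.Graph()
with graph.as_default():
    # Placeholders are fed from the host, so they are left unpinned.
    x = tf.placeholder(tf.float32, shape=[None, 64], name="x")

    # Shared trunk (stand-in for the conv layers) and stream 1 on the first GPU.
    with tf.device("/gpu:0"):
        h_shared = tf.nn.relu(dense(x, 64, 32, "shared"))
        nn_out_1 = dense(h_shared, 32, 10, "stream1")

    # Stream 2 on the second GPU; TensorFlow copies h_shared across devices.
    with tf.device("/gpu:1"):
        nn_out_2 = dense(h_shared, 32, 4, "stream2")

    init = tf.global_variables_initializer()

# Soft placement lets the graph run even if fewer than two GPUs are present.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
with tf.Session(graph=graph, config=config) as sess:
    sess.run(init)
    batch = np.random.rand(8, 64).astype(np.float32)
    out_1, out_2 = sess.run([nn_out_1, nn_out_2], feed_dict={x: batch})

print(out_1.shape, out_2.shape)
```

Setting `log_device_placement=True` instead prints where each op actually ended up, which is a quick way to check whether stream 2 really runs on the second GPU.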


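It depends on many different factors. Are the GPUs identical? How large is your data?

Yes, the two GPUs are identical; they sit on one card. It is an NVIDIA dual-GPU Tesla K80. It has 24 GB of VRAM, and the data fits entirely into the VRAM of a single GPU (12 GB).

Are you sure the bottleneck is the GPU's compute speed? The bottleneck is often the bandwidth into and out of the GPU rather than the actual computation; if you send a large tensor to the other GPU, that will only make things worse.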
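To get a feel for the bandwidth point, here is a rough back-of-the-envelope estimate of what it costs to ship the shared activations to the second GPU on every training step. Every number in it is an assumption chosen for illustration (batch size, flattened feature count, and an effective GPU-to-GPU bandwidth of 10 GB/s); none of them comes from the question or from measurements of a K80.

```python
def transfer_seconds(num_elements, bytes_per_element=4, bandwidth_bytes_per_s=10e9):
    """Time to move a float32 tensor between devices at an assumed bandwidth."""
    return num_elements * bytes_per_element / bandwidth_bytes_per_s


# Hypothetical sizes for the tensor that crosses the GPU boundary.
batch_size = 256
flat_features = 64 * 28 * 28  # e.g. 64 feature maps of 28x28, flattened

# Forward pass: activations go from the first GPU to the second.
forward = transfer_seconds(batch_size * flat_features)
# Backward pass: a gradient of the same shape comes back.
round_trip = 2 * forward

print("forward copy: %.2f ms" % (forward * 1e3))
print("per training step: %.2f ms" % (round_trip * 1e3))
```

If the layers placed on the second GPU take less time than this copy, splitting the streams slows training down rather than speeding it up, so it is worth timing both parts before committing to the split.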
