TensorFlow: using tf.map_fn with multiple GPUs

I am trying to extend my single-GPU TensorFlow code to multiple GPUs. I have to work in three degrees of freedom and, unfortunately, I need to use tf.map_fn to parallelize over the third one. I tried using device placement as shown in the official documentation, but it does not seem to be possible with tf.map_fn. Is there a way to run tf.map_fn on multiple GPUs?
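Roughly what I am attempting is the following, as a simplified sketch (the real code is more involved; the shapes and names here are only illustrative):

import tensorflow as tf

rc = 1000

# Illustrative input; the third "degree of freedom" is the last axis.
vals = tf.random_uniform([rc, rc, 4])

def mult(i):
    return tf.matmul(vals[:, :, i], vals[:, :, i + 1])

muls = []
for deviceName in ['/device:GPU:0', '/device:GPU:1']:
    with tf.device(deviceName):
        # map_fn over the last dimension; this is the placement that fails.
        mul = tf.map_fn(mult, tf.constant([0, 1, 2]), dtype=tf.float32)
    muls.append(mul)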

The error output is shown here:

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'map_1/TensorArray_1': Could not satisfy explicit device specification '' because the node was colocated with a group of nodes that required incompatible device '/device:GPU:1'
Colocation Debug Info:
Colocation group had the following types and devices: 
TensorArrayGatherV3: GPU CPU 
Range: GPU CPU 
TensorArrayWriteV3: GPU CPU 
TensorArraySizeV3: GPU CPU 
MatMul: GPU CPU 
Enter: GPU CPU 
TensorArrayV3: GPU CPU 
Const: GPU CPU 

Colocation members and user-requested devices:
  map_1/TensorArrayStack/range/delta (Const) 
  map_1/TensorArrayStack/range/start (Const) 
  map_1/TensorArray_1 (TensorArrayV3) 
  map_1/while/TensorArrayWrite/TensorArrayWriteV3/Enter (Enter) /device:GPU:1
  map_1/TensorArrayStack/TensorArraySizeV3 (TensorArraySizeV3) 
  map_1/TensorArrayStack/range (Range) 
  map_1/TensorArrayStack/TensorArrayGatherV3 (TensorArrayGatherV3) 
  map_1/while/MatMul (MatMul) /device:GPU:1
  map_1/while/TensorArrayWrite/TensorArrayWriteV3 (TensorArrayWriteV3) /device:GPU:1

         [[Node: map_1/TensorArray_1 = TensorArrayV3[clear_after_read=true, dtype=DT_FLOAT, dynamic_size=false, element_shape=<unknown>, identical_element_shapes=true, tensor_array_name=""](map_1/TensorArray_1/size)]]

What you are trying to do can be done with a batched matmul instead. Consider the following changes:

import tensorflow as tf
import numpy as np
import time

rc = 1000

sess = tf.Session()

# Compute on CPU as well, for comparison later.
vals = np.random.uniform(size=[rc, rc, 4]).astype(np.float32)
mat1 = tf.identity(vals)
mat2 = tf.transpose(vals, [2, 0, 1])

# Store the per-device results in a list so they are all fetched in one run call.
muls = []
# I only have one GPU.
for deviceName in ['/cpu:0', '/device:GPU:0']:
    with tf.device(deviceName):

        def mult(i):
            return tf.matmul(mat1[:, :, i], mat1[:, :, i + 1])

        mul = tf.map_fn(mult, np.array([0, 1, 2]), dtype=tf.float32,
                        parallel_iterations=10)
        muls.append(mul)

# Use the transposed matrix with a shift to do all the matmuls in one batched call.
mul = tf.matmul(mat2[:-1], mat2[1:])

print(muls)
print(mul)

start = time.time()
m1 = sess.run(muls)
end = time.time()

print("muls:", end - start)

start = time.time()
m2 = sess.run(mul)
end = time.time()

print("mul:", end - start)

# Check that the map_fn results and the batched matmul agree.
print(np.allclose(m1[0], m1[1]))
print(np.allclose(m1[0], m2))
print(np.allclose(m1[1], m2))
The results on my machine are as follows:

[<tf.Tensor 'map/TensorArrayStack/TensorArrayGatherV3:0' shape=(3, 1000, 1000) dtype=float32>, <tf.Tensor 'map_1/TensorArrayStack/TensorArrayGatherV3:0' shape=(3, 1000, 1000) dtype=float32>]
Tensor("MatMul:0", shape=(3, 1000, 1000), dtype=float32)
muls: 0.4262731075286865
mul: 0.3794088363647461
True
True
True

You rarely want to use the CPU synchronously with the GPUs, because it becomes the bottleneck: the GPUs will sit idle waiting for the CPU to finish. Any work you do on the CPU should be asynchronous with respect to the GPUs, so that they can run at full speed.
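If more than one GPU is available, the batched matmul itself can also be split across devices, keeping the CPU out of the synchronous path entirely. A sketch assuming two GPUs (the split points and names are only illustrative):

import tensorflow as tf
import numpy as np

rc = 1000
vals = np.random.uniform(size=[rc, rc, 4]).astype(np.float32)
mat2 = tf.transpose(vals, [2, 0, 1])   # shape [4, rc, rc]

# Split the three products of the batched matmul across two GPUs,
# then concatenate the partial results along the batch dimension.
parts = []
for deviceName, (lo, hi) in zip(['/device:GPU:0', '/device:GPU:1'], [(0, 2), (2, 3)]):
    with tf.device(deviceName):
        parts.append(tf.matmul(mat2[lo:hi], mat2[lo + 1:hi + 1]))

mul_multi = tf.concat(parts, axis=0)   # shape [3, rc, rc]

# allow_soft_placement lets the graph still run if one of the devices is missing.
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
print(sess.run(mul_multi).shape)       # (3, 1000, 1000)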

Have you tried not assigning the map to a device?

@McAngus Thanks for the comment. I just tried it, and to some extent it works, but it runs more than three times slower than the single-GPU code. Looking at nvidia-smi and htop, it appears to be using almost only the CPU.

Thanks, I don't have time to test it right now, but I will definitely use your hint to solve my problem as soon as possible.