Understanding device allocation, parallelism (tf.while_loop) and tf.function in TensorFlow


I am trying to understand parallelism on the GPU in TensorFlow, because I need to apply it to uglier graphs.

import tensorflow as tf
from datetime import datetime

with tf.device('/device:GPU:0'):
    var = tf.Variable(tf.ones([100000], dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)

@tf.function
def foo():
    return tf.while_loop(c, b, [i], parallel_iterations=1000)      #tweak

@tf.function
def b(i):
    var.assign(tf.tensor_scatter_nd_update(var, tf.reshape(i, [-1,1]), tf.constant([0], dtype=tf.dtypes.float32)))
    return tf.add(i,1)

with tf.device('/device:GPU:0'):
    i = tf.constant(0)
    c = lambda i: tf.less(i,100000)

start = datetime.today()
with tf.device('/device:GPU:0'):
    foo()
print(datetime.today()-start)

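In the code above, var is a tensor of length 100000 whose elements are updated one at a time as shown. When I change parallel_iterations through 10, 100, 1000 and 10000, the run time barely changes (about 9.8 seconds every time), even though parallel_iterations is set explicitly.

I want these updates to happen in parallel on the GPU. How can I achieve that?

One technique is to use a distribution strategy and its scope: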

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
  inputs = tf.keras.layers.Input(shape=(1,))
  predictions = tf.keras.layers.Dense(1)(inputs)
  model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
  model.compile(loss='mse',
                optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))

Another option is to replicate the operation on each device:

# Replicate your computation on multiple GPUs
c = []
for d in ['/device:GPU:2', '/device:GPU:3']:
  with tf.device(d):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
  total = tf.add_n(c)  # renamed from `sum` to avoid shadowing the Python builtin
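The device strings above assume a machine with at least four GPUs. As a minimal sketch, assuming TensorFlow 2.x, the same replication can be written against whatever devices are actually visible, falling back to the CPU when no GPU is present. The names `devices`, `partials` and `total` are illustrative only.

```python
import tensorflow as tf

# Use every visible GPU; fall back to the CPU so the sketch still runs without one.
gpus = tf.config.list_logical_devices('GPU')
devices = [d.name for d in gpus] or ['/device:CPU:0']

partials = []
for d in devices:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        partials.append(tf.matmul(a, b))  # one independent matmul per device

# Combine the per-device results on the CPU.
with tf.device('/device:CPU:0'):
    total = tf.add_n(partials)
```

Each matmul has no data dependency on the others, which is what allows the runtime to schedule them concurrently on separate devices.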

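See the linked reference for more details.

First, note that your tensor scatter update touches only a single index per iteration, so you may be measuring nothing more than the overhead of the loop itself.

I modified your code to use a much larger batch size. Running in Colab on a GPU, I needed batch=10000 to hide the loop latency. Anything below that measures (or pays for) the latency overhead.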
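When comparing timings like these, keep in mind that the first call to a tf.function also includes tracing, so a single timed call mixes tracing cost with execution cost. A minimal sketch of a timing helper that warms up first and reports the best of several runs follows; `time_call` is a hypothetical helper, not part of TensorFlow, and it accepts any zero-argument callable.

```python
import time

def time_call(fn, warmup=1, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    for _ in range(warmup):
        fn()  # untimed; for a tf.function the first call includes tracing
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

With the code from the question, this would be used as `print(time_call(foo))`, once per value of parallel_iterations.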

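A further question is whether var.assign(tensor_scatter_nd_update(...)) really avoids the extra copy that tensor_scatter_nd_update would otherwise produce. Experimenting with the batch size suggests that we are not paying for an extra copy, so it appears to be avoided quite well.

However, it turns out that in this case TensorFlow apparently just treats the iterations as dependent on one another, so increasing parallel_iterations makes no difference (at least in my tests). For further discussion of what TF does, see the following:

It can only run in parallel when the operations are independent.

By the way, an arbitrary scatter operation is not very efficient on a GPU, but you still could (and should) be able to execute several of them in parallel if TF considers them independent.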

import tensorflow as tf
from datetime import datetime

size = 1000000
index_count = size
batch = 10000
iterations = 10

with tf.device('/device:GPU:0'):
    var = tf.Variable(tf.ones([size], dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)
    indexes = tf.Variable(tf.range(index_count, dtype=tf.dtypes.int32), dtype=tf.dtypes.int32)
    var2 = tf.Variable(tf.range(index_count, dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)  # limit passed as a scalar

@tf.function
def foo():
    return tf.while_loop(c, b, [i], parallel_iterations = iterations)      #tweak

@tf.function
def b(i):
    var.assign(tf.tensor_scatter_nd_update(var, tf.reshape(indexes, [-1,1])[i:i+batch], var2[i:i+batch]))
    return tf.add(i, batch)

with tf.device('/device:GPU:0'):
    i = tf.constant(0)
    c = lambda i: tf.less(i,index_count)

start = datetime.today()
with tf.device('/device:GPU:0'):
    foo()
print(datetime.today()-start)

My guess is that it treats these operations as dependent (and therefore not parallelizable).
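If the per-index updates really are independent, one way to sidestep the loop-carried dependency is to drop the loop and issue a single scatter that covers all indices at once. The following is a minimal sketch, assuming TensorFlow 2.x and a small size so that it also runs quickly on a CPU; it is an alternative formulation, not the code from the question, and `scatter_all` is an illustrative name.

```python
import tensorflow as tf

size = 1000  # deliberately small; the question uses 100000

var = tf.Variable(tf.ones([size], dtype=tf.float32))
indices = tf.reshape(tf.range(size, dtype=tf.int32), [-1, 1])  # shape [size, 1]
updates = tf.zeros([size], dtype=tf.float32)

@tf.function
def scatter_all():
    # One op writes every index in place, so there is no chain of
    # read-modify-write steps on `var` for the runtime to serialize.
    var.scatter_nd_update(indices, updates)

scatter_all()
```

Because every iteration of the original loop reads and then reassigns the same variable, each iteration depends on the previous one; a single vectorized scatter leaves the parallelism to the kernel instead of to tf.while_loop.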