Python TensorFlow timeline shows that gradient averaging is the performance bottleneck when using multiple GPUs


I am using multiple (2, actually) GPUs to train a network. The network trains well, but I found that the training speed fluctuates.

This is the snippet I used for profiling:

for i in range(resume_epoch, c.num_epochs):
    print("Epoch %d" % i)
    sess.run(train_itr.initializer)
    num_batches = num_egs // c.batch_size
    for batch in range(num_batches):
        step = i * num_batches + batch  # global step counter used in the print below
        start_time = time.time()
        _, loss_value = sess.run([train_op, loss])
        duration = time.time() - start_time
        examples_per_sec = c.batch_size / float(duration)
        print('step %d, loss = %.2f (%.1f examples/sec; %.3f '
              'sec/batch)' % (step, loss_value, examples_per_sec, duration))
This is the output:

...
step 5100, loss = 4.71 (556.3 examples/sec; 0.230 sec/batch)
step 5200, loss = 4.14 (341.9 examples/sec; 0.374 sec/batch)
step 5300, loss = 4.63 (363.4 examples/sec; 0.352 sec/batch)
step 5400, loss = 4.82 (176.0 examples/sec; 0.727 sec/batch)
The fastest steps process nearly 600 examples/sec, yet as shown above they can also drop to around 200 examples/sec.

At first, I suspected that the input pipeline might be the bottleneck. I use tf.data to process the input features, split them, and feed them to the different GPU towers. Here is the code:

def create_variable_train_dataset(filenames, batch_size, feat_dim, shuffle_size=-1):
    dataset = tf.data.Dataset.from_tensor_slices(filenames).shuffle(50)
    dataset = dataset.interleave(
        lambda filename: tf.data.TFRecordDataset(filename)
            .map(_parse_tfrecord, num_parallel_calls=8)
            .shuffle(shuffle_size)
            .apply(tf.contrib.data.padded_batch_and_drop_remainder(
                batch_size,
                padded_shapes={'input': [None, feat_dim], 'input_shape': [2], 'output': []})),
        cycle_length=len(filenames), block_length=1)

    dataset = dataset.prefetch(5)
    itr = dataset.make_initializable_iterator()
    element = itr.get_next()
    return itr, element['input'], element['output']
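To rule the input pipeline in or out as the bottleneck, one simple check is to time fetching batches from the iterator alone, without running the training op. Below is a minimal sketch of such a check; it assumes the create_variable_train_dataset and _parse_tfrecord functions above are available, and the filenames, feat_dim and batch_size values are placeholders rather than the actual configuration:

import time
import tensorflow as tf

# Placeholder values; the real file list, feature dimension and batch size
# come from the question's configuration object.
train_filenames = ['train_000.tfrecord', 'train_001.tfrecord']
feat_dim = 40
batch_size = 256

itr, feature, label = create_variable_train_dataset(
    train_filenames, batch_size=batch_size, feat_dim=feat_dim, shuffle_size=10000)

with tf.Session() as sess:
    sess.run(itr.initializer)
    for i in range(100):
        start = time.time()
        sess.run([feature, label])  # fetch one batch only, no training op
        print('input batch %d: %.3f sec' % (i, time.time() - start))

If these times stay consistently small compared to the slow 0.7 sec/batch steps above, the input pipeline is unlikely to be the cause of the fluctuation.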
In the main function:

train_itr, train_feature, train_label = create_variable_train_dataset(train_filenames,
                                                                          batch_size=c.batch_size,
                                                                          feat_dim=feat_dim,
                                                                          shuffle_size=400000//len(train_filenames))
features_splits = tf.split(train_feature, num_or_size_splits=c.num_gpus, axis=0)
labels_splits = tf.split(train_label, num_or_size_splits=c.num_gpus, axis=0)

tower_grads = []
reuse_variables = None
for i in range(c.num_gpus):
    with tf.device(assign_to_device('/gpu:{}'.format(i), ps_device=c.local_ps_device)):
        with tf.name_scope('tower_%d' % i) as scope:
            loss = _tower_loss(features_splits[i], labels_splits[i], num_classes, scope, reuse_variables)
            reuse_variables = True
            grad = ...some_function_to_compute_grad
            tower_grads.append(grad)
grads = _average_gradients(tower_grads)
def _average_gradients(tower_grads):
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # Note that each grad_and_vars looks like the following:
        #   ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN))
        grads = []
        for g, _ in grad_and_vars:
            # Add 0 dimension to the gradients to represent the tower.
            expanded_g = tf.expand_dims(g, 0)

            # Append on a 'tower' dimension which we will average over below.
            grads.append(expanded_g)

        # Average over the 'tower' dimension.
        grad = tf.concat(axis=0, values=grads)
        grad = tf.reduce_mean(grad, 0)

        # Keep in mind that the Variables are redundant because they are shared
        # across towers. So .. we will just return the first tower's pointer to
        # the Variable.
        v = grad_and_vars[0][1]
        grad_and_var = (grad, v)
        average_grads.append(grad_and_var)
    return average_grads
...
grads = _average_gradients(tower_grads)
apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)
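assign_to_device is used in the tower loop above but not defined in the question. It is usually a small device function (in the style of the public TensorFlow multi-GPU examples) that pins variable ops to the parameter-server device (here c.local_ps_device, i.e. the CPU) while leaving all other ops on the given GPU. The following is an assumed reconstruction, not the author's actual helper:

# Assumed reconstruction of the assign_to_device helper used above.
PS_OPS = ['Variable', 'VariableV2', 'AutoReloadVariable']

def assign_to_device(device, ps_device='/cpu:0'):
    """Return a device function that keeps variable ops on ps_device
    and places every other op on the given worker device."""
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        if node_def.op in PS_OPS:
            return ps_device
        return device
    return _assign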
_tower_loss is the function that produces the per-tower loss on each GPU, while the parameters are kept on the CPU:

def _tower_loss(features, labels, num_classes, scope, reuse_variables=None):
    # Build inference Graph.
    with tf.variable_scope(tf.get_variable_scope(), reuse=reuse_variables):
        logits = inference(features, num_classes, is_training=True, scope=scope)

    tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits, scope="loss")

    losses = tf.get_collection(tf.GraphKeys.LOSSES, scope)
    regularization_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    total_loss = tf.add_n(losses + regularization_losses, name='total_loss')

    # Compute the moving average of all individual losses and the total loss.
    loss_averages = tf.train.ExponentialMovingAverage(0.9, name='avg')
    loss_averages_op = loss_averages.apply(losses + [total_loss])

    with tf.control_dependencies([loss_averages_op]):
        total_loss = tf.identity(total_loss)

    return total_loss
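The per-tower gradient computation is elided in the question (...some_function_to_compute_grad). In CIFAR-10-style multi-tower code it is typically opt.compute_gradients(loss) called inside the tower's device scope, so that the backward pass stays on that tower's GPU. The snippet below is an assumption about that pattern, not the author's actual code; the Adam optimizer is assumed because an ApplyAdam op shows up in the timeline:

# Hypothetical per-tower gradient computation (assumed, not taken from the question).
opt = tf.train.AdamOptimizer(learning_rate=1e-3)  # learning rate is a placeholder

tower_grads = []
reuse_variables = None
for i in range(c.num_gpus):
    with tf.device(assign_to_device('/gpu:{}'.format(i), ps_device=c.local_ps_device)):
        with tf.name_scope('tower_%d' % i) as scope:
            loss = _tower_loss(features_splits[i], labels_splits[i], num_classes, scope, reuse_variables)
            reuse_variables = True
            # compute_gradients builds the backward pass on this tower's GPU and
            # returns a list of (gradient, variable) pairs.
            grad = opt.compute_gradients(loss)
            tower_grads.append(grad)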
Next, I used the timeline tool to inspect how time is spent during training. To my surprise, the CPU takes a very long time. Here is what I did:

from tensorflow.python.client import timeline

# Full tracing so that step_stats contains per-op timings.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

start_time = time.time()
if step % 100 == 0:
    _, loss_value = sess.run([train_op, loss], options=run_options, run_metadata=run_metadata)
    duration = time.time() - start_time
    # Create the Timeline object, and write it to a json
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(ctf)
else:
    _, loss_value = sess.run([train_op, loss])
    duration = time.time() - start_time
Here is the result for the last step shown above (step 5400, loss = 4.82 (176.0 examples/sec; 0.727 sec/batch)):

As you can see, CPU:0 takes a very long time.

Concat, Mean, and ApplyAdam() take the most time. They come from the _average_gradients function and the apply_gradients call shown above.

This is reasonable, since the gradients should be averaged after the GPUs have computed them. But how can I improve the performance? I implemented my model by following the referenced example, and I am using TensorFlow 1.4.0.

Are there any suggestions for speeding up training?


Please let me know if any other code, files, or information would help solve this problem.

I tried moving the gradient averaging and the gradient-descent update to GPU:0. Because my GPUs have peer-to-peer connections, the data movement is fast and the computation on the GPU is fast as well. Placing all these ops on the first GPU almost solves my problem. Comments from anyone with other ideas are welcome :D
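For reference, here is a minimal sketch of that change, assuming the tower loop, _average_gradients, opt and global_step are built exactly as above; the only modification is where the ops are placed:

# Sketch of the fix described above (assumes the same graph-building code as earlier).
c.local_ps_device = '/gpu:0'  # keep the shared variables on GPU 0 instead of the CPU

with tf.device('/gpu:0'):
    # Concat/Mean from _average_gradients and the ApplyAdam update now run on GPU 0;
    # the tower gradients move between GPUs via peer-to-peer copies.
    grads = _average_gradients(tower_grads)
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)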

Yi Bill - Your chance of getting a good reply is higher if you ask only one question. Looking at your code, opt.apply_gradients(grads, global_step=global_step) means the gradients are sent from all GPUs to one GPU (or the CPU) and processed in that one place. The gradient matrices are as large as the whole network, so sending them from GPU to GPU is a considerable transfer load. They are most likely sent to GPU 0 and processed there. Try another experiment: compute the gradient for only one variable (such as the bottleneck).

@Panchishin I edited the question to focus on the main problem :-) At the beginning of the code I use tf.Graph().as_default(), tf.device('/cpu:0') to make clear that the gradient descent is done on cpu:0. I know that transferring data between host and device is a big overhead, but the TensorFlow tutorial recommends this setup when using multiple GPUs. I have now changed the parameter server from the CPU to gpu:0, since my devices support P2P transfer. I will check the performance and try your suggestion later.