Reduced training speed on a multi-GPU machine using TensorFlow's dynamic RNN


I have two machines available on which I can train models built with TensorFlow: a local desktop machine with one GPU (called "local" in the following) and a remote cluster with four GPUs (called "cluster" in the following). Even though the cluster has four GPUs, I only use one GPU at a time (e.g. via
CUDA_VISIBLE_DEVICES=2 python script.py
). My problem is that training the exact same model on the cluster is considerably slower than on the local machine, even though the cluster has the more powerful GPUs. I realize that this question may be very localized and the reason hard to pin down, but I am at a loss as to what causes this behavior. In the following, I try to give as many details as possible about the configuration of both machines and the model I am building.
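
As a sanity check (not part of the original question), one can verify from inside TensorFlow that only the selected card is visible and where ops get placed. A minimal sketch:

import tensorflow as tf
from tensorflow.python.client import device_lib

# With CUDA_VISIBLE_DEVICES=2, the selected card is the only GPU
# TensorFlow sees, and it shows up as /gpu:0 -- the device the
# model code below pins to with tf.device('gpu:0').
print(device_lib.list_local_devices())

# log_device_placement makes the session print the device chosen
# for every op, confirming nothing silently falls back to the CPU.
config = tf.ConfigProto(log_device_placement=True)
sess = tf.Session(config=config)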

Model

The model is a simple toy RNN taken from this project. The model is defined as follows:

# Parameters
learning_rate = 0.01
training_steps = 600
batch_size = 128
display_step = 200

# Network Parameters
seq_max_len = 20  # Sequence max length
n_hidden = 64  # hidden layer num of features
n_classes = 2  # linear sequence or not

# tf Graph input
x = tf.placeholder("float", [None, seq_max_len, 1])
y = tf.placeholder("float", [None, n_classes])
# A placeholder for indicating each sequence length
seqlen = tf.placeholder(tf.int32, [None])

# Define weights
weights = {
    'out': tf.Variable(tf.random_normal([n_hidden, n_classes]))
    }
biases = {
    'out': tf.Variable(tf.random_normal([n_classes]))
    }


def dynamicRNN(x, seqlen, weights, biases):
    # Prepare data shape to match `rnn` function requirements
    # Current data input shape: (batch_size, n_steps, n_input)
    # Required shape: 'n_steps' tensors list of shape (batch_size, n_input)

    with tf.device('gpu:0'):
        # Unstack to get a list of 'n_steps' tensors of shape (batch_size, n_input)
        x = tf.unstack(x, seq_max_len, 1)

        # Define a lstm cell with tensorflow
        lstm_cell = tf.contrib.rnn.BasicLSTMCell(n_hidden)

        # Get lstm cell output, providing 'sequence_length' will perform dynamic
        # calculation.
        outputs, states = tf.contrib.rnn.static_rnn(lstm_cell, x, dtype=tf.float32,
                                                    sequence_length=seqlen)

        # When performing dynamic calculation, we must retrieve the last
        # dynamically computed output, i.e., if a sequence length is 10, we need
        # to retrieve the 10th output.
        # However TensorFlow doesn't support advanced indexing yet, so we build
        # a custom op that for each sample in batch size, get its length and
        # get the corresponding relevant output.

        # 'outputs' is a list of output at every timestep, we pack them in a Tensor
        # and change back dimension to [batch_size, n_step, n_input]
        outputs = tf.stack(outputs)
        outputs = tf.transpose(outputs, [1, 0, 2])

        # Hack to build the indexing and retrieve the right output.
        batch_size = tf.shape(outputs)[0]
        # Start indices for each sample
        index = tf.range(0, batch_size)*seq_max_len+(seqlen-1)
        # Indexing
        outputs = tf.gather(tf.reshape(outputs, [-1, n_hidden]), index)

        # Linear activation, using outputs computed above
        return tf.matmul(outputs, weights['out'])+biases['out']


pred = dynamicRNN(x, seqlen, weights, biases)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Evaluate model
correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
The complete (runnable) Python script can be found at:
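
The per-step times in the output below are measured around each sess.run call. For reference, here is a minimal sketch of such a training loop; the random-batch generator next_batch is a hypothetical stand-in for the example project's toy data class:

import time
import numpy as np

# Hypothetical batch generator: random sequences padded to
# seq_max_len, together with their true lengths and one-hot labels.
def next_batch(batch_size):
    lengths = np.random.randint(5, seq_max_len + 1, size=batch_size)
    batch_x = np.zeros((batch_size, seq_max_len, 1), dtype=np.float32)
    batch_y = np.zeros((batch_size, n_classes), dtype=np.float32)
    for i, l in enumerate(lengths):
        batch_x[i, :l, 0] = np.random.rand(l)
        batch_y[i, np.random.randint(n_classes)] = 1.0
    return batch_x, batch_y, lengths

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for step in range(1, training_steps + 1):
        batch_x, batch_y, batch_seqlen = next_batch(batch_size)
        # Time one optimizer step (this is the "Time" column below).
        start = time.time()
        sess.run(optimizer, feed_dict={x: batch_x, y: batch_y,
                                       seqlen: batch_seqlen})
        elapsed = time.time() - start
        if step % display_step == 0 or step == 1:
            loss, acc = sess.run([cost, accuracy],
                                 feed_dict={x: batch_x, y: batch_y,
                                            seqlen: batch_seqlen})
            print("Step %d, Minibatch Loss= %.6f, Training Accuracy= %.5f, "
                  "Time: %s" % (step * batch_size, loss, acc, elapsed))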

Local configuration
  • TensorFlow: v1.3 (precompiled version installed)
  • CUDA: v8.0.61
  • cuDNN: v6.0.21
  • GPU: GeForce GTX TITAN X
  • NVIDIA driver: 375.82
  • OS: Ubuntu 16.04, 64-bit
Configuration on the cluster

Exactly the same as local, except:

  • GPU: GeForce GTX TITAN X Pascal
  • NVIDIA driver: 375.66
Performance measurements

Executing the toy script provided above, I get the following output on the local machine:

Step 128, Minibatch Loss= 0.725320, Training Accuracy= 0.43750, Time: 0.3180224895477295
Step 25600, Minibatch Loss= 0.683126, Training Accuracy= 0.50962, Time: 0.013816356658935547
Step 51200, Minibatch Loss= 0.680907, Training Accuracy= 0.50000, Time: 0.013682842254638672
Step 76800, Minibatch Loss= 0.677346, Training Accuracy= 0.57692, Time: 0.014072895050048828
And the following on the cluster:

Step 128, Minibatch Loss= 1.536499, Training Accuracy= 0.47656, Time: 0.8308820724487305
Step 25600, Minibatch Loss= 0.693901, Training Accuracy= 0.49038, Time: 0.06193065643310547
Step 51200, Minibatch Loss= 0.689709, Training Accuracy= 0.53846, Time: 0.05762457847595215
Step 76800, Minibatch Loss= 0.685955, Training Accuracy= 0.54808, Time: 0.06454324722290039
As you can see, execution times on the cluster are roughly four times higher. I tried to profile what is happening on the GPU using the timeline feature. I find the output of this feature hard to interpret, but what strikes me most is that there are huge idle gaps on the cluster. For this, see the following images showing a trace of the timeline feature for one call to
sess.run
(note that the scale of the timeline is not exactly the same in the two images, but the difference should still be visible).
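
For reference, the timeline traces referred to above can be produced along these lines; a minimal sketch of tracing a single sess.run call (assuming the session and feed values from the training loop sketch above):

from tensorflow.python.client import timeline

# Request full tracing for one training step.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

sess.run(optimizer,
         feed_dict={x: batch_x, y: batch_y, seqlen: batch_seqlen},
         options=run_options, run_metadata=run_metadata)

# Write a Chrome trace (open it via chrome://tracing) showing per-op
# timing on each device, including idle gaps between ops.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())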

Timeline on the cluster: [image]

Timeline on the local machine: [image]


Have any of you observed the same behavior? What could be causing it, and how could the problem be narrowed down further?

Wait, which timeline trace is the local one and which is the cluster? – Engineero
@Engineero Oops, forgot about that, added captions now.