TensorFlow per-batch training time keeps increasing


My per-batch training time keeps growing, as shown below. I have also included a rough summary of my code. I am not sure whether this is because I am somehow modifying my graph somewhere, but I cannot find where.

2018-05-02 01:56:18 step 0, train loss = 3.362801
2018-05-02 01:58:17 step 10, train loss = 3.589638
2018-05-02 02:01:43 step 20, train loss = 3.214278
2018-05-02 02:06:53 step 30, train loss = 2.952656
2018-05-02 02:13:59 step 40, train loss = 2.856005
2018-05-02 02:22:30 step 50, train loss = 2.802824
2018-05-02 02:32:06 step 60, train loss = 2.735146
2018-05-02 02:55:46 step 70, train loss = 2.671062
2018-05-02 03:07:00 step 80, train loss = 2.596556
2018-05-02 03:18:54 step 90, train loss = 2.536373
2018-05-02 03:31:27 step 100, train loss = 2.492104
2018-05-02 03:43:51 step 110, train loss = 2.446146
2018-05-02 03:56:32 step 120, train loss = 2.412097
2018-05-02 04:10:06 step 130, train loss = 2.389571
2018-05-02 05:13:31 step 140, train loss = 2.344358
2018-05-02 05:53:27 step 150, train loss = 2.343909
2018-05-02 07:30:38 step 160, train loss = 2.329638
2018-05-02 09:58:32 step 170, train loss = 2.297986
...


import tensorflow as tf
from time import strftime, localtime

import input_pipeline

class Trainer(...):
    def __init__(...):
        ...
        self.model = Model(conf, ...)  # model constructor

    def train(self):
        ...
        with tf.device(device):
            # construct input pipeline
            input_pipe_train = input_pipeline.InputPipe(self.conf, self.train_feat_scp, self.train_text_scp, len(self.worker_hosts), self.task_index)

            # get a batch from the input pipeline
            inputs, targets, input_seq_lengths, target_seq_lengths = input_pipe_train.next_batch()

            # forward computation
            logits, logit_seq_lengths = self.model(inputs=inputs, input_seq_lengths=input_seq_lengths, targets=targets, target_seq_lengths=target_seq_lengths)

            # loss computation
            loss = self.model.loss(targets, logits, logit_seq_lengths, target_seq_lengths, self.loss_func)

            global_step = tf.train.get_or_create_global_step()

            # model parameter update
            update = self.model.update(loss, global_step, self.lrate, self.grad_clip)

            # variable and dataset iterator initialization ops
            init_op = tf.global_variables_initializer()
            init_data_op = input_pipe_train.iterator.initializer

        with tf.train.MonitoredTrainingSession(master=master, is_chief=is_chief, config=config, hooks=hooks) as mon_sess:

            mon_sess.run(init_op)
            mon_sess.run(init_data_op)
            while not mon_sess.should_stop():
                _, lossVal, step = mon_sess.run([update, loss, tf.train.get_global_step()])
                if step % 10 == 0:
                    cur_time = strftime("%Y-%m-%d %H:%M:%S", localtime())
                    print('%s step %d, train loss = %f' % (cur_time, step, lossVal))
The following is the code for the model and the input pipeline:

class Model(...):
    def __init__(...):
        ...

    def __call__(...):  # feedforward computation
        ...
        return logits, logit_seq_lengths

    def loss(...):
        # some matrix manipulation
        ...
        return tf.contrib.seq2seq.sequence_loss(logits, expanded_targets, weights)

    def update(...):
        optimizer = tf.train.AdamOptimizer(learning_rate)
        grads_and_vars = optimizer.compute_gradients(loss=loss)
        ...
        update_op = optimizer.apply_gradients(grads_and_vars=clipped_grads_and_vars, global_step=global_step, name='apply_gradients')
        return update_op


class InputPipe(object):
    def __init__(...):
        # get a list of tfrecord filenames
        ...

        # tf Dataset stuff
        dataset = tf.data.TFRecordDataset(...)
        dataset = dataset.map(...)
        dataset = dataset.zip(...)
        dataset = dataset.shard(...)
        dataset = dataset.padded_batch(...)

        # some special data shuffle strategy with the dataset
        ...

        self.iterator = self.dataset.make_initializable_iterator()

    def next_batch(self):
        return self.iterator.get_next()
Could anyone help me figure out what is causing this?


Thanks.

I think this is because you are repeatedly calling tf.train.get_global_step() inside your training loop. As a rule of thumb, you should avoid calling any tf ops after the session has been created (especially inside a loop), because that adds them to the graph and gradually slows things down. Try defining tf.train.get_global_step() once and repeatedly running that same op, just as you already do with update, loss, etc.
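
For example, a minimal sketch of that change, assuming the rest of train() stays exactly as posted above (the global_step tensor is already created with tf.train.get_or_create_global_step() before the session starts, so it can simply be reused as a fetch):

global_step = tf.train.get_or_create_global_step()
update = self.model.update(loss, global_step, self.lrate, self.grad_clip)

with tf.train.MonitoredTrainingSession(master=master, is_chief=is_chief, config=config, hooks=hooks) as mon_sess:
    mon_sess.run(init_op)
    mon_sess.run(init_data_op)
    while not mon_sess.should_stop():
        # fetch the pre-built global_step tensor; no tf calls happen inside the loop
        _, lossVal, step = mon_sess.run([update, loss, global_step])
        if step % 10 == 0:
            cur_time = strftime("%Y-%m-%d %H:%M:%S", localtime())
            print('%s step %d, train loss = %f' % (cur_time, step, lossVal))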

Thanks for pointing that out. I removed it from the session run call, but I still see the same kind of per-batch training time increase.

Have you tried using tfprof or tfdbg to see how some of those slow steps differ from the others?
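
If it helps narrow things down, here is one illustrative way to compare an early step against a late one. This is only a sketch, not part of the original code, and it uses the Chrome-trace timeline rather than tfprof/tfdbg itself; it assumes the update, loss, and global_step fetches from the training loop above:

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# trace one training step in full detail; do this at, say, step 10 and again at step 170
_, lossVal, step = mon_sess.run([update, loss, global_step],
                                options=run_options,
                                run_metadata=run_metadata)

# write a Chrome trace that can be inspected in chrome://tracing
trace = timeline.Timeline(run_metadata.step_stats)
with open('timeline_step_%d.json' % step, 'w') as f:
    f.write(trace.generate_chrome_trace_format())

Comparing the two traces should show whether the extra time is spent inside ops (e.g. input pipeline stalls) or outside the graph execution itself.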