Python: TensorFlow sequence on incomplete batches


I am trying to do batch training in TensorFlow. It works to some extent, since I can get through the first epoch in batches. Currently there are two problems with my code:

1. After the first epoch completes, the second epoch immediately goes to except tf.errors.OutOfRangeError; the next epoch does not restart batching from the top. How can I run another round of batches?
2. I print batchnr, and I notice that for the last batch of an epoch print(batchnr) is executed but print("End", batchnr) is not: execution jumps to the except clause and that batch is never trained. I assume this is because the number of rows left in the queue is smaller than the batch size. How can I still train on that last batch?

My train method and pipeline method:

def input_pipeline(file, batch_size, num_epochs=None):
  filename_queue = tf.train.string_input_producer([file], num_epochs=num_epochs, shuffle=True)
  example, label = read_from_csv(filename_queue)
  min_after_dequeue = 10000
  capacity = min_after_dequeue + 3 * 2  # NB: the usual recommendation is min_after_dequeue + 3 * batch_size
  example_batch, label_batch = tf.train.shuffle_batch(
      [example, label], batch_size=batch_size, capacity=capacity,
      min_after_dequeue=min_after_dequeue)
  return example_batch, label_batch

def train():
    examples, labels = input_pipeline(training_data_file, batch_size, 1)
    saver = tf.train.Saver()
    prediction = neural_network_model(p_inputdata)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=p_known_labels))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

    init = tf.group(tf.initialize_all_variables(),
                    tf.initialize_local_variables())
    with tf.Session() as sess:
        sess.run(init) # initialize all variables in the session

        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(p_known_labels, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

        latest_cost_of_batch = None
        for e in range(epochs):
            epoch = e + 1
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(coord=coord)
            try:
                batchnr = 1
                while not coord.should_stop():
                    print(batchnr)
                    batch_data, batch_labels = sess.run([examples, labels])
                    batch_labels_output = get_output_values(batch_labels)
                    print("End", batchnr)
                    batchnr += 1

                    _, latest_cost_of_batch = sess.run([optimizer,cost], feed_dict={
                        p_inputdata: batch_data,
                        p_known_labels: batch_labels_output
                    })

            except tf.errors.OutOfRangeError:
                print('Done training, epoch reached')
                if (epoch) % print_each_x_number_of_epochs == 0 or epoch == 0:
                    print('Epoch', epoch, 'completed out of', epochs, "---", 'Loss', latest_cost_of_batch)
                if epoch % save_each_x_number_of_epochs == 0:
                    saver.save(sess, checkpoint_label)
            finally:
                coord.request_stop()
        coord.join(threads)

        print("Trained for ", epochs,"epochs. Saving variables...")
        saver.save(sess, checkpoint_label)
        print("Variables saved. Training finished.")
    end = time.time()
    seconds = end - start
    print("Total runtime:", str(datetime.timedelta(seconds=seconds)))
Debug console:

Start training
1
End 1
2
End 2
....
213
End 213
214
Done training, epoch reached
Epoch 1 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 2 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 3 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 4 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 5 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 6 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 7 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 8 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 9 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 10 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 11 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 12 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 13 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 14 completed out of 15 --- Loss 4.43414
1
Done training, epoch reached
Epoch 15 completed out of 15 --- Loss 4.43414
Trained for  15 epochs. Saving variables...
Variables saved. Training finished.
Accuracy 0.935310311615 % after 15 epochs of training
Total runtime: 0:00:21.395917
EDIT
I changed my code based on Nicolas's answer (I now pass multiple epochs to the string_input_producer). This is the training code I have now:

def train():
    """Trains the neural network  
    """
    examples, labels = input_pipeline(training_data_file, batch_size, epochs)
    start = time.time()
    saver = tf.train.Saver()
    prediction = neural_network_model(p_inputdata)
    first_no_loss = True
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=p_known_labels))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

    init = tf.group(tf.initialize_all_variables(),
                    tf.initialize_local_variables())
    with tf.Session() as sess:
        sess.run(init) # initialize all variables in the session
        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(p_known_labels, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

        print("Start training")
        latest_cost_of_batch = None

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        epoch_op = "input_producer/limit_epochs/epochs:0"
        try:
            batchnr = 1
            epochs_var = 0
            while not coord.should_stop():
                if (batchnr) % print_each_x_number_of_batches == 0:
                    print('Batch', batchnr, 'completed of epoch', epochs_var, "---", 'Loss', latest_cost_of_batch)

                if  batchnr > 3194:
                    print("GETTING BATCH", batchnr)
                epochs_var, batch_data, batch_labels = sess.run([epoch_op, examples, labels])
                batch_labels_output = get_output_values(batch_labels)
                if  batchnr > 3194:
                    print("GOT BATCH", batchnr)
                batchnr += 1
                _, latest_cost_of_batch = sess.run([optimizer,cost], feed_dict={
                    p_inputdata: batch_data,
                    p_known_labels: batch_labels_output
                })

        except tf.errors.OutOfRangeError:
            print('Done training, epoch reached')
        finally:
            coord.request_stop()

        coord.join(threads)

        print("Trained for ", epochs,"epochs. Saving variables...")
        saver.save(sess, checkpoint_label)
        print("Variables saved. Training finished.")
        labels, values, output = get_training_or_testdata(training_data_file)
        print('Accuracy', accuracy.eval(feed_dict={p_inputdata: values, p_known_labels: output}) * 100, '% after', epochs, 'epochs of training')
    end = time.time()
    seconds = end - start
    print("Total runtime:", str(datetime.timedelta(seconds=seconds)))
My output looks like this:

Start training
Batch 100 completed of epoch 15 --- Loss 4.79351
Batch 200 completed of epoch 15 --- Loss 4.57468
Batch 300 completed of epoch 15 --- Loss 4.51134
Batch 400 completed of epoch 15 --- Loss 4.65865
Batch 500 completed of epoch 15 --- Loss 4.55456
Batch 600 completed of epoch 15 --- Loss 4.63549
Batch 700 completed of epoch 15 --- Loss 4.53037
Batch 800 completed of epoch 15 --- Loss 4.49263
Batch 900 completed of epoch 15 --- Loss 4.37
Batch 1000 completed of epoch 15 --- Loss 4.42719
Batch 1100 completed of epoch 15 --- Loss 4.4518
Batch 1200 completed of epoch 15 --- Loss 4.41053
Batch 1300 completed of epoch 15 --- Loss 4.43508
Batch 1400 completed of epoch 15 --- Loss 4.32173
Batch 1500 completed of epoch 15 --- Loss 4.36624
Batch 1600 completed of epoch 15 --- Loss 4.44027
Batch 1700 completed of epoch 15 --- Loss 4.37201
Batch 1800 completed of epoch 15 --- Loss 4.24956
Batch 1900 completed of epoch 15 --- Loss 4.40256
Batch 2000 completed of epoch 15 --- Loss 4.18391
Batch 2100 completed of epoch 15 --- Loss 4.30156
Batch 2200 completed of epoch 15 --- Loss 4.38423
Batch 2300 completed of epoch 15 --- Loss 4.23823
Batch 2400 completed of epoch 15 --- Loss 4.17783
Batch 2500 completed of epoch 15 --- Loss 4.31024
Batch 2600 completed of epoch 15 --- Loss 4.26312
Batch 2700 completed of epoch 15 --- Loss 4.26143
Batch 2800 completed of epoch 15 --- Loss 4.16691
Batch 2900 completed of epoch 15 --- Loss 4.48624
Batch 3000 completed of epoch 15 --- Loss 4.1347
Batch 3100 completed of epoch 15 --- Loss 4.20801
GETTING BATCH 3195
GOT BATCH 3195
GETTING BATCH 3196
GOT BATCH 3196
GETTING BATCH 3197
Done training, epoch reached
Trained for  15 epochs. Saving variables...
Variables saved. Training finished.
Accuracy 2.69019026309 % after 15 epochs of training
Total runtime: 0:03:07.577149
What I notice is that the last batch is still not trained (GOT BATCH 3197 is never printed) and, secondly, that this way of fetching the current epoch is not correct: it always reads 15. The answer linked in the comments explains why what I am doing is not the right way, but it does not explain a correct way to get hold of the current epoch. Any clues?
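One pragmatic alternative, since (as discussed below) the producer's epoch counter cannot be used for this, is to derive the epoch from the number of batches trained so far. This is an illustrative sketch using the figures given later in the comments (31968 rows, batch size 150); rows_per_epoch is a name introduced here, not from the code above:

rows_per_epoch = 31968  # total number of training rows (from the comments)
batch_size = 150
# batchnr counts completed batches, so integer division gives the pass number:
current_epoch = ((batchnr - 1) * batch_size) // rows_per_epoch + 1

Note that shuffle_batch mixes rows across epoch boundaries, so near a boundary this count is only approximate.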


EDIT: you may want to have a look at this, as it gives an example with the new API. Here is an explanation of what you are experiencing:

  • The first time you go through the for e in range(epochs) loop, the loop dequeues everything from your data queue (until the data queue raises tf.errors.OutOfRangeError).

    That error is raised when there are no filenames left in the filename queue, and that happens because you called examples, labels = input_pipeline(training_data_file, batch_size, 1).

    If you had called examples, labels = input_pipeline(training_data_file, batch_size, 3), for example, you would have gone through the file 3 times before moving on to e = 1.

  • Then, when you move on to e > 0, the filename queue keeps in memory that it has already dequeued all the filenames, and since no new enqueue operation happens, it raises tf.errors.OutOfRangeError straight away.

    See the string_input_producer docstring:

    Note: if num_epochs is not None, this function creates a local counter epochs. Use local_variables_initializer() to initialize local variables.

    (A minimal sketch of that initialization follows right after this list.)
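To make that docstring note concrete, here is a minimal sketch of the initialization it asks for (the file name and epoch count are illustrative, not taken from the question):

import tensorflow as tf

filename_queue = tf.train.string_input_producer(
    ["data.csv"], num_epochs=3, shuffle=True)

# num_epochs makes the producer create a hidden *local* counter "epochs",
# so local variables must be initialized alongside the global ones:
init = tf.group(tf.global_variables_initializer(),
                tf.local_variables_initializer())

with tf.Session() as sess:
    sess.run(init)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    try:
        while True:
            sess.run(filename_queue.dequeue())  # 3 passes over the filenames
    except tf.errors.OutOfRangeError:
        pass  # raised once all 3 epochs of filenames have been consumed
    finally:
        coord.request_stop()
        coord.join(threads)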

What you can do:

  • You can move the session context manager inside the for e in range(EPOCHS) loop:

    init_queue = tf.variables_initializer(tf.get_collection(tf.GraphKeys.LOCAL_VARIABLES, scope='input_producer'))
    with tf.Session() as sess:
        sess.run(init)
    for e in range(EPOCHS):
        with tf.Session() as sess:
            sess.run(init_queue) # initialize all local variables in the input_producer scope
            epoch = e + 1
    
    This means you need to reinitialize all the local variables of the input_producer scope, so you need to be careful about what they are. You could also save the model and load it again at every step, or

  • you rely on the num_epochs argument to run the right number of epochs and remove the for e in range(epochs) loop. Instead of printing information at the end of each epoch, you can print it every 100 or 1000 training steps (my favourite solution). If you really want to print information at the end of each epoch, you can try to access the hidden epochs variable, evaluate its value, and print whenever epochs changes (I would not recommend this option).

  • For example:

        batchnr = 0
        tmp_batchnr = 0
        while not coord.should_stop():
            if batchnr != tmp_batchnr:
                print(....)
                batchnr = tmp_batchnr
            epochs_var, _, _ = sess.run([epochs_var, examples, labels])
            print("End", batchnr)
            batchnr += 1
    
    Hope this helps!

    Comments on the edited question:

    Looking at the answer you mention, the emphasised sentence says, the way I read it, that you have no way of knowing which epoch a dequeued element belongs to:

    When tf.start_queue_runners() executes, all the epochs are enqueued together (in multiple stages if the capacity is smaller than the number of filenames). tf.train.string_input_producer uses a local variable epochs:0 to keep track of the epoch being enqueued. Once epochs:0 reaches num_epochs it stays constant and, no matter how many threads are dequeuing from the queue, it does not change.
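    In other words, the counter tracks enqueueing, not training: the input threads run ahead of the training loop, so the counter hits num_epochs almost immediately and stays there. A short illustration, reading the same tensor name the edited train() above already uses:

    # Because enqueueing runs ahead of training, this prints 15 from the
    # very first training step onwards, exactly as the log above shows:
    epoch_counter = tf.get_default_graph().get_tensor_by_name(
        "input_producer/limit_epochs/epochs:0")
    print(sess.run(epoch_counter))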


    Could you show the code of your input_pipeline method? — @Nicolas I added it for you, please have a look.

    You might want to have a look at this answer. — @Nicolas I am one month away from graduation; not sure I want to switch to an RC right now, but it looks like exactly what I need. And yes, it did help me, thanks :) There are still a few things I need to figure out, as shown in the edited question. — Good to know!

    Regarding your edited question: I cannot see where in your code you print the GETTING BATCH statements. — My old code was pasted badly; I am updating it now.

    Regarding your edit: I may have to give up on printing the epoch... but I would still like to train on that last batch.

    I apologise if I am asking a stupid question, but do you really expect a 3197th full batch? (How many epochs, what batch size and how many rows in total?) It looks to me as if TF tried to retrieve a 3197th batch, but by then all the data had already been dequeued, hence the tf.errors.OutOfRangeError. — I have 31968 rows, batches of 150 and 15 epochs: 31968 * 15 / 150 = 3196.8, so there is enough data for 3196 full batches plus a final incomplete batch of 120 rows. That last batch throws the error because it cannot fill the missing 30 rows. But I still want to train on those last 120 rows. The question is: how?
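For what it is worth, the incomplete final batch can be kept: tf.train.shuffle_batch accepts an allow_smaller_final_batch argument which, when True, makes the queue emit whatever is left as a smaller last batch instead of raising tf.errors.OutOfRangeError while the batch is still short. A sketch of input_pipeline adapted this way (this is not code from the thread; capacity is also written with the usual min_after_dequeue + 3 * batch_size recommendation):

def input_pipeline(file, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        [file], num_epochs=num_epochs, shuffle=True)
    example, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        allow_smaller_final_batch=True)  # keep the leftover rows as a final batch
    return example_batch, label_batch

With this flag the batch dimension of example_batch and label_batch becomes variable, so the model must not hard-code the batch size.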