Why does my TensorFlow code crash?


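I've built a toy model for image classification. The program is loosely structured like [link missing]. Training starts off fine, but eventually the program crashes. I've finalized the graph in case ops were being added somewhere, and it looks fine in TensorBoard, but without fail it eventually freezes and forces a hard restart (or a long wait for an eventual reboot). The way it exits makes it look like a GPU memory problem, but the model is small and should fit. If I allocate the full GPU memory (another 4 GB), it still crashes.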

The data are 256x256x3 images and labels stored in a tfrecords file. The training function looks like this:

import os
import time

import tensorflow as tf  # TF 1.x graph/session API


def train():
    with tf.Graph().as_default() as graph:
        global_step = tf.contrib.framework.get_or_create_global_step()
        train_images_batch, train_labels_batch = distorted_inputs(batch_size=BATCH_SIZE)
        train_logits = inference(train_images_batch)
        train_batch_loss = loss(train_logits, train_labels_batch)
        train_op = training(train_batch_loss, global_step, 0.1)

        merged = tf.summary.merge_all()
        saver = tf.train.Saver(tf.global_variables())
        # Cap this process at 75% of the GPU's memory.
        gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.75)
        sess_config = tf.ConfigProto(gpu_options=gpu_options)
        sess = tf.Session(config=sess_config)
        train_summary_writer = tf.summary.FileWriter(
            os.path.join(ROOT, 'logs', 'train'), sess.graph)
        init = tf.global_variables_initializer()

        sess.run(init)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)

        # Finalize the graph in use, so that any op added inside the
        # loop raises an error. (tf.Graph().finalize() would only
        # finalize a new, empty graph.)
        graph.finalize()
        for i in range(5540):
            start_time = time.time()
            summary, _, batch_loss = sess.run([merged, train_op, train_batch_loss])
            duration = time.time() - start_time
            train_summary_writer.add_summary(summary, i)
            if i % 10 == 0:
                msg = 'batch: {} loss: {:.6f} time: {:.4f} sec/batch'.format(
                    i, batch_loss, duration)
                print(msg)
        coord.request_stop()
        coord.join(threads)
        sess.close()

The loss and training ops are cross-entropy and the Adam optimizer, respectively:

def loss(logits, labels):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits, name='cross_entropy_per_example')
    xentropy_mean = tf.reduce_mean(xentropy, name='cross_entropy')
    tf.add_to_collection('losses', xentropy_mean)
    return xentropy_mean

def training(loss, global_step, learning_rate):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.minimize(loss, global_step=global_step)
    return train_op
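
For readers unfamiliar with the op, the quantity that `loss` averages can be sketched in plain Python. This only illustrates sparse softmax cross-entropy on lists of floats; it is not TensorFlow's implementation, and the function names are made up for this sketch.

```python
import math


def sparse_softmax_xent(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    # Subtract the largest logit first so exp() cannot overflow.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mean_xent(batch_logits, batch_labels):
    """Mean per-example cross-entropy over a batch."""
    losses = [sparse_softmax_xent(z, y)
              for z, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)
```

With uniform logits over two classes, each example contributes log 2, so a loss that stays near log(number of classes) suggests the model is still predicting uniformly.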

and batches are produced with

def distorted_inputs(batch_size):
    # Queue of input files; num_epochs=None cycles through them indefinitely.
    filename_queue = tf.train.string_input_producer(
        ['data/train.tfrecords'], num_epochs=None)
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        features={'label': tf.FixedLenFeature([], tf.int64),
                  'image': tf.FixedLenFeature([], tf.string)})
    label = features['label']
    label = tf.cast(label, tf.int32)
    # Decode raw bytes and scale pixel values from 0..255 to -0.5..0.5.
    image = tf.decode_raw(features['image'], tf.uint8)
    image = (tf.cast(image, tf.float32) / 255) - 0.5
    image = tf.reshape(image, shape=[256, 256, 3])
    # data augmentation
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_flip_left_right(image)
    print('filling the queue with {} images '
          'before starting to train'.format(MIN_QUEUE_EXAMPLES))
    return _generate_batch(image, label, MIN_QUEUE_EXAMPLES, batch_size)
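
The decode and scaling steps in `distorted_inputs` can be sanity-checked without TensorFlow. This is a minimal sketch, assuming each record's `image` field holds exactly one 256x256x3 uint8 image; the helper names are hypothetical and not part of the original program.

```python
HEIGHT, WIDTH, CHANNELS = 256, 256, 3
EXPECTED_BYTES = HEIGHT * WIDTH * CHANNELS  # one byte per uint8 value


def check_raw_image(raw):
    """Raise if the byte string cannot be reshaped to [256, 256, 3]."""
    if len(raw) != EXPECTED_BYTES:
        raise ValueError(
            'expected {} bytes, got {}'.format(EXPECTED_BYTES, len(raw)))


def scale_pixel(value):
    """Same arithmetic as the pipeline: map 0..255 to -0.5..0.5."""
    return value / 255 - 0.5
```

A record whose byte length differs from `EXPECTED_BYTES` would make the `tf.reshape` call fail at run time, so a check like this is one way to rule out malformed records before training.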

and the `_generate_batch` helper (listed at the end of this page).

What am I missing?

So I solved this. Here is the solution in case it's useful to others. TL;DR: it was a hardware problem.

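Specifically, it was a PCIe bus error, the same error as in the top-voted answer of another thread. As suggested there, it may be caused by message signaled interrupts (MSI) being incompatible with the PLX switches. The fix, also from that thread, was to set the kernel parameter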

pci=nommconf

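to disable MSI. (Strictly, `pci=nommconf` turns off memory-mapped PCI configuration access, MMCONFIG; the parameter that disables MSI outright is `pci=nomsi`.)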
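
To confirm after a reboot that the parameter took effect, the running kernel's command line can be inspected. The sketch below assumes a Linux system that exposes `/proc/cmdline`; the function names are hypothetical.

```python
def pci_options(cmdline):
    """Collect the options of every `pci=` token in a kernel command line.

    The kernel accepts a comma-separated list, e.g. `pci=nommconf,noaer`.
    """
    opts = set()
    for token in cmdline.split():
        if token.startswith('pci='):
            opts.update(token[len('pci='):].split(','))
    return opts


def read_cmdline(path='/proc/cmdline'):
    """Read the command line the running Linux kernel was booted with."""
    with open(path) as f:
        return f.read().strip()
```

On the affected machine, `'nommconf' in pci_options(read_cmdline())` should be true once the parameter has been added to the boot loader configuration and the system has been restarted.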

Of TensorFlow, Torch, and Theano, TF is the only deep learning framework that triggers this problem. Why, I'm not sure.

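Hey, I'm not sure what you're asking... what doesn't work? If you're running Windows, you can load the crash dump in windbg and check which module failed.

The program runs fine until it freezes my machine, usually after 800 or more iterations (batch size 64 or 128, it doesn't seem to matter). So I suspect a (GPU) memory leak (RAM usage is stable throughout), but it's not clear to me where (if that is the problem at all). Since there's no way to get detailed GPU memory usage from within a TensorFlow program while it runs, I was wondering whether people with more TensorFlow experience than me can spot a culprit in my code. I'm still somewhat new to graph models. If it helps, I'm running Ubuntu 16.04, TF 1.0.0, and Python 3.5 on a Tesla K40.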
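
One way to watch GPU memory from outside the TensorFlow process is to poll `nvidia-smi`. The sketch below assumes `nvidia-smi` is on the PATH; the function names are hypothetical. Because TensorFlow reserves GPU memory up front, the numbers may mostly reflect what the process has reserved rather than what the model actually uses.

```python
import subprocess

QUERY = ['nvidia-smi', '--query-gpu=memory.used,memory.total',
         '--format=csv,noheader,nounits']


def parse_memory(output):
    """Parse lines like '1024, 11441' into (used_mib, total_mib), one per GPU."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field) for field in line.split(','))
        rows.append((used, total))
    return rows


def gpu_memory():
    """Query the driver once; requires nvidia-smi to be installed."""
    return parse_memory(subprocess.check_output(QUERY).decode())
```

Calling `gpu_memory()` every few seconds from a separate script and logging the result would show whether usage climbs before the freeze.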

The `_generate_batch` helper used by `distorted_inputs`:

def _generate_batch(image, label,
                    min_queue_examples=MIN_QUEUE_EXAMPLES,
                    batch_size=BATCH_SIZE):
    # Shuffle examples in a queue and dequeue them in batches.
    images_batch, labels_batch = tf.train.shuffle_batch(
        [image, label], batch_size=batch_size,
        num_threads=12, capacity=min_queue_examples + 3 * batch_size,
        min_after_dequeue=min_queue_examples)
    tf.summary.image('images', images_batch)
    return images_batch, labels_batch
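
The shuffle queue's capacity also bounds how much image data it can hold, which is worth estimating when memory is under suspicion. This is a rough sketch that ignores the labels and any queue overhead; the value 1000 for `MIN_QUEUE_EXAMPLES` is hypothetical, since the post does not give it.

```python
BYTES_PER_FLOAT32 = 4


def queue_bytes(min_queue_examples, batch_size, shape=(256, 256, 3)):
    """Upper bound on image bytes held by the shuffle queue when full."""
    # Same capacity formula as in _generate_batch above.
    capacity = min_queue_examples + 3 * batch_size
    values_per_image = 1
    for dim in shape:
        values_per_image *= dim
    return capacity * values_per_image * BYTES_PER_FLOAT32


# 1000 is a hypothetical MIN_QUEUE_EXAMPLES, not a value from the post.
print(queue_bytes(1000, 64) / 2**30, 'GiB')
```

Each 256x256x3 float32 image takes 786,432 bytes, so the estimate grows linearly with the queue capacity.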