Reading TFRecord image data with the new TensorFlow Dataset API


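I'm having trouble reading TFRecord-format image data with the "new" (TensorFlow v1.4) Dataset API. I believe the problem is that I am somehow consuming the entire dataset, rather than a single batch, when I try to read the data. I have a working example that uses the batch/file-queue API here: (in that example I run a classifier, but the code that reads the TFRecord images is in the DataReaders.py class)

I believe the problem is in this code: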

def parse_mnist_tfrec(tfrecord, features_shape):
    tfrecord_features = tf.parse_single_example(
        tfrecord,
        features={
            'features': tf.FixedLenFeature([], tf.string),
            'targets': tf.FixedLenFeature([], tf.string)
        }
    )
    features = tf.decode_raw(tfrecord_features['features'], tf.uint8)
    features = tf.reshape(features, features_shape)
    features = tf.cast(features, tf.float32)
    targets = tf.decode_raw(tfrecord_features['targets'], tf.uint8)
    targets = tf.one_hot(indices=targets, depth=10, on_value=1, off_value=0)
    targets = tf.cast(targets, tf.float32)
    return features, targets

class MNISTDataReaderDset:
    def __init__(self, data_reader_dict):
        # doesn't matter here
        pass

    def batch_generator(self, num_epochs=1):
        def parse_fn(tfrecord):
            # parse_mnist_tfrec takes (tfrecord, features_shape)
            return parse_mnist_tfrec(tfrecord, self.features_shape)
        dataset = tf.data.TFRecordDataset(
            self.filenames_list, compression_type=self.compression_type
        )
        dataset = dataset.map(parse_fn)
        dataset = dataset.repeat(num_epochs)
        dataset = dataset.batch(self.batch_size)
        iterator = dataset.make_one_shot_iterator()
        batch_features, batch_labels = iterator.get_next()
        return batch_features, batch_labels

Then, in use:

        batch_features, batch_labels = \
            data_reader.batch_generator(num_epochs=1)

        sess.run(tf.local_variables_initializer())
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
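        try:
            # look at 3 batches only
            for _ in range(3):
                labels, feats = sess.run([
                    batch_labels, batch_features
                ])
        finally:
            # stop and join the queue-runner threads started above
            coord.request_stop()
            coord.join(threads)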

This produces an error like the following:

 [[Node: Reshape_1 = Reshape[T=DT_UINT8, Tshape=DT_INT32](DecodeRaw_1, Reshape_1/shape)]]
 Input to reshape is a tensor with 50000 values, but the requested shape has 1
 [[Node: Reshape_1 = Reshape[T=DT_UINT8, Tshape=DT_INT32](DecodeRaw_1, Reshape_1/shape)]]
 [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,28,28,1], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]

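Does anyone have any ideas?

I have put the full code in a gist with the reader example, and there is a link to the TFRecord files (our old friend MNIST, in TFRecord format) here:

Thanks!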

Edit: I also tried flat_map, e.g.:

def batch_generator(self, num_epochs=1):
    """
    TODO - we can use placeholders for the list of file names and
    init with a feed_dict when we call `sess.run` - give this a
    try with one list for training and one for validation
    """
    def parse_fn(tfrecord):
        # parse_mnist_tfrec takes (tfrecord, features_shape)
        return parse_mnist_tfrec(tfrecord, self.features_shape)
    dataset = tf.data.Dataset.from_tensor_slices(self.filenames_list)
    dataset = dataset.flat_map(
        lambda filename: (
            tf.data.TFRecordDataset(
                filename, compression_type=self.compression_type
            ).map(parse_fn).batch(self.batch_size)
        )
    )
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels

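I also tried using just a single file rather than a list (that was my first attempt at the approach above). Either way, TF always seems to want to swallow the entire file into the TFRecordDataset and will not operate on individual records.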

OK, I figured this out: the code above is fine. The problem was my script for creating the TFRecords. Basically, I had a block like this:

def write_tfrecord(reader, start_idx, stop_idx, tfrecord_file):
    writer = tf.python_io.TFRecordWriter(tfrecord_file)
    tfeat, ttarg = get_binary_data(reader, start_idx, stop_idx)
    example = tf.train.Example(
        features=tf.train.Features(
            feature={
                'features': tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[tfeat])
                ),
                'targets': tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[ttarg])
                )
            }
        )
    )
    writer.write(example.SerializeToString())
    writer.close()

I needed a block like this instead:

def write_tfrecord(reader, start_idx, stop_idx, tfrecord_file):
    writer = tf.python_io.TFRecordWriter(tfrecord_file)
    for idx in range(start_idx, stop_idx):
        tfeat, ttarg = get_binary_data(reader, idx)
        example = tf.train.Example(
            features=tf.train.Features(
                feature={
                    'features': tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[tfeat])
                    ),
                    'targets': tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[ttarg])
                    )
                }
            )
        )
        writer.write(example.SerializeToString())
    writer.close()

That is, I was basically writing my entire block of data as one giant TFRecord, when I needed to write one record per example in the data.

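It turns out that with the old file and batch-queue API, everything works either way: functions like tf.train.batch are auto-magically "smart" enough to either split the big chunk up or concatenate lots of single-example records into a batch, depending on what they are given. When I fixed the code that generates the TFRecords files, I did not need to change anything in my old file and batch-queue code, and it still consumed the TFRecords files just fine. The Dataset API, however, is sensitive to this difference. That is why, in my code above, it always appeared to be consuming the entire file: the whole file really was one big record.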
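The effect can be mimicked with a toy model in plain Python. This is not TensorFlow code: the batch helper and the six sample labels are invented for illustration, and the model only captures the idea that batching groups whole records and never looks inside them.

```python
from itertools import islice


def batch(records, batch_size):
    """Group an iterable of records into lists of at most batch_size items."""
    it = iter(records)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


# Hypothetical data: six one-byte labels.
labels = bytes([3, 1, 4, 1, 5, 9])

# Layout A (my bug): the whole block written as ONE record.
one_big_record = [labels]
# Layout B (the fix): one record per example.
per_example_records = [bytes([b]) for b in labels]

batches_a = list(batch(one_big_record, batch_size=2))
batches_b = list(batch(per_example_records, batch_size=2))

print(len(batches_a), [len(r) for r in batches_a[0]])  # 1 [6]
print(len(batches_b), [len(r) for r in batches_b[0]])  # 3 [1, 1]
```

With layout A the first "batch" contains a single record holding every label, which is the same shape of failure as the reshape error above (50000 values where 1 was requested).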