TensorFlow Dataset API, iterators and tf.contrib.data.rejection_resample


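[Edit #1, after @mrry's comment] I am using the Dataset API together with tf.contrib.data.rejection_resample to impose a specific distribution on the input training pipeline.

Before adding tf.contrib.data.rejection_resample to the input_fn I used a one-shot iterator. Alas, once I started using it I switched to dataset.make_initializable_iterator(). This is required because resampling introduces stateful variables into the pipeline, and the iterator has to be initialized AFTER all variables in the input pipeline have been initialized, as @mrry wrote.

I pass the input_fn to an Estimator, which is wrapped by an Experiment.

The question is: where do I hook the iterator's initialization? If I try: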

dataset = dataset.batch(batch_size)
if self.balance:
   dataset = tf.contrib.data.rejection_resample(dataset, self.class_mapping_function, self.dist_target)
   iterator = dataset.make_initializable_iterator()
   tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, iterator.initializer)
else:
   iterator = dataset.make_one_shot_iterator() 

image_batch, label_batch = iterator.get_next()
print (image_batch)

and the mapping function:

def class_mapping_function(self, feature, label):
    """
    Returns a function to be used with dataset.map() to return the numeric class ID.
    The function is mapping a nested structure of tensors (having shapes and types defined by dataset.output_shapes
    and dataset.output_types) to a scalar tf.int32 tensor. Values should be in [0, num_classes).
    """
    # For simplicity, trying to return the label itself as I assume its numeric...

    return tf.cast(label, tf.int32)  # <-- I guess this is the bug

But with the initializable iterator, the tensor shape information is missing:

Tensor("train_input_fn/IteratorGetNext:0", shape=(?,), dtype=int32, device=/device:CPU:0)

Any help will be appreciated.

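[Edit #2, after @mrry's comment that this looks like a different dataset] The real issue here may not be the iterator's init sequence but the mapping function used by tf.contrib.data.rejection_resample, which returns tf.int32. But how should the mapping function be defined so that the dataset keeps a shape such as (?, 100, 100, 3)?

[Edit #3] From the implementation of rejection_resample:

class_values_ds = dataset.map(class_func)

So it makes sense that class_func is applied to the elements of the dataset and yields a dataset of tf.int32 values.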
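The `map` call above can be mimicked in plain Python, with a list standing in for a `Dataset`. This is only an illustrative sketch: the `dataset` contents and feature shapes are made up, and nothing here is TensorFlow code. It shows that `class_func` is called once per element and only has to produce an integer class ID, so it does not need to preserve the shape of the features.

```python
# Plain-Python stand-in for dataset.map(class_func); not TensorFlow code.
# Hypothetical dataset: (feature, label) pairs with small nested-list features.
dataset = [
    ([[0.1, 0.2], [0.3, 0.4]], 2),
    ([[0.5, 0.6], [0.7, 0.8]], 0),
    ([[0.9, 1.0], [1.1, 1.2]], 2),
]


def class_func(feature, label):
    """Return the integer class ID of one element; the feature is ignored."""
    return int(label)


# Equivalent of: class_values_ds = dataset.map(class_func)
class_values = [class_func(feature, label) for feature, label in dataset]
print(class_values)  # [2, 0, 2]
```

The features are left untouched in `dataset`; only the derived class IDs end up in `class_values`.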

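Following @mrry's response I came up with a solution for using the Dataset API with tf.contrib.data.rejection_resample (using TF 1.3).

The goal

Given a feature/label dataset with some distribution, have the input pipeline reshape the distribution to a specific target distribution.

Numerical example

Let's assume we are building a network that classifies some feature into one of 10 classes, and that we only have 100 features with some random distribution of labels: 30 features are labeled as class 1, 5 features as class 2, and so on. During training we do not want to prefer class 1 over class 2, so we would like each mini-batch to hold a uniform distribution over all classes.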
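To make the numbers concrete, here is a small sketch of the textbook rejection-sampling rule for this situation. Only the counts 30 and 5 come from the example; the remaining counts are invented so that they sum to 100, and the classes are indexed from 0. The acceptance rule shown is the generic one and is not claimed to match the exact computation inside tf.contrib.data.rejection_resample.

```python
# Hypothetical label counts for 100 examples over 10 classes. Only the first
# two values (30 and 5) come from the text; the others are made up.
counts = [30, 5, 10, 10, 5, 10, 5, 10, 5, 10]
num_classes = len(counts)
total = sum(counts)

initial_dist = [c / total for c in counts]
target_dist = [1.0 / num_classes] * num_classes  # uniform target

# Textbook rejection sampling: accept an example of class c with probability
# proportional to target/initial, scaled so that the largest ratio becomes 1.
ratios = [t / p for t, p in zip(target_dist, initial_dist)]
max_ratio = max(ratios)
accept_prob = [r / max_ratio for r in ratios]

# Probability mass that survives per class; equal values mean a uniform result.
accepted_mass = [p * a for p, a in zip(initial_dist, accept_prob)]

for cls in range(num_classes):
    print("class {}: initial {:.2f}, accept with probability {:.3f}".format(
        cls, initial_dist[cls], accept_prob[cls]))
```

Over-represented classes get a low acceptance probability and the rarest classes are always kept, which is also why resampling discards part of the input data.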

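The solution

Using tf.contrib.data.rejection_resample allows setting a specific distribution for our input pipeline.

According to the documentation, tf.contrib.data.rejection_resample takes:

(1) dataset - the dataset you want to balance

(2) class_func - a function that generates a new dataset of numeric labels from the original dataset only

(3) target_dist - a vector whose size is the number of classes, specifying the required new distribution

(4) some more optional values - skipped for now

and, as the documentation says, it returns a `Dataset`.

It turns out that the structure of the output dataset differs from that of the input dataset. As a consequence, the returned dataset (as implemented in TF 1.3) has to be mapped back by the user like this: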

    balanced_dataset = tf.contrib.data.rejection_resample(input_dataset,
                                                          self.class_mapping_function,
                                                          self.target_distribution)

    # Return to the same Dataset shape as was the original input
    balanced_dataset = balanced_dataset.map(lambda _, data: (data))
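The structure of the returned elements can be illustrated with a plain-Python sketch. It is a simplified stand-in, not the TensorFlow implementation: `rejection_resample_sketch`, the two-class input data and the known `initial_dist` are all assumptions made for illustration. Like the TF 1.3 function described above, it emits `(class_id, element)` pairs, which is why the class ID has to be dropped afterwards.

```python
import random


def rejection_resample_sketch(elements, class_func, target_dist, initial_dist, rng):
    """Yield (class_id, element) pairs whose classes roughly follow target_dist."""
    ratios = [t / p for t, p in zip(target_dist, initial_dist)]
    max_ratio = max(ratios)
    for element in elements:
        class_id = class_func(*element)
        # Keep the element with probability proportional to target/initial.
        if rng.random() < ratios[class_id] / max_ratio:
            yield class_id, element


rng = random.Random(0)
# Hypothetical input: (feature, label) pairs, 90% class 0 and 10% class 1.
input_data = [("feature_%d" % i, 0 if i % 10 else 1) for i in range(10000)]

resampled = list(rejection_resample_sketch(
    input_data, lambda feature, label: label, [0.5, 0.5], [0.9, 0.1], rng))

# Each output is (class_id, original_element). Dropping the class ID restores
# the original (feature, label) structure, like .map(lambda _, data: data).
balanced = [data for _, data in resampled]

share_of_class_1 = sum(label for _, label in balanced) / len(balanced)
print(balanced[0], round(share_of_class_1, 2))
```

With these made-up inputs the share of class 1 ends up close to the 0.5 target, while roughly 80% of the input is discarded.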

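A note on the iterator type. As @mrry explained, when the pipeline contains stateful objects you should use an initializable iterator rather than a one-shot iterator. Note that when using an initializable iterator you should add its init_op to TABLE_INITIALIZERS, otherwise you will receive this error: "GetNext() failed because the iterator has not been initialized."

Code example: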

# Creating the iterator, that allows to access elements from the dataset
if self.use_balancing:
    # For balancing function, we use stateful variables in the sense that they hold current dataset distribution
    # and calculate next distribution according to incoming examples.
    # For dataset pipeline that have state, one_shot iterator will not work, and we are forced to use
    # initializable iterator
    # This should be relaxed in the future.
    # https://stackoverflow.com/questions/44374083/tensorflow-cannot-capture-a-stateful-node-by-value-in-tf-contrib-data-api
    iterator = dataset.make_initializable_iterator()
    tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, iterator.initializer)

else:
    iterator = dataset.make_one_shot_iterator()

image_batch, label_batch = iterator.get_next()

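Does it work?

Yes. Below are two images taken from TensorBoard, showing histograms collected on the input pipeline labels. The original input labels were uniformly distributed.

Scenario A: trying to achieve the following 10-class distribution: [0.1, 0.4, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.1, 0.1]

The result: [TensorBoard label histogram for scenario A; image not available]

Scenario B: trying to achieve the following 10-class distribution: [0.1, 0.1, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.4, 0.1]

The result: [TensorBoard label histogram for scenario B; image not available]

Here is a simple example demonstrating the use of tf.data.experimental.sample_from_datasets (thanks to @Agade for the idea):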

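import math
import tensorflow as tf
import numpy as np

def print_dataset(name, dataset):
    elems = np.array([v.numpy() for v in dataset])
    print("Dataset {} contains {} elements :".format(name, len(elems)))
    print(elems)

def combine_datasets_balanced(dataset_smaller, size_smaller, dataset_bigger, size_bigger, batch_size):
    ds_smaller_repeated = dataset_smaller.repeat(count=int(math.ceil(size_bigger / size_smaller)))
    # we repeat the smaller dataset so that the 2 datasets are about the same size
    balanced_dataset = tf.data.experimental.sample_from_datasets([ds_smaller_repeated, dataset_bigger], weights=[0.5, 0.5])
    # each element in the resulting dataset is randomly drawn (without replacement)
    # from dataset even with probability 0.5 or from dataset odd with probability 0.5
    balanced_dataset = balanced_dataset.take(2 * size_bigger).batch(batch_size)
    return balanced_dataset

N, M = 3, 10
even = tf.data.Dataset.range(0, 2 * N, 2).repeat(count=int(math.ceil(M / N)))
odd = tf.data.Dataset.range(1, 2 * M, 2)
even_odd = combine_datasets_balanced(even, N, odd, M, 2)

print_dataset("even", even)
print_dataset("odd", odd)
print_dataset("even_odd", even_odd)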

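Can you share the code? It looks like the two iterators are created from two different Dataset objects, which might explain why they have different inferred shapes (and types!).

Hi @mrry. Thanks for the quick reply! I added all the information, and I think you definitely spotted the problem. I believe it is not the init process but rather a misuse of the mapping function in tf.contrib.data.rejection_resample. If you agree, I would appreciate your input on how to define the mapping function, as I could not find a reference for it. Thanks!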


Output of the even/odd sample_from_datasets example:

Dataset even contains 12 elements :  # 12 = 4 x N  (because of .repeat)
[0 2 4 0 2 4 0 2 4 0 2 4]
Dataset odd contains 10 elements :
[ 1  3  5  7  9 11 13 15 17 19]
Dataset even_odd contains 10 elements :  # 10 = 2 x M / 2  (2xM because of .take(2 * M) and /2 because of .batch(2))
[[ 0  2]
 [ 1  4]
 [ 0  2]
 [ 3  4]
 [ 0  2]
 [ 4  0]
 [ 5  2]
 [ 7  4]
 [ 0  9]
 [ 2 11]]
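The element counts in the comments of this output can be checked with a few lines of arithmetic. This is only a sketch in plain Python: it assumes the values N = 3, M = 10 and a batch size of 2 from the example, and that the combining step repeats the smaller dataset a second time before sampling.

```python
import math

N, M = 3, 10        # sizes of the even and odd ranges in the example
batch_size = 2

# "even" is a range of N values repeated ceil(M / N) times.
even_len = N * math.ceil(M / N)
odd_len = M

# The combining step repeats the smaller dataset again, so more than 2 * M
# elements are available before .take(2 * M) truncates the result.
available = even_len * math.ceil(M / N) + odd_len
taken = min(2 * M, available)
num_batches = math.ceil(taken / batch_size)

print(even_len, odd_len, num_batches)  # 12 10 10
```

The three printed values correspond to the 12 elements of `even`, the 10 elements of `odd` and the 10 batches of `even_odd`.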