Group a TensorFlow dataset by key and batch by key


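I'm currently working on a problem in TensorFlow where I need to produce batches in which all tensors in a batch share a specific key value. I'm trying to use the Dataset API if possible. Can this be done?

filter, map and apply all operate on individual elements, whereas I need a way to group by key. I came across tf.data.experimental.group_by_window and tf.data.experimental.group_by_reducer, which look promising, but I haven't been able to work out a solution.

An example is probably the best way to show it: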

dataset:

feature,label
1,word1
2,word2
3,word3
1,word1
3,word3
1,word1
1,word1
2,word2
3,word3
1,word1
3,word3
1,word1
1,word1

Grouping by the "key" feature with a maximum batch size of 3 gives the batches:

batch1
[[1,word1],
 [1,word1],
 [1,word1]]
batch2
[[1,word1],
 [1,word1],
 [1,word1]]
batch3
[[1,word1]]
batch4
[[2,word2],
 [2,word2]]
batch5
[[3,word3],
 [3,word3],
 [3,word3]]
batch6
[[3,word3]]

Edit: despite what the example suggests, the order of the batches doesn't matter.
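To pin down the expected result independently of TensorFlow, here is a minimal pure-Python sketch of the grouping described above. The helper name `batch_by_key` is made up for illustration, and it assumes the whole dataset fits in memory, which a real tf.data pipeline would not require.

```python
from collections import defaultdict


def batch_by_key(rows, max_batch_size):
    """Group (feature, label) rows by feature, then split each group into batches."""
    groups = defaultdict(list)
    for feature, label in rows:
        groups[feature].append((feature, label))

    batches = []
    # Sorting the keys is only for readable output; batch order is not important.
    for key in sorted(groups):
        group = groups[key]
        for start in range(0, len(group), max_batch_size):
            batches.append(group[start:start + max_batch_size])
    return batches


# The sample dataset from the question.
rows = [(1, "word1"), (2, "word2"), (3, "word3"), (1, "word1"), (3, "word3"),
        (1, "word1"), (1, "word1"), (2, "word2"), (3, "word3"), (1, "word1"),
        (3, "word3"), (1, "word1"), (1, "word1")]

for i, batch in enumerate(batch_by_key(rows, 3), start=1):
    print(f"batch{i}", batch)
```

This yields six batches of sizes 3, 3, 1, 2, 3 and 1, each containing a single key, which matches the batches listed above.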

I think this does the transformation you want:

import tensorflow as tf
import random

random.seed(100)
# Input data
label = list(range(15))
# Shuffle data
random.shuffle(label)
# Make feature from label data
feature = [lbl // 5 for lbl in label]
batch_size = 3

print('Data:')
print(*zip(feature, label), sep='\n')

with tf.Graph().as_default(), tf.Session() as sess:
    # Make dataset from data arrays
    ds = tf.data.Dataset.from_tensor_slices({'feature': feature, 'label': label})
    # Group by window
    ds = ds.apply(tf.data.experimental.group_by_window(
        # Use feature as key
        key_func=lambda elem: tf.to_int64(elem['feature']),
        # Convert each window to a batch
        reduce_func=lambda _, window: window.batch(batch_size),
        # Use batch size as window size
        window_size=batch_size))
    # Iterator
    iter = ds.make_one_shot_iterator().get_next()
    # Show dataset contents
    print('Result:')
    while True:
        try:
            print(sess.run(iter))
        except tf.errors.OutOfRangeError: break

Output:

Data:
(2, 11)
(1, 8)
(2, 12)
(0, 3)
(1, 9)
(0, 0)
(0, 4)
(0, 1)
(2, 10)
(1, 5)
(1, 6)
(2, 14)
(2, 13)
(1, 7)
(0, 2)
Result:
{'feature': array([0, 0, 0]), 'label': array([3, 0, 4])}
{'feature': array([2, 2, 2]), 'label': array([11, 12, 10])}
{'feature': array([1, 1, 1]), 'label': array([8, 9, 5])}
{'feature': array([0, 0]), 'label': array([1, 2])}
{'feature': array([1, 1]), 'label': array([6, 7])}
{'feature': array([2, 2]), 'label': array([14, 13])}
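The order of the result above can look surprising at first, so here is a rough pure-Python model of the windowing behaviour. It is not TensorFlow's implementation: `simulate_group_by_window` is a hypothetical helper, and flushing the leftover windows in ascending key order is an assumption chosen because it matches the output shown. The idea is that elements are buffered per key, a window is emitted as soon as it holds `window_size` elements, and partially filled windows come out once the input is exhausted.

```python
def simulate_group_by_window(elements, key_func, window_size):
    """Yield lists of same-key elements, emitting each window as it fills up."""
    pending = {}  # key -> elements buffered so far for that key
    for elem in elements:
        key = key_func(elem)
        window = pending.setdefault(key, [])
        window.append(elem)
        if len(window) == window_size:
            # A full window is handed on immediately, like reduce_func would see it.
            yield pending.pop(key)
    # Assumption: leftover partial windows are flushed in ascending key order.
    for key in sorted(pending):
        yield pending[key]


# The (feature, label) pairs printed under "Data:" above.
data = [(2, 11), (1, 8), (2, 12), (0, 3), (1, 9), (0, 0), (0, 4), (0, 1),
        (2, 10), (1, 5), (1, 6), (2, 14), (2, 13), (1, 7), (0, 2)]

for batch in simulate_group_by_window(data, key_func=lambda e: e[0], window_size=3):
    print({'feature': [f for f, _ in batch], 'label': [l for _, l in batch]})
```

Because the window size equals the batch size, each full window becomes exactly one batch, and key 0 comes out first simply because it is the first key to collect three elements.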

In TF2, you would change the code slightly:

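# Make dataset from data arrays
ds = tf.data.Dataset.from_tensor_slices({'feature': feature, 'label': label})
# Group by window
ds = ds.apply(tf.data.experimental.group_by_window(
    # Use feature as key
    key_func=lambda elem: tf.cast(elem['feature'], tf.int64),
    # Convert each window to a batch
    reduce_func=lambda _, window: window.batch(batch_size),
    # Use batch size as window size
    window_size=batch_size))
# Show dataset contents
print('Result:')
for element in ds:
    print(element)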