Python: How to use tf.bucket_by_sequence_length in TensorFlow effectively?

Tags: python, tensorflow, deep-learning, bucket

So, I am trying to use tf.bucket_by_sequence_length() from TensorFlow, but I cannot quite figure out how to make it work.

Basically, it should take sequences (of varying lengths) as input and produce buckets of sequences as output, but it does not seem to work that way.
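
To make the expectation concrete, here is a toy sketch in plain Python (no TensorFlow, invented data) of the grouping behaviour I have in mind, with a single length boundary:

# Toy illustration only: group variable-length sequences into length buckets.
sequences = [[1, 2], [3, 4, 5, 6], [7], [8, 9, 10]]
boundary = 3                      # one boundary -> two buckets: len < 3 and len >= 3
buckets = {0: [], 1: []}
for seq in sequences:
    buckets[0 if len(seq) < boundary else 1].append(seq)
# buckets[0] == [[1, 2], [7]]
# buckets[1] == [[3, 4, 5, 6], [8, 9, 10]]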

From this discussion: I got the impression that it needs a queue to feed the function with the sequences, but it is still not clear.

The function's documentation can be found here:

Indeed, you need the input tensors to come from a queue; it can be tf.FIFOQueue().dequeue(), or tf.TensorArray().read(tf.train.range_input_producer()).

This notebook explains it well:
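
To sketch how this might look with the old queue-based API (a rough, untested sketch assuming TF 1.x and tf.contrib.training.bucket_by_sequence_length; the queue setup, dtypes, and boundaries are illustrative choices, not taken from the question):

import tensorflow as tf

# Queue holding individual variable-length sequences; the enqueue ops are
# assumed to run elsewhere (e.g. via a tf.train.QueueRunner).
queue = tf.FIFOQueue(capacity=32, dtypes=[tf.int32])
sequence = queue.dequeue()            # one variable-length 1-D sequence
length = tf.shape(sequence)[0]

# Returns the lengths of the batched sequences and the bucketed, padded batch.
lengths, batch = tf.contrib.training.bucket_by_sequence_length(
    input_length=length,
    tensors=[sequence],
    batch_size=32,
    bucket_boundaries=[10, 20, 30],   # 4 buckets: <10, 10-19, 20-29, >=30
    dynamic_pad=True)                 # pad within each bucketed batch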


My answer below is based on TensorFlow 2.0. From your question I can see that you are probably using an older version of TensorFlow, but if you happen to be on the newer version, you can use the bucket_by_sequence_length API effectively in the following way.

import tensorflow as tf

# This will be used by bucket_by_sequence_length to batch the elements according to their length.
def _element_length_fn(x, y=None):
    return tf.shape(x)[0]


# These are the upper length boundaries for the buckets.
# Based on these boundaries, the sentences will be shifted to different buckets.
# You can have as many boundaries as you want, but make sure that the last upper
# boundary covers the maximum length of the sentences in your dataset.
boundaries = [upper_boundary_for_batch]  # define the upper boundaries for the different buckets here

# These define the batch sizes for the different buckets.
# I am keeping the batch_size for each bucket the same, but this can be changed based on further analysis.
# As per the documentation, this is the batch size per bucket, and its length should be len(bucket_boundaries) + 1.
# https://www.tensorflow.org/api_docs/python/tf/data/experimental/bucket_by_sequence_length
batch_sizes = [batch_size] * (len(boundaries) + 1)

# bucket_by_sequence_length returns a dataset transformation function that has to be applied with dataset.apply.
# The important parameter here is pad_to_bucket_boundary. If it is set to True, the sentences will be padded up to
# the bucket boundaries provided; if set to False, they will be padded to the maximum length found in the batch.
# The default padding value is 0, so we do not need to supply anything extra here.
dataset = dataset.apply(tf.data.experimental.bucket_by_sequence_length(_element_length_fn, boundaries,
                                                                       batch_sizes,
                                                                       drop_remainder=True,
                                                                       pad_to_bucket_boundary=True))
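
For context, here is a minimal end-to-end sketch of how the pieces above could fit together; the toy generator, batch_size, and upper_boundary_for_batch values are placeholders I chose for illustration (assuming TF 2.4+ for output_signature):

import tensorflow as tf

batch_size = 2
upper_boundary_for_batch = 10  # must cover the longest sequence in the dataset

def gen():
    # Variable-length integer sequences standing in for tokenized sentences.
    yield from ([1, 2], [3, 4, 5], [6, 7, 8, 9], [1])

dataset = tf.data.Dataset.from_generator(
    gen, output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32))

boundaries = [upper_boundary_for_batch]
batch_sizes = [batch_size] * (len(boundaries) + 1)

dataset = dataset.apply(tf.data.experimental.bucket_by_sequence_length(
    lambda x: tf.shape(x)[0],  # same as _element_length_fn above
    boundaries, batch_sizes,
    drop_remainder=True, pad_to_bucket_boundary=True))

for batch in dataset:
    print(batch.shape)  # padded to boundary - 1, i.e. (2, 9) for each batch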