ValueError: Can't convert non-rectangular Python sequence to Tensor when using tf.data.Dataset.from_tensor_slices


python tensorflow


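This question has been posted on SO several times, but I still can't figure out what is wrong with my code, especially since it comes from a tutorial and the author made the code available on Google.

I have seen other users run into problems with wrong variable types (which is not my case, since my model input is the output of the tokenizer), and I have even seen the function I am trying to use (tf.data.Dataset.from_tensor_slices) suggested as a solution.

The line that raises the error is: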
# train dataset
ds_train_encoded = encode_examples(ds_train).shuffle(10000).batch(batch_size)

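In the encode_examples method I inserted an assert line to make sure my problem is not mismatched lengths.
The data is loaded with the dataset changed so that it takes only 10% of the training data, to speed up debugging.
The other two calls (convert_example_to_feature and map_example_to_dict) and the tokenizer are as follows: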
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
def convert_example_to_feature(text):
    # combine step for tokenization, WordPiece vector mapping, adding special tokens as well as truncating reviews longer than the max length
    return tokenizer.encode_plus(text,
                                 add_special_tokens = True, # add [CLS], [SEP]
                                 #max_length = max_length, # max length of the text that can go to BERT
                                 pad_to_max_length = True, # add [PAD] tokens
                                 return_attention_mask = True,)# add attention mask to not focus on pad tokens

def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    return ({"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_masks,
            }, label)

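I suspect the error may be related to different TensorFlow versions (I am using 2.3), but unfortunately, for memory reasons, I cannot run these snippets in the google.colab notebook.
Does anyone know what is wrong with my code? Thanks for your time and attention.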
It turned out that commenting out this line was what caused the trouble:
#max_length = max_length, # max length of the text that can go to BERT

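I assumed the tokenizer would either truncate to the model's maximum size or take the longest input as the maximum size. It does neither, so even though I had the same number of entries, those entries differed in size, which produced a non-rectangular tensor.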
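The situation described above can be reproduced without TensorFlow. The following is a minimal sketch in plain Python; the helper names `row_lengths` and `is_rectangular` and the sample token IDs are hypothetical, made up for illustration. It shows how lists can pass a check on the number of entries and still be ragged.

```python
def row_lengths(rows):
    """Return the set of distinct lengths found among the rows."""
    return {len(row) for row in rows}


def is_rectangular(rows):
    """True if every row has the same length (or there are no rows)."""
    return len(row_lengths(rows)) <= 1


# Hypothetical token IDs for two encoded reviews that ended up with
# different lengths, plus one label per review.
input_ids_list = [[101, 2023, 102], [101, 2023, 3185, 2001, 102]]
label_list = [0, 1]

# The kind of check an assert on list sizes performs: it passes.
same_number_of_entries = len(input_ids_list) == len(label_list)

# The check that matters for building a dense tensor: it fails.
rectangular = is_rectangular(input_ids_list)
```

An assert on `is_rectangular(input_ids_list)` placed just before the call to tf.data.Dataset.from_tensor_slices would likely have pointed at the tokenizer settings rather than at TensorFlow.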
I uncommented that line and used 512 as the maximum length, which is the most that BERT can take.
Another possible cause is that truncation has to be enabled explicitly in the tokenizer. The parameter is truncation=True.
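Putting both answers together, the tokenizer call might look like the sketch below. It assumes a transformers release (3.0 or later) in which `encode_plus` accepts `padding` and `truncation`; `padding="max_length"` is used here in place of the older `pad_to_max_length=True`, which is a substitution not made in the answers themselves. Passing the tokenizer in as an argument is also a choice made for this sketch, and the default of 512 follows the first answer.

```python
def convert_example_to_feature(tokenizer, text, max_length=512):
    """Encode one text so that every example comes out max_length long.

    `tokenizer` is assumed to be a Hugging Face tokenizer such as the
    BertTokenizer from the question.
    """
    return tokenizer.encode_plus(
        text,
        add_special_tokens=True,     # add [CLS], [SEP]
        max_length=max_length,       # 512 is the limit named in the answer
        padding="max_length",        # pad shorter texts up to max_length
        truncation=True,             # cut longer texts down to max_length
        return_attention_mask=True,  # mask so the model ignores [PAD] tokens
    )
```

With the tokenizer from the question this would be called as `convert_example_to_feature(tokenizer, review_text)`, where `review_text` is a placeholder name. Each returned `input_ids`, `token_type_ids` and `attention_mask` list should then have the same length, so tf.data.Dataset.from_tensor_slices receives rectangular data.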
