The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1

python tensorflow pytorch

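I am trying to do text classification with a pretrained BERT model. I trained the model on my dataset and am now in the testing phase. I know that BERT can only take up to 512 tokens, so I wrote an if condition that checks the length of the test sentence in my dataframe. If it is longer than 512, I split the sentence into several sequences of 512 tokens each and then run the tokenizer's encode on each one. The length of a sequence is 512, but after tokenizer encoding the length becomes 707, and I get this error: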

The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1

Here is the code I used for the steps above:

import math

import numpy as np
import torch
from transformers import BertTokenizer  # assumed source of BertTokenizer

# `model` (the fine-tuned BERT model) and `test_sentence_in_df` (one test
# sentence taken from the dataframe) are assumed to be defined earlier.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

pred = []
if len(test_sentence_in_df.split()) > 512:
    # Number of 512-word chunks needed to cover the whole sentence.
    n = math.ceil(len(test_sentence_in_df.split()) / 512)
    for i in range(n):
        if i == (n - 1):
            print(i)
            # Last chunk: everything that is left.
            test_sentence = ' '.join(test_sentence_in_df.split()[i * 512:])
        else:
            print("i in else", str(i))
            test_sentence = ' '.join(test_sentence_in_df.split()[i * 512:(i + 1) * 512])

            # print(len(test_sentence.split()))  # here the length is 512 (words)
        tokenized_sentence = tokenizer.encode(test_sentence)
        input_ids = torch.tensor([tokenized_sentence]).cuda()
        print(len(tokenized_sentence))  # here the length is 707 (tokens)
        with torch.no_grad():
            output = model(input_ids)
            label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
        pred.append(label_indices)

print(pred)

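This is because BERT uses WordPiece tokenization. When a word is not in the vocabulary, the tokenizer splits it into word pieces. For example, if the word playing is not in the vocabulary, it can be split into play, ##ing. This increases the number of tokens in a sentence after tokenization, which is how a chunk of 512 whitespace-separated words can turn into 707 tokens. You can pass some parameters to get fixed-length tokenization: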

tokenized_sentence = tokenizer.encode(test_sentence, padding=True, truncation=True, max_length=50, add_special_tokens=True)
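As a rough illustration of what truncation and padding do to a list of token ids, here is a minimal pure-Python sketch. The helper name `to_fixed_length` and the pad id of 0 are assumptions made for this example, and the sketch ignores the special tokens that a real tokenizer keeps at both ends. In the Hugging Face tokenizers, `padding=True` pads only up to the longest sequence in the batch, while `padding='max_length'` pads up to `max_length`; note also that truncation drops every token past the limit, so that part of the text never reaches the model.

```python
def to_fixed_length(token_ids, max_length, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly max_length.

    pad_id=0 is an assumption for this sketch; with a real tokenizer,
    read the value from tokenizer.pad_token_id instead.
    """
    clipped = token_ids[:max_length]  # truncation: extra tokens are dropped
    padding = [pad_id] * (max_length - len(clipped))
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(clipped) + [0] * len(padding)
    return clipped + padding, attention_mask


# A sequence that is too long is cut down to max_length.
long_ids, long_mask = to_fixed_length(list(range(1, 8)), max_length=5)
print(long_ids, long_mask)

# A sequence that is too short is padded up to max_length.
short_ids, short_mask = to_fixed_length([7, 8, 9], max_length=5)
print(short_ids, short_mask)
```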

If the encode() function does not work, then batch_encode_plus() should work.
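If you want to keep the whole text rather than truncate it, one option is to split on tokens instead of on whitespace-separated words. The sketch below is only an illustration under stated assumptions: `wordpiece` is a toy greedy longest-match splitter with a made-up vocabulary (`toy_vocab`), not BERT's real tokenizer, and `chunk_tokens` is a hypothetical helper. It shows why the token count exceeds the word count, and how to cut windows that stay within the limit once [CLS] and [SEP] are added.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that continue a word carry the "##" prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # nothing matched: treat the whole word as unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def chunk_tokens(tokens, max_len, cls="[CLS]", sep="[SEP]"):
    """Split tokens into windows of at most max_len, special tokens included."""
    body = max_len - 2  # reserve two positions for [CLS] and [SEP]
    return [[cls] + tokens[i:i + body] + [sep]
            for i in range(0, len(tokens), body)]


# Toy vocabulary, invented for this example only.
toy_vocab = {"play", "##ing", "the", "kid", "##s", "are", "out", "##side"}

words = "the kids are playing outside".split()
tokens = [piece for word in words for piece in wordpiece(word, toy_vocab)]
print(len(words), len(tokens))  # more tokens than words

# max_len=6 stands in for BERT's 512 so the example stays small.
chunks = chunk_tokens(tokens, max_len=6)
for chunk in chunks:
    print(len(chunk), chunk)
```

With the real tokenizer, the same idea would be to call tokenizer.tokenize(text) (or tokenizer.encode(text, add_special_tokens=False) for ids) on the full sentence first, and then cut windows of 510 tokens before adding the special tokens.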