Python TensorFlow 2.0, Hugging Face Transformers, TFBertForSequenceClassification: unexpected output dimensions at inference


python tensorflow machine-learning nlp


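Summary:

I want to fine-tune a sentence classifier on a custom dataset. I followed some examples I found, which were very helpful, and I have looked at other references as well.

The problem I am having is that when I run inference on some samples, the output does not have the dimensions I expect.

When I run inference on 23 samples, I get a tuple containing a numpy array of shape (1472, 42), where 42 is the number of classes. I expected shape (23, 42).

Code and other details:

I run inference on the trained model with Keras like this: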

preds = model.predict(features)

where features is tokenized and converted to a Dataset:

# Collect the test sentences as InputExample objects before tokenizing them.
test_examples = []
for sample, ground_truth in tests:
    test_examples.append(InputExample(text=sample, category_index=ground_truth))

features = convert_examples_to_tf_dataset(test_examples, tokenizer)

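Here, sample can be, for example, "A test sentence that I want classified", and ground_truth is the encoded label. Since I am only running inference, the ground truth I supply of course does not matter.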

The convert_examples_to_tf_dataset function comes from an example I found. It works as I expect; running print(list(features.as_numpy_iterator())[1]) produces the following result:

({'input_ids': array([  101, 11639, 19962, 23288, 13264, 35372, 10410,   102,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0], dtype=int32), 'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int32), 'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int32)}, 6705)

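So far everything is as I expected, and the tokenizer seems to be working correctly: 3 arrays of length 64 (matching the max length I set), plus the label as an integer.

The model was trained as follows: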

config = BertConfig.from_pretrained(
    'bert-base-multilingual-cased',
    num_labels=len(label_encoder.classes_),
    output_hidden_states=False,
    output_attentions=False
)
model = TFBertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', config=config)

# train_data is then a tf.data.Dataset we can pass to model.fit()
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-05, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy')
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=[metric])

model.summary()

history = model.fit(train_data,
                    epochs=EPOCHS,
                    steps_per_epoch=train_steps,
                    validation_data=val_data,
                    validation_steps=val_steps,
                    shuffle=True,
                    )

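Result

The problem is that when running the prediction

preds = model.predict(features)

the output dimensions do not match what the documentation says:

logits (Numpy array or tf.Tensor of shape (batch_size, config.num_labels)):

What I get is a tuple containing a numpy array of shape (1472, 42).

42 makes sense, since that is my number of classes. I sent 23 samples for the test, and 23 x 64 = 1472. 64 is my maximum sentence length, so that looks familiar. Is this output incorrect? How can I convert it into an actual class prediction for each input sample? I get 1472 predictions where I expected 23.

Please let me know if I can provide more details to help solve this.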

Here is my example, in which I predict on 3 text samples and get (3, 42) as the output shape:

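import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, BertConfig, TFBertForSequenceClassification

### define model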
config = BertConfig.from_pretrained(
    'bert-base-multilingual-cased',
    num_labels=42,
    output_hidden_states=False,
    output_attentions=False
)
model = TFBertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', config=config)

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-05, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy')
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=[metric])

### import tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

### utility functions for text encoding
def return_id(str1, str2, length):

    inputs = tokenizer.encode_plus(str1, str2,
        add_special_tokens=True,
        max_length=length)

    input_ids =  inputs["input_ids"]
    input_masks = [1] * len(input_ids)
    input_segments = inputs["token_type_ids"]

    padding_length = length - len(input_ids)
    padding_id = tokenizer.pad_token_id

    input_ids = input_ids + ([padding_id] * padding_length)
    input_masks = input_masks + ([0] * padding_length)
    input_segments = input_segments + ([0] * padding_length)

    return [input_ids, input_masks, input_segments]

### encode 3 sentences
input_ids, input_masks, input_segments = [], [], []
for instance in ['hello hello', 'ciao ciao', 'marco marco']:

    ids, masks, segments = \
    return_id(instance, None, 100)

    input_ids.append(ids)
    input_masks.append(masks)
    input_segments.append(segments)

input_ = [np.asarray(input_ids, dtype=np.int32), 
          np.asarray(input_masks, dtype=np.int32), 
          np.asarray(input_segments, dtype=np.int32)]

### make prediction
model.predict(input_).shape # ===> (3,42)
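
Once the logits have shape (num_samples, num_labels), a common way to get one class prediction per sample is to take the argmax over the last axis. The sketch below uses a small made-up logits array rather than real model output, and logits_to_class_ids is a hypothetical helper name. Depending on the transformers version, model.predict may return the array directly, a tuple whose first element holds the logits, or an output object with a logits attribute, so pulling the logits out is a step to adapt.

```python
import numpy as np

def logits_to_class_ids(logits):
    """Return the index of the highest-scoring class for each row of logits."""
    logits = np.asarray(logits)
    return np.argmax(logits, axis=-1)

# Made-up logits for 3 samples and 4 classes (illustrative values only).
example_logits = np.array([
    [0.1, 2.3, -1.0, 0.4],
    [1.5, 0.2, 0.3, 0.1],
    [-0.5, -0.2, 0.0, 3.1],
])

class_ids = logits_to_class_ids(example_logits)
print(class_ids)  # [1 0 3]: one integer class index per input sample
```

In the question's setup, label_encoder.inverse_transform(class_ids) would then map the indices back to label names, assuming label_encoder is a fitted scikit-learn LabelEncoder.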

I found the problem: if you get unexpected dimensions when using a TensorFlow dataset (tf.data.Dataset), it may be because you did not call .batch on it.

In my case:

features = convert_examples_to_tf_dataset(test_examples, tokenizer)

Adding:

features = features.batch(BATCH_SIZE)

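made this work as I expected. So this is not a problem with TFBertForSequenceClassification; my input was simply wrong. I would also like to credit the reference that led me to the problem.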
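
A likely explanation for the (1472, 42) shape, consistent with 23 x 64 = 1472: without .batch, each dataset element is a single sentence of shape (64,), and its first axis is treated as the batch axis, so every token position becomes its own "sample". The sketch below imitates this with NumPy only. toy_classifier, toy_predict, and batch are hypothetical stand-ins written for illustration, not TensorFlow or transformers APIs.

```python
import numpy as np

NUM_SAMPLES, MAX_LEN, NUM_LABELS = 23, 64, 42

def toy_classifier(input_ids):
    """Stand-in for the model: one row of logits per entry along axis 0."""
    return np.zeros((input_ids.shape[0], NUM_LABELS))

def toy_predict(elements):
    """Mimic predict(): run the model on each dataset element, then concatenate."""
    return np.concatenate([toy_classifier(e) for e in elements], axis=0)

def batch(elements, batch_size):
    """Group consecutive elements into stacked batches, similar to Dataset.batch."""
    return [np.stack(elements[i:i + batch_size])
            for i in range(0, len(elements), batch_size)]

# Unbatched: 23 elements, each a single tokenized sentence of shape (64,).
unbatched = [np.zeros(MAX_LEN, dtype=np.int32) for _ in range(NUM_SAMPLES)]

unbatched_shape = toy_predict(unbatched).shape
batched_shape = toy_predict(batch(unbatched, 8)).shape

print(unbatched_shape)  # (1472, 42)
print(batched_shape)    # (23, 42)
```

On a real tf.data.Dataset, inspecting features.element_spec before and after calling .batch should show whether the leading batch dimension is present.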
