Python: 2-class classification model, how to evaluate performance
I'm building a fine-tuned BERT model for classification (with a linear layer at the end). The prediction should only be 1/0 (yes/no).

When writing the evaluation part, I saw that some people online apply F.log_softmax to the logits and then use np.argmax to get the predicted label. However, I have also seen people apply np.argmax directly to the logits, without a softmax activation. I'm wondering which one I should follow and how to decide.

Here is my model definition:
import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel


class ReviewClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = 2
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        embedding_size = config.hidden_size
        # One output score (logit) per label: 0 = no, 1 = yes
        self.classifier = nn.Linear(embedding_size, self.num_labels)
        self.init_weights()

    def forward(
        self,
        review_input_ids=None,
        review_attention_mask=None,
        review_token_type_ids=None,
        agent_input_ids=None,
        agent_attention_mask=None,
        agent_token_type_ids=None,
        labels=None,
    ):
        review_outputs = self.bert(
            review_input_ids,
            attention_mask=review_attention_mask,
            token_type_ids=review_token_type_ids,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
        )
        # review_outputs[1] is the pooled [CLS] output: (batch_size, hidden_size).
        # Apply the dropout layer defined in __init__ before the classifier.
        feature = self.dropout(review_outputs[1])
        # nn.CrossEntropyLoss applies F.log_softmax and nn.NLLLoss internally on its input,
        # so pass it the raw logits.
        logits = self.classifier(feature)  # (batch_size, num_labels)
        outputs = (logits,)  # + outputs[2:]  # add hidden states and attention if they are here
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs
        return outputs  # (loss, logits) when labels are given, otherwise (logits,)
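A quick check of the claim in the forward comment that nn.CrossEntropyLoss applies F.log_softmax and nn.NLLLoss internally. This is a minimal sketch on hypothetical random logits and labels (the seed, shapes and label values are arbitrary assumptions, not taken from the question):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 2)           # hypothetical raw scores: (batch_size=4, num_labels=2)
labels = torch.tensor([0, 1, 0, 0])  # hypothetical gold labels

# CrossEntropyLoss on raw logits vs. NLLLoss on log-probabilities
ce = nn.CrossEntropyLoss()(logits, labels)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(ce, nll))

# Feeding probabilities instead of logits applies softmax twice,
# which silently produces a different (wrong) loss value
ce_on_probs = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), labels)
print(ce.item(), ce_on_probs.item())
```

Because of this, the model should return raw logits, and any softmax belongs only in evaluation code that actually needs probabilities.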
Here is my validation code:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.metrics import classification_report
from tqdm import tqdm


def model_validate(model, data_loader):
    # Put the model in evaluation mode: the dropout layers behave differently
    # during evaluation.
    model.eval()
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    label_prop = data_loader.dataset.dataset.label_prop()
    total_valid_loss = 0
    num_batch = len(data_loader)
    y_pred, y_true = [], []
    # Evaluate data
    for step, batch in tqdm(enumerate(data_loader), desc="Validation...", total=num_batch):
        b_review_input_ids = batch["review_input_ids"].to(device)
        b_review_attention_mask = batch["review_attention_mask"].to(device)
        b_review_token_type_ids = batch["review_token_type_ids"].to(device)
        b_binarized_label = batch["binarized_label"].to(device)
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            (loss, logits,) = model(review_input_ids=b_review_input_ids,
                                    review_attention_mask=b_review_attention_mask,
                                    review_token_type_ids=b_review_token_type_ids,
                                    labels=b_binarized_label)
        # Under nn.DataParallel the loss holds one value per GPU; .mean() reduces it to a scalar.
        total_valid_loss += loss.mean().item()
        numpy_logits = logits.detach().cpu().numpy()
        # argmax over the label dimension picks the class with the highest logit
        y_pred.extend(np.argmax(numpy_logits, axis=1).flatten())
        y_true.extend(b_binarized_label.cpu().numpy())
    # End of an epoch of validation
    # put model to train mode again.
    model.train()
    # Each loss value is already averaged over its batch (reduction='mean'),
    # so average over the number of batches only.
    ave_loss = total_valid_loss / num_batch
    # compute the various f1 score for each label
    report = classification_report(y_true, y_pred, output_dict=True)
    metrics_df = pd.DataFrame(report).transpose()
    metrics_df = metrics_df.sort_index()
    weighted_f1_score = metrics_df.loc['weighted avg', 'f1-score']
    averaged_f1_score = metrics_df.loc['macro avg', 'f1-score']
    return ave_loss, metrics_df, {
        "weighted": weighted_f1_score,
        "averaged": averaged_f1_score,
    }
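nn.CrossEntropyLoss averages over the batch by default (reduction='mean'), so each loss value collected in the validation loop is already a per-sample figure for its batch. A small sketch with hypothetical data (6 samples split into 3 batches of 2) showing that averaging the batch losses over the number of batches recovers the per-sample loss, while dividing by num_batch * batch_size shrinks it by a factor of batch_size:

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size = 2
logits = torch.randn(6, 2)                 # hypothetical logits for 6 samples, 2 labels
labels = torch.tensor([0, 1, 0, 0, 1, 0])  # hypothetical gold labels

loss_fct = nn.CrossEntropyLoss()           # reduction='mean' is the default
# Three equal-sized batches, as a DataLoader with batch_size=2 would produce
batch_losses = [
    loss_fct(logits[i:i + batch_size], labels[i:i + batch_size]).item()
    for i in range(0, len(labels), batch_size)
]

mean_over_batches = sum(batch_losses) / len(batch_losses)
per_sample_loss = nn.CrossEntropyLoss(reduction="sum")(logits, labels).item() / len(labels)
divided_by_samples = sum(batch_losses) / (len(batch_losses) * batch_size)

print(math.isclose(mean_over_batches, per_sample_loss, rel_tol=1e-5))
print(math.isclose(divided_by_samples * batch_size, per_sample_loss, rel_tol=1e-5))
```

The two averages agree exactly only when every batch has the same size; with a smaller last batch, the mean over batches is a close approximation.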
Another version I tried is:
transformed_logits = F.log_softmax(logits, dim=1)
numpy_probas = transformed_logits.detach().cpu().numpy()
y_pred.extend(np.argmax(numpy_probas, axis=1).flatten())
y_true.extend(b_binarized_label.cpu().numpy())
The third version I tried is:
transformed_logits = torch.sigmoid(logits)
numpy_probas = transformed_logits.detach().cpu().numpy()
y_pred.extend(np.argmax(numpy_probas, axis=1).flatten())
y_true.extend(b_binarized_label.cpu().numpy())
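On the main question: to pick a label, argmax can be taken on the raw logits directly. softmax and log_softmax preserve the order of the values within each row (log_softmax just subtracts the same constant from every entry of a row), and an element-wise sigmoid is strictly increasing, so none of them can change which column is largest. A minimal sketch on a hypothetical random batch with the same shape as in the question (16 x 2):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(16, 2)  # hypothetical batch: (batch_size=16, num_labels=2)

pred_raw = logits.argmax(dim=1)
pred_softmax = F.softmax(logits, dim=1).argmax(dim=1)
pred_log_softmax = F.log_softmax(logits, dim=1).argmax(dim=1)
pred_sigmoid = torch.sigmoid(logits).argmax(dim=1)

# Each transform preserves the ordering of values within a row,
# so all four variants should produce identical labels.
print(torch.equal(pred_raw, pred_softmax),
      torch.equal(pred_raw, pred_log_softmax),
      torch.equal(pred_raw, pred_sigmoid))
```

softmax is only needed when actual probabilities are wanted (for example, to threshold them or to report a confidence); for the loss, nn.CrossEntropyLoss expects the raw logits.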
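I also don't know how to interpret the results. From what I read online, if I set dim=1 for log_softmax, the probabilities over all classes should sum to 1. However, take the following example:
Here is the logits output (for one batch: batch size = 16, num_labels = 2):
If I first apply softmax, F.log_softmax(logits, dim=1), I get:
The rows don't sum to 1, and they don't look like probabilities to me.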
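F.log_softmax returns log-probabilities, not probabilities: every entry is less than or equal to 0, and it is the exponentials of each row that sum to 1. A sketch on hypothetical logits showing that exponentiating the output gives the same result as F.softmax:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for two examples: (batch_size=2, num_labels=2)
logits = torch.tensor([[1.2, -0.8],
                       [0.3, 0.1]])

log_probs = F.log_softmax(logits, dim=1)  # log-probabilities, all <= 0
probs = log_probs.exp()                   # back to probabilities

print(log_probs.sum(dim=1))  # a sum of logs, so not 1
print(probs.sum(dim=1))      # each row sums to 1 (up to float rounding)
print(torch.allclose(probs, F.softmax(logits, dim=1)))
```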
If I use sigmoid, torch.sigmoid(logits):
It looks more like probabilities, although the rows still don't sum to 1.
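torch.sigmoid applied to a 2-column output squashes each logit independently, so its rows are not a probability distribution. With two classes trained with nn.CrossEntropyLoss, the softmax probability of label 1 equals the sigmoid of the difference between the two logits, sigmoid(z1 - z0). A sketch on hypothetical random logits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 2)  # hypothetical logits: (batch_size=5, num_labels=2)

col_sigmoid = torch.sigmoid(logits)     # each logit squashed on its own
p_yes = F.softmax(logits, dim=1)[:, 1]  # P(label 1) under the softmax the loss assumes
p_yes_from_diff = torch.sigmoid(logits[:, 1] - logits[:, 0])

print(col_sigmoid.sum(dim=1))                  # generally not 1
print(torch.allclose(p_yes, p_yes_from_diff))  # softmax over 2 classes == sigmoid of the logit difference
```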
No matter which version I use, the predictions in this case are always the same (because my 1 (yes) label occurs very rarely):
tensor([[0.7551, 0.1353],
[0.6472, 0.2405],
[0.7969, 0.1184],
[0.8875, 0.0650],
[0.7386, 0.1474],
[0.6638, 0.2377],
[0.6967, 0.2000],
[0.8276, 0.0965],
[0.5287, 0.4172],
[0.8885, 0.0681],
[0.8181, 0.1025],
[0.5278, 0.4232],
[0.7029, 0.1849],
[0.8255, 0.0930],
[0.8910, 0.0658],
[0.6854, 0.2018]], device='cuda:0')
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
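For two classes, taking argmax is the same as predicting "yes" only when the softmax probability of label 1 exceeds 0.5; with a rare positive class the model may never get there, which is consistent with the all-zero predictions above. One option (a suggestion, not something from the question) is to threshold the softmax probability at a value chosen on held-out data; another is to pass class weights through the weight argument of nn.CrossEntropyLoss during training. A sketch of the thresholding idea on hypothetical logits, where the 0.3 threshold is an arbitrary placeholder:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for four reviews: (batch_size=4, num_labels=2)
logits = torch.tensor([[2.0, -1.0],
                       [0.5, 0.2],
                       [0.1, 0.0],
                       [-0.3, 0.4]])
p_yes = F.softmax(logits, dim=1)[:, 1]  # probability of label 1 ("yes")

pred_argmax = logits.argmax(dim=1)      # what the validation loop computes
pred_at_half = (p_yes > 0.5).long()     # same decisions as argmax (exact ties aside)
threshold = 0.3                         # hypothetical value; tune it on validation data
pred_at_threshold = (p_yes > threshold).long()

print(p_yes)
print(pred_argmax.tolist(), pred_at_half.tolist(), pred_at_threshold.tolist())
```

Lowering the threshold trades precision for recall on the "yes" class, so it is worth checking the per-class F1 from classification_report for each candidate value.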