Python 2-class classification model: how to evaluate performance

I am building a fine-tuned BERT model for classification (with a linear layer at the end). The predicted value should only be 1/0 (yes, no).

While writing the evaluation part, I saw some people online apply F.log_softmax to the logits and then use np.argmax to get the predicted labels. However, I have also seen people apply np.argmax directly on the raw logits, without any softmax activation. I would like to know which one I should follow and how to decide.
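As a sanity check, here is a minimal self-contained snippet with made-up logits (illustrative values only, not my real model output) comparing the two routes:

import torch
import torch.nn.functional as F

# Dummy logits for a batch of 4 examples and 2 classes (made-up numbers).
logits = torch.tensor([[ 1.2, -0.3],
                       [-0.5,  0.8],
                       [ 2.1,  1.9],
                       [-1.0, -2.0]])

# softmax and log_softmax are monotonically increasing within each row,
# so the argmax is the same whether taken on raw logits or on their output.
pred_raw = logits.argmax(dim=1)
pred_softmax = F.log_softmax(logits, dim=1).argmax(dim=1)
print(torch.equal(pred_raw, pred_softmax))  # True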

Here is my model definition:

import torch
import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class ReviewClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = 2

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

        embedding_size = config.hidden_size

        # LABEL_NAME is defined elsewhere in my project; len(LABEL_NAME) == 2 here
        self.classifier = nn.Linear(embedding_size, len(LABEL_NAME))
        self.init_weights()

    def forward(
            self,
            review_input_ids=None,
            review_attention_mask=None,
            review_token_type_ids=None,
            agent_input_ids=None,
            agent_attention_mask=None,
            agent_token_type_ids=None,
            labels=None,
    ):
        review_outputs = self.bert(
            review_input_ids,
            attention_mask=review_attention_mask,
            token_type_ids=review_token_type_ids,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
        )

        feature = review_outputs[1]  # pooled [CLS] output, shape (batch_size, hidden_size)

        # nn.CrossEntropyLoss applies F.log_softmax and nn.NLLLoss internally on its input,
        # so the raw logits should be passed to it.
        logits = self.classifier(feature)

        outputs = (logits,)  # + outputs[2:]  # add hidden states and attentions if present

        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss, logits, hidden_states, attentions)
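The claim in the comment above can be checked in isolation. A minimal sketch with random tensors (only the shapes are assumed, nothing comes from the actual model):

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 2)           # (batch_size, num_labels), random stand-in
labels = torch.randint(0, 2, (8,))   # class indices in {0, 1}

# nn.CrossEntropyLoss == F.log_softmax followed by nn.NLLLoss,
# which is why raw logits (no softmax) are passed to it.
ce = nn.CrossEntropyLoss()(logits, labels)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(ce, nll))  # True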
Here is my validation code:

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.metrics import classification_report
from tqdm import tqdm

def model_validate(model, data_loader):
    # Put the model in evaluation mode -- the dropout layers behave differently
    # during evaluation.
    model.eval()
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)

    label_prop = data_loader.dataset.dataset.label_prop()

    total_valid_loss = 0

    batch_size = data_loader.batch_size
    num_batch = len(data_loader)

    y_pred, y_true = [], []

    # Evaluate data
    for step, batch in tqdm(enumerate(data_loader), desc="Validation...", total=num_batch):
        b_review_input_ids = batch["review_input_ids"].to(device)
        b_review_attention_mask = batch["review_attention_mask"].to(device)
        b_review_token_type_ids = batch["review_token_type_ids"].to(device)

        b_binarized_label = batch["binarized_label"].to(device)

        # Tell PyTorch not to bother constructing the compute graph during the
        # forward pass, since it is only needed for backprop (training).
        with torch.no_grad():
            (loss, logits) = model(review_input_ids=b_review_input_ids,
                                   review_attention_mask=b_review_attention_mask,
                                   review_token_type_ids=b_review_token_type_ids,
                                   labels=b_binarized_label)

        total_valid_loss += loss.item()
        numpy_probas = logits.detach().cpu().numpy()
        y_pred.extend(np.argmax(numpy_probas, axis=1).flatten())
        y_true.extend(b_binarized_label.cpu().numpy())
    # End of an epoch of validation

    # Put the model back into train mode.
    model.train()

    ave_loss = total_valid_loss / (num_batch * batch_size)

    # Compute the various F1 scores for each label.
    report = classification_report(y_true, y_pred, output_dict=True)
    metrics_df = pd.DataFrame(report).transpose()
    metrics_df = metrics_df.sort_index()

    weighted_f1_score = metrics_df.loc['weighted avg', 'f1-score']
    averaged_f1_score = metrics_df.loc['macro avg', 'f1-score']

    return ave_loss, metrics_df, {
        "weighted": weighted_f1_score,
        "averaged": averaged_f1_score,
    }
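For reference, I call it like this (valid_loader is just an illustrative name for my validation DataLoader):

# valid_loader: a DataLoader over the validation set (illustrative name).
ave_loss, metrics_df, f1_scores = model_validate(model, valid_loader)
print(f"validation loss: {ave_loss:.4f}")
print(metrics_df)
print(f1_scores["weighted"], f1_scores["averaged"])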
Another version I tried is:

transformed_logits = F.log_softmax(logits, dim=1)  # F is torch.nn.functional
numpy_probas = transformed_logits.detach().cpu().numpy()
y_pred.extend(np.argmax(numpy_probas, axis=1).flatten())
y_true.extend(b_binarized_label.cpu().numpy())
The third version I tried is:

transformed_logits = torch.sigmoid(logits)
numpy_probas = transformed_logits.detach().cpu().numpy()
y_pred.extend(np.argmax(numpy_probas, axis=1).flatten())
y_true.extend(b_binarized_label.cpu().numpy())
I also don't know how to interpret the results. From what I have read online, if I set dim=1 for log_softmax, the probabilities over all classes should sum to 1. However, consider the following example:

Here is the logits output (for one batch, with batch_size = 16 and num_labels = 2):

If I first apply softmax, F.log_softmax(logits, dim=1), I get:

The sum of each row is not equal to 1, and the values do not look like probabilities to me.

If I use sigmoid instead, torch.sigmoid(logits),

the output looks more like probabilities, although the rows still do not sum to 1.
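A small experiment with made-up logits (values are illustrative only) shows what is going on: log_softmax returns log-probabilities, so its rows only sum to 1 after exponentiation, while sigmoid squashes each logit independently, so its row sums need not be 1:

import torch
import torch.nn.functional as F

logits = torch.tensor([[1.5, -0.5],
                       [0.2,  0.9]])  # made-up (batch_size=2, num_labels=2)

log_probs = F.log_softmax(logits, dim=1)
print(log_probs.exp().sum(dim=1))           # tensor([1., 1.]) -- probabilities after exp

print(F.softmax(logits, dim=1).sum(dim=1))  # tensor([1., 1.]) -- already probabilities

print(torch.sigmoid(logits).sum(dim=1))     # not 1: each logit is squashed independently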

Whichever version I use, the predicted result in this case is always the same (because the incidence of my 1 (yes) label is very low):

tensor([[0.7551, 0.1353],
        [0.6472, 0.2405],
        [0.7969, 0.1184],
        [0.8875, 0.0650],
        [0.7386, 0.1474],
        [0.6638, 0.2377],
        [0.6967, 0.2000],
        [0.8276, 0.0965],
        [0.5287, 0.4172],
        [0.8885, 0.0681],
        [0.8181, 0.1025],
        [0.5278, 0.4232],
        [0.7029, 0.1849],
        [0.8255, 0.0930],
        [0.8910, 0.0658],
        [0.6854, 0.2018]], device='cuda:0')
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])