
Python: BERT-based NER classification on a highly imbalanced CoNLL-U dataset


I'm using BERT for NER on an English CoNLL-U dataset. As you can see, it is highly imbalanced:

Counter({'O': 16348, 'B-PER': 980, 'B-ORG': 421, 'B-GPE_LOC': 112, 'B-DRV': 75, 'B-GPE_ORG': 72, 'B-PROD': 43, 'B-LOC': 39, 'B-EVT': 8})
When I split it into training and validation sets

from conllu import parse
from sklearn.model_selection import train_test_split

# Parse the CoNLL-U file into a list of sentences (TokenList objects)
data = parse(open("english.conllu", "r").read())

# 80/20 split at the sentence level
train_df, val_df = train_test_split(data, test_size=0.2)
I get the following distribution:

[training] Counter({'O': 13083, 'B-PER': 795, 'B-ORG': 328, 'B-GPE_LOC': 86, 'B-DRV': 61, 'B-GPE_ORG': 59, 'B-LOC': 31, 'B-PROD': 29, 'B-EVT': 6})
[validation] Counter({'O': 3265, 'B-PER': 185, 'B-ORG': 93, 'B-GPE_LOC': 26, 'B-DRV': 14, 'B-PROD': 14, 'B-GPE_ORG': 13, 'B-LOC': 8, 'B-EVT': 2})
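
For context, per-split counts like these can be obtained by flattening the token-level tags in each split; a minimal sketch, where the "ner" field name is an assumption about how the tags are stored in this CoNLL-U file:

from collections import Counter

def tag_counts(sentences):
    # Flatten the token-level NER tags across all sentences and count them.
    # "ner" is a hypothetical field name; adjust it to the dataset's columns.
    return Counter(token["ner"] for sentence in sentences for token in sentence)

print("[training]", tag_counts(train_df))
print("[validation]", tag_counts(val_df))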
The F1 score is 0.96, but this clearly does not reflect the model's real performance, as you can see here:

              precision    recall  f1-score   support

       B-DRV       0.00      0.00      0.00       136
       B-EVT       0.00      0.00      0.00        36
   B-GPE_LOC       0.00      0.00      0.00       472
   B-GPE_ORG       0.00      0.00      0.00        83
       B-LOC       0.00      0.00      0.00       140
       B-ORG       0.00      0.00      0.00       623
       B-PER       0.00      0.00      0.00       949
      B-PROD       0.00      0.00      0.00       164
       I-DRV       0.00      0.00      0.00        25
       I-EVT       0.00      0.00      0.00        17
   I-GPE_LOC       0.00      0.00      0.00        49
   I-GPE_ORG       0.00      0.00      0.00         6
       I-LOC       0.00      0.00      0.00        61
       I-ORG       0.00      0.00      0.00       217
       I-PER       0.00      0.00      0.00       520
      I-PROD       0.00      0.00      0.00       172
           O       0.93      1.00      0.97     52359

    accuracy                           0.93     56029
   macro avg       0.05      0.06      0.06     56029
weighted avg       0.87      0.93      0.90     56029
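
For reference, a per-class report like the one above can be produced with sklearn's classification_report; a minimal, self-contained sketch with dummy labels, since the real y_true/y_pred come from the validation loop with padding and special tokens filtered out:

from sklearn.metrics import classification_report

# Dummy flat per-token label lists, one entry per token (hypothetical data)
y_true = ["O", "B-PER", "O", "B-ORG", "O"]
y_pred = ["O", "O", "O", "B-ORG", "O"]
print(classification_report(y_true, y_pred, zero_division=0))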
My loss function and optimizer are:

from transformers import AdamW

criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
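
For context, here is a minimal sketch of how these fit into a token-classification training step, assuming a BertForTokenClassification model and a hypothetical batch dict holding input_ids, attention_mask, and labels:

import torch.nn as nn
from transformers import AdamW, BertForTokenClassification

# Hypothetical setup: 17 labels matches the tag inventory in the report above
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=17)
criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)

def training_step(batch):
    # batch is a hypothetical dict of tensors from the dataloader
    logits = model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"]).logits
    # Flatten (batch, seq_len, num_labels) -> (batch*seq_len, num_labels)
    loss = criterion(logits.view(-1, model.num_labels), batch["labels"].view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()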
How can I start to address this imbalance? I tried computing class weights and passing them to the cross entropy like this:

import math
import numpy as np
import torch
import torch.nn as nn
from collections import Counter

idx2label = train_dataset.inverse_indexer
count_label = dict(Counter(train_dataset.flat))

# Inverse frequency for each label
inverse_counts = {key: 1. / value for key, value in count_label.items()}

# Normalize the inverse frequencies so they sum to 1
sum_inverse = np.sum(list(inverse_counts.values()))
inverse_normalized = {key: value / sum_inverse for key, value in inverse_counts.items()}

# Add 0.3 and take the log; note that log(x) is negative for x < 1,
# so this can produce negative class weights
weights = np.array([0.3 + inverse_normalized[idx2label[i]] for i in range(len(idx2label))])
weights = torch.Tensor([math.log(w) for w in weights])

criterion = nn.CrossEntropyLoss(ignore_index=-1, weight=weights)

But the results are the same.
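
For comparison, a standard way to get inverse-frequency class weights is sklearn's compute_class_weight; a minimal sketch, assuming train_dataset.flat is the flat list of training tag strings as above:

import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

labels = np.array(train_dataset.flat)   # flat list of tag strings (as above)
classes = np.unique(labels)
# "balanced" weights: n_samples / (n_classes * count(class))
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
weights = torch.tensor(weights, dtype=torch.float)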