How to get better precision and recall with an imbalanced dataset in Python


python machine-learning


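I am working on a Medicare fraud detection model. The data is extremely imbalanced: 14 fraudulent (positive) cases and roughly 1 million non-fraudulent cases. I started with 8 features, but after one-hot encoding the categorical variables I have 103 features (because there are 94 unique provider types). I built a pipeline that combines a logistic regression classifier with SMOTE: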

##########
# Use a pipeline: combination of SMOTE and a logistic regression model
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline is needed: SMOTE is a resampler, not a transformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Define which resampling method and which ML model to use in the pipeline
resampling = SMOTE(random_state=27, sampling_strategy="minority")
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])

# Split the data X and y into a training and a test set, then fit the pipeline on the training data
# (PartB_encoded is the one-hot encoded DataFrame described above)
y = PartB_encoded['Is_fraud']
X = PartB_encoded.drop(['Is_fraud'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27)
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
print("Accuracy score: ", accuracy_score(y_true=y_test, y_pred=predicted))
print("Precision score: ", precision_score(y_true=y_test, y_pred=predicted))
print("Recall score: ", recall_score(y_true=y_test, y_pred=predicted))

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)

Here is my output:

Accuracy score:  0.9333130935552119
Precision score:  2.3716352424997034e-05
Recall score:  0.09090909090909091
Classification report:
               precision    recall  f1-score   support

       False       1.00      0.93      0.97    632407
        True       0.00      0.09      0.00        11

    accuracy                           0.93    632418
   macro avg       0.50      0.51      0.48    632418
weighted avg       1.00      0.93      0.97    632418

Confusion matrix:
 [[590243  42164]
 [    10      1]]

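Clearly, my precision and recall are far too low to be acceptable. How can I improve them? I am considering undersampling, but if I cut the negative class from about 1 million records down to 14 to match the positive class, I worry that I would be throwing away too much data. I am also considering dropping features, but I am not sure how to decide which ones to drop.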
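As a quick check on where these scores come from, they can be recomputed by hand from the confusion matrix printed above. This is only a small arithmetic sketch; the variable names are illustrative and not part of the original code.

```python
# Confusion matrix from the output above:
# rows are the true classes (False, True), columns are the predicted classes.
tn, fp = 590243, 42164
fn, tp = 10, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of flagged cases that are really fraud
recall = tp / (tp + fn)     # share of real fraud cases that were flagged

print(f"accuracy={accuracy:.4f}  precision={precision:.3e}  recall={recall:.4f}")
```

The 42,164 false positives are what push precision close to zero: only 1 of the 42,165 cases flagged as fraud is a true positive, while accuracy stays high because the negative class dominates.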

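We ran into a similar problem in financial fraud detection, where actual fraud usually makes up less than 0.1% of the data. You have to undersample the majority class while making sure the various subgroups inside it stay represented. So first cluster the majority class, then sample from each cluster to build a reduced majority set. Try majority-to-minority ratios such as 80:20, 90:10, and so on until you reach respectable precision and recall. Oversampling techniques like SMOTE are not really advisable, because in most cases the synthetically generated data will differ from the real data.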
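The cluster-then-sample idea described above could be sketched roughly as follows. This is a minimal illustration, not the answerer's actual code: the function name `cluster_undersample`, the `majority_share` parameter, the use of `KMeans` with 5 clusters, and the synthetic demo data are all assumptions made for the example. In practice the resampling would be applied to the training split only, so the test set keeps its natural class distribution.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_undersample(X, y, majority_share=0.9, n_clusters=5, random_state=0):
    """Keep every minority row; sample the majority class cluster by cluster.

    majority_share=0.9 targets roughly a 90:10 majority:minority mix.
    (Hypothetical helper written for illustration.)
    """
    X = np.asarray(X)
    y = np.asarray(y).astype(bool)
    rng = np.random.default_rng(random_state)

    minority_idx = np.flatnonzero(y)
    majority_idx = np.flatnonzero(~y)

    # Number of majority rows needed to reach the requested class mix.
    target = int(round(len(minority_idx) * majority_share / (1.0 - majority_share)))
    target = min(target, len(majority_idx))

    # Cluster the majority class only, so each subgroup can be sampled separately.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = kmeans.fit_predict(X[majority_idx])

    kept = []
    for c in range(n_clusters):
        members = majority_idx[labels == c]
        if len(members) == 0:
            continue
        # Proportional share of the target, but at least one row per cluster.
        n_take = max(1, int(round(target * len(members) / len(majority_idx))))
        n_take = min(n_take, len(members))
        kept.append(rng.choice(members, size=n_take, replace=False))

    keep_idx = np.concatenate([minority_idx] + kept)
    return X[keep_idx], y[keep_idx]


# Synthetic stand-in data: 5000 rows, 14 of them positive.
demo_rng = np.random.default_rng(27)
X_demo = demo_rng.normal(size=(5000, 4))
y_demo = np.zeros(5000, dtype=bool)
y_demo[:14] = True

X_small, y_small = cluster_undersample(X_demo, y_demo, majority_share=0.9)
print("minority rows:", int(y_small.sum()), "majority rows:", int((~y_small).sum()))
```

Changing `majority_share` to 0.8 gives the 80:20 variant. Because each cluster's quota is rounded separately, the number of majority rows kept can differ from the target by a few rows.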

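Although there are techniques for dealing with imbalanced datasets, I don't think they will work in your case. 1 million vs. 14: not only is the data imbalanced, but 14 examples are too few to learn from. As @Wazaki said, you have to collect (or even simulate yourself) more fraud data; remember, ML is not magic.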