
How to get better precision and recall on an imbalanced dataset in Python
python, machine-learning, classification, imbalanced-data

I am working on a Medicare fraud detection model. The data is extremely imbalanced: 14 fraudulent (positive) cases versus roughly 1 million non-fraudulent cases. I originally had 8 features, but after one-hot encoding the categorical variables I have 103 features (because there are 94 unique provider types). I created a pipeline that combines SMOTE with a logistic regression classifier:


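from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts resampling steps such as SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Use pipeline - combination of SMOTE and logistic regression model
# Define which resampling method and which ML model to use in the pipeline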

resampling = SMOTE(random_state = 27, sampling_strategy = "minority")
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])

# Split your data X and y, into a training and a test set and fit the pipeline onto the training data
y = PartB_encoded['Is_fraud']
X = PartB_encoded.drop(['Is_fraud'], axis = 1)       
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27)
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
print("Accuracy score: ", accuracy_score(y_true = y_test, y_pred = predicted))  
print("Precision score: ", precision_score(y_true = y_test, y_pred=predicted)) 
print("Recall score: ", recall_score(y_true = y_test, y_pred= predicted)) 

# Obtain the results from the classification report and confusion matrix 
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)

Accuracy score:  0.9333130935552119
Precision score:  2.3716352424997034e-05
Recall score:  0.09090909090909091
Classification report:
               precision    recall  f1-score   support

       False       1.00      0.93      0.97    632407
        True       0.00      0.09      0.00        11

    accuracy                           0.93    632418
   macro avg       0.50      0.51      0.48    632418
weighted avg       1.00      0.93      0.97    632418

Confusion matrix:
 [[590243  42164]
 [    10      1]]
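
As a sanity check, the reported scores can be recomputed by hand from the confusion matrix above. The following is a minimal sketch in plain Python, assuming scikit-learn's layout in which rows are true labels and columns are predicted labels, so the matrix reads [[TN, FP], [FN, TP]]. The variable names are illustrative and not part of the original code.

```python
# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]]
conf_mat = [[590243, 42164],
            [10, 1]]

tn, fp = conf_mat[0]
fn, tp = conf_mat[1]
total = tn + fp + fn + tp

# Accuracy is dominated by the huge negative class.
accuracy = (tp + tn) / total

# Precision: of everything flagged as fraud, how much really was fraud.
precision = tp / (tp + fp)

# Recall: of the actual fraud cases, how many were caught.
recall = tp / (tp + fn)

print("Accuracy: ", accuracy)
print("Precision:", precision)
print("Recall:   ", recall)
```

The single true positive is buried under 42,164 false positives, which is why precision is close to zero even though accuracy looks high. With only 11 positive cases in the test set, each additional detected case moves recall by about 9 percentage points, so the recall estimate itself is very coarse.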



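Although there are techniques for handling imbalanced datasets, I don't think they will work in your case. 1 million vs. 14: the data is not only imbalanced, 14 positive cases are simply too few to learn from. As @Wazaki said, you have to collect (or even simulate yourself) more fraud data. Remember that ML is not magic.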