Do the model's learning curves show overfitting?

Tags: model, scikit-learn, curve, variance

I am trying to find out whether my (binary) classification model has an overfitting problem, so I obtained its learning curves. The dataset has 6836 instances, of which 1006 belong to the positive class.

1) If I use SMOTE to balance the classes and Random Forest as the technique, I get this curve, with these rates: TPR = 0.887 and FPR = 0.041:

Note that the training error is flat and almost 0.
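For reference, a curve like this can be produced roughly as follows (a simplified, self-contained sketch on synthetic data rather than my real set, not my exact code; it assumes the imbalanced-learn package for SMOTE and scikit-learn's learning_curve helper):

# Simplified sketch of scenario 1: SMOTE + Random Forest + learning curve.
# X_demo / y_demo are synthetic stand-ins with roughly my class ratio (1006 positives out of 6836).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
from imblearn.over_sampling import SMOTE   # assumes imbalanced-learn is installed

X_demo, y_demo = make_classification(n_samples=6836, n_features=20,
                                     weights=[0.85], random_state=0)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_demo, y_demo)  # balance the classes

train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_bal, y_bal, cv=10, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 10))

# Plot mean error (1 - accuracy) for the training folds and the cross-validation folds
plt.plot(train_sizes, 1 - train_scores.mean(axis=1), 'o-', label='training error')
plt.plot(train_sizes, 1 - test_scores.mean(axis=1), 'o-', label='cross-validation error')
plt.xlabel('training examples')
plt.ylabel('error')
plt.legend()
plt.show()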

2) If I use the function "balanced_subsample" (attached at the end) to balance the classes and RandomForest as the technique, I get this curve, with these rates: TPR = 0.866 and FPR = 0.14:

Note that in this case the test error is flat.

  • Does the model have an overfitting problem?
  • Which of the two makes more sense?
The "balanced_subsample" function:
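(The body of the function did not make it into this post. The sketch below only illustrates the usual idea, undersampling every class down to the size of the smallest one; it is NOT the exact function used here.)

# Illustrative sketch only -- NOT the exact "balanced_subsample" used in the question.
# It undersamples every class down to the size of the smallest class.
import numpy as np

def balanced_subsample_sketch(X, y, random_state=None):
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()                                  # size of the smallest class
    keep = np.concatenate([rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
                           for c in classes])
    rng.shuffle(keep)
    return X[keep], y[keep]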

EDIT2: In this case I tried a Gradient Boosting Classifier (GBC) in 3 scenarios: 1) GBC + SMOTE, 2) GBC + SMOTE + feature selection, and 3) GBC + SMOTE + feature selection + normalization.

#Imports assumed by this snippet (not shown in the original post; module paths for recent scikit-learn versions)
#(data, vectorize_attributes and mySMOTEfunc/balanced_subsample are my own data set and helper functions)
import numpy as np
from time import time
from sklearn import preprocessing, metrics
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X = data
y = X.pop('myclass')

#There are categorical and numerical attributes in my data set, so here I vectorize the categorical attributes
arrX = vectorize_attributes(X)

#FOR SCENARIO 3: Normalization
#(note: preprocessing.normalize rescales each sample/row to unit norm; it is not per-feature standardization)
standardized_X = preprocessing.normalize(arrX)

#FOR SCENARIOS 2 AND 3: Removing all but the k highest scoring features
arrX_features_selected = SelectKBest(chi2, k=5).fit_transform(standardized_X , y)

#Here I use some code to balance the classes using SMOTE or the "balanced_subsample" approach
X_train_balanced, y_train_balanced=mySMOTEfunc(arrX_features_selected , y)
#X_train_balanced, y_train_balanced=balanced_subsample(arrX_features_selected , y) 

#TRAIN/TEST SPLIT (the stratified k-fold happens later, inside GridSearchCV via cv=10)
X_train,X_test,y_train,y_test = train_test_split(X_train_balanced,y_train_balanced,test_size=0.25)

#Estimator
clf=RandomForestClassifier(random_state=np.random.seed())  # note: np.random.seed() returns None, so this is effectively random_state=None (not reproducible)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}

#Grid search (score_func is wired in via make_scorer; in the original post it was defined but never used)
score_func = metrics.f1_score
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10,
                      scoring=metrics.make_scorer(score_func))
start = time()
CV_clf.fit(X_train, y_train)

#FIT & PREDICTION
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)
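Note that the block above still shows RandomForestClassifier as the estimator; for the three GBC scenarios only the estimator and its grid are swapped, roughly like this (a sketch, with example grid values rather than necessarily the exact ones I used):

# Sketch of the estimator swap for the GBC scenarios (example grid values only)
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(random_state=42)
param_grid = {'n_estimators': [100, 200, 300],
              'learning_rate': [0.05, 0.1, 0.2],
              'max_depth': [2, 3, 4]}
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10,
                      scoring=metrics.make_scorer(metrics.f1_score))
CV_clf.fit(X_train, y_train)
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)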
The learning curves for the 3 proposed scenarios are shown below:

Scenario 1: GBC + SMOTE

Scenario 2: GBC + SMOTE + feature selection

Scenario 3: GBC + SMOTE + feature selection + normalization
So, your first curve makes sense. As the number of training points increases, you expect the test error to decrease. With a random forest that has no maximum depth and 100% max samples, you expect a training error consistently close to 0. You may already be overfitting, but with RandomForests (or, depending on the dataset, any other method) it may not get any better than this.
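A quick toy check of that last point, on synthetic data rather than your dataset:

# Toy check: an unconstrained random forest essentially memorizes its training set
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=5000, n_features=20, weights=[0.85], random_state=0)
rf = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=0).fit(X_toy, y_toy)
print('training accuracy:', rf.score(X_toy, y_toy))   # typically ~0.999+, i.e. training error near 0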


Your second curve does not make sense. You should again be getting a training error close to 0, unless something completely wonky is going on (like a truly broken input set). I don't see anything wrong with your code, and I ran your function; it seems to work fine. There is not much more I can do unless you post a complete working example with your code.

Comments:
  • Please provide more code. I would like to see how you do your train/test split and your training/testing.
  • Hi Andreus, please check my EDIT 1, where you can find more details about the process. Many thanks!
  • Thanks Andreus. I have tried using GBC, so I would really appreciate it if you could check the learning curves in EDIT2 above and tell me whether the model is overfitting. As I see it, those curves look much better, and in my opinion scenario 1 is fine and does not overfit. What do you think?
  • At 8000 training points, that model is fairly well balanced between bias and variance. It looks very close to its asymptote, which means it is not going to overfit much more than it already does.
  • Thank you very much for your comments, Andreus.

The code from EDIT 1, referenced in the comments above (train/test split and training):
#Imports assumed by this snippet (not shown in the original post; module paths for recent scikit-learn versions)
#(data, vectorize_attributes and mySMOTEfunc/balanced_subsample are my own data set and helper functions)
import numpy as np
from time import time
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X = data
y = X.pop('myclass')


#There are categorical and numerical attributes in my data set, so here I vectorize the categorical attributes
arrX = vectorize_attributes(X)

#Here I use some code to balance the classes using SMOTE or the "balanced_subsample" approach
X_train_balanced, y_train_balanced=mySMOTEfunc(arrX, y)
#X_train_balanced, y_train_balanced=balanced_subsample(arrX, y) 

#TRAIN/TEST SPLIT (the stratified k-fold happens later, inside GridSearchCV via cv=10)
X_train,X_test,y_train,y_test = train_test_split(X_train_balanced,y_train_balanced,test_size=0.25)

#Estimator
clf=RandomForestClassifier(random_state=np.random.seed())  # note: np.random.seed() returns None, so this is effectively random_state=None (not reproducible)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}

#Grid search (score_func is wired in via make_scorer; in the original post it was defined but never used)
score_func = metrics.f1_score
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10,
                      scoring=metrics.make_scorer(score_func))
start = time()
CV_clf.fit(X_train, y_train)

#FIT & PREDICTION
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)
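The TPR and FPR figures quoted in the question are obtained from the confusion matrix of the held-out split, roughly like this (a short sketch, not necessarily my exact code; y_test and y_pred come from the block above):

# Sketch: TPR and FPR from the confusion matrix of the held-out split
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
tpr = tp / float(tp + fn)   # true positive rate (recall of the positive class)
fpr = fp / float(fp + tn)   # false positive rate
print('TPR = %.3f, FPR = %.3f' % (tpr, fpr))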