Do the model's learning curves show overfitting?

Tags: model, scikit-learn, curve, variance

I am trying to find out whether my (binary) classification model has an overfitting problem, so I obtained its learning curves. The dataset has 6836 instances, of which 1006 belong to the positive class.

1) If I use SMOTE to balance the classes and Random Forest as the technique, I get this curve, with these rates: TPR = 0.887 and FPR = 0.041:

Note that the training error is flat and almost 0.
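For reference, a curve like this can be produced roughly as follows (a simplified, self-contained sketch on synthetic data rather than my real set, not my exact code; it assumes the imbalanced-learn package for SMOTE and scikit-learn's learning_curve helper):

# Simplified sketch of scenario 1: SMOTE + Random Forest + learning curve.
# X_demo / y_demo are synthetic stand-ins with roughly my class ratio (1006 positives out of 6836).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
from imblearn.over_sampling import SMOTE   # assumes imbalanced-learn is installed

X_demo, y_demo = make_classification(n_samples=6836, n_features=20,
                                     weights=[0.85], random_state=0)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_demo, y_demo)  # balance the classes

train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_bal, y_bal, cv=10, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 10))

# Plot mean error (1 - accuracy) for the training folds and the cross-validation folds
plt.plot(train_sizes, 1 - train_scores.mean(axis=1), 'o-', label='training error')
plt.plot(train_sizes, 1 - test_scores.mean(axis=1), 'o-', label='cross-validation error')
plt.xlabel('training examples')
plt.ylabel('error')
plt.legend()
plt.show()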

2) If I use the function "balanced_subsample" (attached at the end) to balance the classes and RandomForest as the technique, I get this curve, with these rates: TPR = 0.866 and FPR = 0.14:

Note that in this case the test error is flat.

  • Does the model have an overfitting problem?
  • Which of the two makes more sense?
The "balanced_subsample" function:
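(The body of the function did not make it into this post. The sketch below only illustrates the usual idea, undersampling every class down to the size of the smallest one; it is NOT the exact function used here.)

# Illustrative sketch only -- NOT the exact "balanced_subsample" used in the question.
# It undersamples every class down to the size of the smallest class.
import numpy as np

def balanced_subsample_sketch(X, y, random_state=None):
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()                                  # size of the smallest class
    keep = np.concatenate([rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
                           for c in classes])
    rng.shuffle(keep)
    return X[keep], y[keep]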

EDIT2: In this case I tried a Gradient Boosting Classifier (GBC) in 3 scenarios: 1) GBC + SMOTE, 2) GBC + SMOTE + feature selection, and 3) GBC + SMOTE + feature selection + normalization.

#Imports assumed by this snippet (not shown in the original post; module paths for recent scikit-learn versions)
#(data, vectorize_attributes and mySMOTEfunc/balanced_subsample are my own data set and helper functions)
import numpy as np
from time import time
from sklearn import preprocessing, metrics
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X = data
y = X.pop('myclass')

#There are categorical and numerical attributes in my data set, so here I vectorize the categorical attributes
arrX = vectorize_attributes(X)

#FOR SCENARIO 3: Normalization
#(note: preprocessing.normalize rescales each sample/row to unit norm; it is not per-feature standardization)
standardized_X = preprocessing.normalize(arrX)

#FOR SCENARIOS 2 AND 3: Removing all but the k highest scoring features
arrX_features_selected = SelectKBest(chi2, k=5).fit_transform(standardized_X , y)

#Here I use some code to balance the classes using SMOTE or the "balanced_subsample" approach
X_train_balanced, y_train_balanced=mySMOTEfunc(arrX_features_selected , y)
#X_train_balanced, y_train_balanced=balanced_subsample(arrX_features_selected , y) 

#TRAIN/TEST SPLIT (the stratified k-fold happens later, inside GridSearchCV via cv=10)
X_train,X_test,y_train,y_test = train_test_split(X_train_balanced,y_train_balanced,test_size=0.25)

#Estimator
clf=RandomForestClassifier(random_state=np.random.seed())  # note: np.random.seed() returns None, so this is effectively random_state=None (not reproducible)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}

#Grid search (score_func is wired in via make_scorer; in the original post it was defined but never used)
score_func = metrics.f1_score
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10,
                      scoring=metrics.make_scorer(score_func))
start = time()
CV_clf.fit(X_train, y_train)

#FIT & PREDICTION
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)
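Note that the block above still shows RandomForestClassifier as the estimator; for the three GBC scenarios only the estimator and its grid are swapped, roughly like this (a sketch, with example grid values rather than necessarily the exact ones I used):

# Sketch of the estimator swap for the GBC scenarios (example grid values only)
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(random_state=42)
param_grid = {'n_estimators': [100, 200, 300],
              'learning_rate': [0.05, 0.1, 0.2],
              'max_depth': [2, 3, 4]}
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10,
                      scoring=metrics.make_scorer(metrics.f1_score))
CV_clf.fit(X_train, y_train)
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)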
The learning curves for the 3 proposed scenarios are shown below:

Scenario 1: GBC + SMOTE

Scenario 2: GBC + SMOTE + feature selection

Scenario 3: GBC + SMOTE + feature selection + normalization
So, your first curve makes sense. As the number of training points increases, you expect the test error to decrease. With a random forest that has no maximum depth and 100% max samples, you expect a training error consistently close to 0. You may already be overfitting, but with RandomForests (or, depending on the dataset, any other method) it may not get any better than this.
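A quick toy check of that last point, on synthetic data rather than your dataset:

# Toy check: an unconstrained random forest essentially memorizes its training set
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=5000, n_features=20, weights=[0.85], random_state=0)
rf = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=0).fit(X_toy, y_toy)
print('training accuracy:', rf.score(X_toy, y_toy))   # typically ~0.999+, i.e. training error near 0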


Your second curve does not make sense. You should again be getting a training error close to 0, unless something completely wonky is going on (like a truly broken input set). I don't see anything wrong with your code, and I ran your function; it seems to work fine. There is not much more I can do unless you post a complete working example with your code.

Comments:
  • Please provide more code. I would like to see how you do your train/test split and your training/testing.
  • Hi Andreus, please check my EDIT 1, where you can find more details about the process. Many thanks!
  • Thanks Andreus. I have tried using GBC, so I would really appreciate it if you could check the learning curves in EDIT2 above and tell me whether the model is overfitting. As I see it, those curves look much better, and in my opinion scenario 1 is fine and does not overfit. What do you think?
  • At 8000 training points, that model is fairly well balanced between bias and variance. It looks very close to its asymptote, which means it is not going to overfit much more than it already does.
  • Thank you very much for your comments, Andreus.

The code from EDIT 1, referenced in the comments above (train/test split and training):
#Imports assumed by this snippet (not shown in the original post; module paths for recent scikit-learn versions)
#(data, vectorize_attributes and mySMOTEfunc/balanced_subsample are my own data set and helper functions)
import numpy as np
from time import time
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X = data
y = X.pop('myclass')


#There are categorical and numerical attributes in my data set, so here I vectorize the categorical attributes
arrX = vectorize_attributes(X)

#Here I use some code to balance the classes using SMOTE or the "balanced_subsample" approach
X_train_balanced, y_train_balanced=mySMOTEfunc(arrX, y)
#X_train_balanced, y_train_balanced=balanced_subsample(arrX, y) 

#TRAIN/TEST SPLIT (the stratified k-fold happens later, inside GridSearchCV via cv=10)
X_train,X_test,y_train,y_test = train_test_split(X_train_balanced,y_train_balanced,test_size=0.25)

#Estimator
clf=RandomForestClassifier(random_state=np.random.seed())  # note: np.random.seed() returns None, so this is effectively random_state=None (not reproducible)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}

#Grid search (score_func is wired in via make_scorer; in the original post it was defined but never used)
score_func = metrics.f1_score
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10,
                      scoring=metrics.make_scorer(score_func))
start = time()
CV_clf.fit(X_train, y_train)

#FIT & PREDICTION
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)
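The TPR and FPR figures quoted in the question are obtained from the confusion matrix of the held-out split, roughly like this (a short sketch, not necessarily my exact code; y_test and y_pred come from the block above):

# Sketch: TPR and FPR from the confusion matrix of the held-out split
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
tpr = tp / float(tp + fn)   # true positive rate (recall of the positive class)
fpr = fp / float(fp + tn)   # false positive rate
print('TPR = %.3f, FPR = %.3f' % (tpr, fpr))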