Python: suspiciously good cross-validation results using scikit-learn GridSearchCV with a predefined split


I want to perform a grid search with scikit-learn's `GridSearchCV`, computing the cross-validation error on a predefined development/validation split (one-fold cross-validation).

I'm afraid I've done something wrong, because my validation accuracy is suspiciously high. Here is where I think I went wrong: I split my training data into a development set and a validation set, train on the development set, and record the cross-validation score on the validation set. My accuracy may be inflated because I am effectively training on a mix of the development and validation sets and then testing on the validation set. I'm not sure whether I'm using scikit-learn's `PredefinedSplit` correctly. Details below.

Here is what I did:

    import numpy as np
    from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
    from sklearn.metrics import accuracy_score

    # I split up my data into training and test sets. 
    X_train, X_test, y_train, y_test = train_test_split(
        data[training_features], data[training_response], test_size=0.2, random_state=550)

    # sanity check - dimensions of training and test splits
    print(X_train.shape)
    print(X_test.shape)
    print(y_train.shape)
    print(y_test.shape)

    # dimensions of X_train and y_train are (323430, 26) and (323430, 1) respectively
    # dimensions of X_test and y_test are (80858, 26) and (80858, 1)

    ''' Now, I define indices for a pre-defined split. 
    this is a 323430 dimensional array, where the indices for the development
    set are set to -1, and the indices for the validation set are set to 0.'''

    validation_idx = np.repeat(-1, y_train.shape[0])
    np.random.seed(550)
    validation_idx[np.random.choice(validation_idx.shape[0], 
           int(round(.2*validation_idx.shape[0])), replace = False)] = 0

    # Now, create a list which contains a single tuple of two elements, 
    # which are arrays containing the indices for the development and
    # validation sets, respectively.
    validation_split = list(PredefinedSplit(validation_idx).split())

    # sanity check
    print(len(validation_split[0][0])) # outputs 258744 
    print(len(validation_split[0][0]) / float(validation_idx.shape[0])) # outputs 0.8
    print(validation_idx.shape[0] == y_train.shape[0]) # True
    print(set(validation_split[0][0]).intersection(set(validation_split[0][1]))) # set([]) 
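Before going further, it may help to confirm what `PredefinedSplit` does with these markers. This toy sketch (my own example, not from the question) shows that entries equal to -1 are always kept in the training (development) fold, while entries equal to 0 form test fold number 0:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy example: 6 samples; -1 means "always in the training (development) fold",
# 0 means "in test fold number 0" (the validation fold).
test_fold = np.array([-1, -1, 0, -1, 0, -1])
ps = PredefinedSplit(test_fold)

splits = list(ps.split())
print(ps.get_n_splits())  # 1 -- a single dev/validation split
train_idx, val_idx = splits[0]
print(train_idx)          # [0 1 3 5]
print(val_idx)            # [2 4]
```

So the construction above yields exactly one (development, validation) index pair, which is what a one-fold cross-validation needs.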
Now I run the grid search with `GridSearchCV`. My intent is to fit a model on the development set for each parameter combination in the grid, and to record the cross-validation score obtained when the resulting estimator is applied to the validation set.
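The question does not show the actual `GridSearchCV` call, so here is a self-contained sketch of what that step presumably looks like, on synthetic data and with an illustrative estimator and parameter grid (all assumptions on my part, not taken from the original code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic stand-in for the question's training data (26 features, as above).
X, y = make_classification(n_samples=1000, n_features=26, random_state=550)

# Mark 20% of the rows as the validation fold (0), the rest as development (-1).
test_fold = np.repeat(-1, y.shape[0])
rng = np.random.RandomState(550)
test_fold[rng.choice(test_fold.shape[0], 200, replace=False)] = 0

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),   # illustrative estimator
    param_grid={"C": [0.1, 1.0, 10.0]},            # illustrative grid
    cv=PredefinedSplit(test_fold),                 # the single predefined split
)
grid_result = grid.fit(X, y)
print(grid_result.best_params_)
```

Note that with the default `refit=True`, `best_estimator_` is refit on all rows passed to `fit`, development and validation alike.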

And here a red flag goes up for me. I use the best estimator found by the grid search to compute the accuracy on the validation set, and it is very high: 0.8920786868939176. Worse, it is almost identical to the accuracy I get when I apply the classifier to the development set I just trained on: 0.89295597192591902. But when I apply the classifier to the true test set, the accuracy is much lower, about 0.78:

    # accuracy score on the validation set. This yields .89207865
    accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][1]]),
                   y_true=y_train.iloc[validation_split[0][1]])

    # accuracy score when applied to the development set. This yields .8929559
    accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][0]]),
                   y_true=y_train.iloc[validation_split[0][0]])

    # finally, the score when applied to the test set. This yields .783
    accuracy_score(y_pred=grid_result2.predict(X_test), y_true=y_test)
To me, the near-perfect agreement between the model's accuracy on the development and validation sets, contrasted with its much lower accuracy on the test set, is a clear sign that I am inadvertently training on the validation data, and that my cross-validation score is therefore not representative of the model's true accuracy.

I can't figure out where I went wrong, mostly because I don't know what `GridSearchCV` does under the hood when it receives a `PredefinedSplit` object as the argument to its `cv` parameter.

Any idea where I went wrong? Let me know if you need more detail; the code is included above.

Thanks!

You need to set `refit=False` (it is not the default); otherwise, once the search is finished, the grid search refits the estimator on the whole dataset, ignoring `cv`.

So yes, there is data leakage into the validation data here. Set `refit=False` on `GridSearchCV` and it will not refit on the full data, i.e. the training and validation sets combined.

After fitting, also look at the `cv_results_` attribute of `GridSearchCV`: it reports the train- and test-fold scores for each fold (a single fold, in your case).
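To illustrate, here is a minimal, self-contained sketch (synthetic data and an illustrative parameter grid, not the asker's actual setup) of running the search with `refit=False` and reading the per-fold scores out of `cv_results_`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
test_fold = np.repeat(-1, 500)
test_fold[:100] = 0  # first 100 rows form the validation fold

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0]},
    cv=PredefinedSplit(test_fold),
    refit=False,                # do NOT refit on the full data afterwards
    return_train_score=True,    # also record development-fold scores
)
grid.fit(X, y)

# One entry per parameter combination; "split0_*" is the single predefined fold.
print(grid.cv_results_["split0_test_score"])   # validation-fold accuracy
print(grid.cv_results_["split0_train_score"])  # development-fold accuracy
print(grid.best_params_)
```

Keep in mind that with `refit=False` there is no `best_estimator_` to call `predict` on; to evaluate on the held-out test set, refit an estimator yourself on the development rows only, using `best_params_`.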