How to nest LabelKFold? (Python, scikit-learn, cross-validation)


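I have a dataset of about 300 points with 32 distinct labels, and I want to evaluate a LinearSVR model by plotting its learning curve, using grid search and LabelKFold validation.

My code looks like this: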

import numpy as np
from sklearn import preprocessing
from sklearn.svm import LinearSVR
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import LabelKFold
from sklearn.grid_search import GridSearchCV
from sklearn.learning_curve import learning_curve
    ...
#get data (x, y, labels)
    ...
C_space = np.logspace(-3, 3, 10)
epsilon_space = np.logspace(-3, 3, 10)  

svr_estimator = Pipeline([
    ("scale", preprocessing.StandardScaler()),
    ("svr", LinearSVR()),
])

search_params = dict(
    svr__C = C_space,
    svr__epsilon = epsilon_space
)

kfold = LabelKFold(labels, 5)

svr_search = GridSearchCV(svr_estimator, param_grid = search_params, cv = ???)

train_space = np.linspace(.5, 1, 10)
train_sizes, train_scores, valid_scores = learning_curve(svr_search, x, y, train_sizes = train_space, cv = ???, n_jobs = 4)
    ...
#plot learning curve
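My question is how to set the cv argument of the grid search and of the learning curve so that the original set is split into training and test sets that share no labels, for computing the learning curve. Each of those training sets should then be split further into training and test sets that also share no labels, for the grid search.

Essentially, how do I run a nested LabelKFold?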


I, the user who created the bounty on this question, wrote the following reproducible example using data provided by sklearn.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score, LabelKFold

digits = load_digits()
X = digits['data']
Y = digits['target']
Z = np.zeros_like(Y) ## this is just to make a 2-class problem, purely for the sake of an example
Z[np.where(Y>4)]=1

strata = [x % 13 for x in xrange(Y.size)] # define the strata for use in

## define stuff for nested cv...
mtry = [5, 10]
tuned_par = {'max_features': mtry}
toy_rf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=10,
                                class_weight="balanced")
roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=True)

## define outer k-fold label-aware cv
outer_cv = LabelKFold(labels=strata, n_folds=5)

#############################################################################
##  this works: using regular randomly-allocated 5-fold CV in the inner folds
#############################################################################
vanilla_clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer,
                        cv=5, n_jobs=1)
vanilla_results = cross_val_score(vanilla_clf, X=X, y=Z, cv=outer_cv, n_jobs=1)

##########################################################################
##  this does not work: attempting to use label-aware CV in the inner loop
##  (inner_cv is built from the labels of the full data set, but GridSearchCV
##  is only fitted on each outer training subset, so the indices do not match)
##########################################################################
inner_cv = LabelKFold(labels=strata, n_folds=5)
nested_kfold_clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer,
                                cv=inner_cv, n_jobs=1)
nested_kfold_results = cross_val_score(nested_kfold_clf, X=X, y=Z, cv=outer_cv, n_jobs=1)
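
To illustrate the likely reason the second variant fails, here is a minimal NumPy-only sketch. It does not call scikit-learn. The toy labels and variable names are hypothetical and only mimic the situation where inner fold indices are computed against the full label array but then applied to an outer training subset.

```python
import numpy as np

# Hypothetical toy data: 12 samples in 4 groups of 3 samples each.
labels = np.repeat([0, 1, 2, 3], 3)

# Outer split: hold out group 0, leaving a 9-sample outer training set.
outer_train = np.where(labels != 0)[0]

# Inner test indices computed against the FULL label array (the failing
# pattern): they address positions 0..11, but the subset has only 0..8.
inner_test_full = np.where(labels == 3)[0]

# Inner test indices computed against the outer-training labels (the fix):
# these are positions inside the 9-sample subset.
sub_labels = labels[outer_train]
inner_test_sub = np.where(sub_labels == 3)[0]

subset_size = len(outer_train)
full_indices_fit = bool(inner_test_full.max() < subset_size)
sub_indices_fit = bool(inner_test_sub.max() < subset_size)
print(full_indices_fit, sub_indices_fit)
```

Indices derived from the full array either run past the end of the subset or point at the wrong rows, which is why the inner splitter has to be built from the labels of the outer training set.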

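From your question, you are looking for a LabelKFold score on your data while, in each iteration of that outer LabelKFold, grid-searching the parameters of your pipeline, again using LabelKFold. I could not get this to work out of the box, but it only takes one loop: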

outer_cv = LabelKFold(labels=strata, n_folds=3)
strata = np.array(strata)  # array form, so it can be indexed with the fold indices
scores = []
for outer_train, outer_test in outer_cv:
    print "Outer set. Train:", set(strata[outer_train]), "\tTest:", set(strata[outer_test])
    # build the inner folds from the labels of the outer training set only
    inner_cv = LabelKFold(labels=strata[outer_train], n_folds=3)
    print "\tInner:"
    for inner_train, inner_test in inner_cv:
        print "\t\tTrain:", set(strata[outer_train][inner_train]), "\tTest:", set(strata[outer_train][inner_test])
    # tune on the outer training set, then score once on the held-out outer test set
    clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer, cv=inner_cv, n_jobs=1)
    clf.fit(X[outer_train], Z[outer_train])
    scores.append(clf.score(X[outer_test], Z[outer_test]))
Running the code, the first iteration produces:

Outer set. Train: set([0, 1, 4, 5, 7, 8, 10, 11])   Test: set([9, 2, 3, 12, 6])
Inner:
    Train: set([0, 10, 11, 5, 7])   Test: set([8, 1, 4])
    Train: set([1, 4, 5, 8, 10, 11])    Test: set([0, 7])
    Train: set([0, 1, 4, 8, 7])     Test: set([10, 11, 5])

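So it is easy to verify that it behaves as intended. Your cross-validation scores are in the list scores, and you can process them as you like. I have used the variables you defined in your last piece of code, such as strata.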
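
For readers on a newer scikit-learn, the same loop can be sketched with the `sklearn.model_selection` module. This assumes version 0.18 or later, where `LabelKFold` was renamed `GroupKFold` and splitters take the groups at `split()` time rather than at construction. It is an adaptation, not the code from the answer above: the built-in `"roc_auc"` scoring string replaces the custom scorer, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

digits = load_digits()
X = digits.data
Z = (digits.target > 4).astype(int)   # same toy 2-class target as the example
groups = np.arange(Z.size) % 13       # same 13 strata as the example

rf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=10,
                            class_weight="balanced")
param_grid = {"max_features": [5, 10]}

outer_cv = GroupKFold(n_splits=3)
scores = []
for train_idx, test_idx in outer_cv.split(X, Z, groups=groups):
    # No group may appear on both sides of the outer split.
    assert not set(groups[train_idx]) & set(groups[test_idx])

    # The inner splitter receives the groups of the outer training subset
    # through fit(), so its indices refer to that subset.
    clf = GridSearchCV(rf, param_grid, scoring="roc_auc",
                       cv=GroupKFold(n_splits=3))
    clf.fit(X[train_idx], Z[train_idx], groups=groups[train_idx])

    # Score the refitted best model once on the held-out outer fold.
    scores.append(clf.score(X[test_idx], Z[test_idx]))

print(scores)
```

Because the groups are passed at fit time, the inner splitter never sees indices from the full data set, which avoids the mismatch that broke the `cross_val_score` attempt in the question.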

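That is what I ended up doing: writing the k-fold loop myself and running the grid search on the individual folds. I am the original asker, but I am not the one who put the bounty on this question. I am not sure how that works, but I will upvote this answer because it is the best solution I know of. However, I will wait for a response from the bounty holder before accepting it.

This looks very workable. I will give it a try.

It seems this is the right approach. Thank you very much. FWIW, @Alex, I believe only I can award the bounty, so geompalik can expect it within the next 24 hours.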