Python: GridSearchCV and a for loop disagree on the best n_components for PCA

I am trying to use grid search to find the best value of n_components to use in PCA:

from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression


pca = PCA()
pipe_lr = Pipeline([('pca', pca),
                    ('regr', LinearRegression())])

param_grid = [{'pca__n_components': range(2, X.shape[1])}]

gs = GridSearchCV(estimator=pipe_lr, 
                  param_grid=param_grid, 
                  cv=3)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

for i in range(2, X.shape[1]):
    pca.n_components = i
    pipe_lr = pipe_lr.fit(X_train, y_train)
    print(i, pipe_lr.score(X_test, y_test))
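However, the results I see are very strange: the numbers I get from the for loop are completely different from the ones I get from the grid search.

According to the for loop, the best value of n_components should be around 28, but that is nowhere near the value I get from the grid search.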


Note: I have not included the steps that set up the training and test sets, but I used sklearn's train_test_split.

GridSearchCV reports a cross-validated score. Adding cross-validation inside the for loop would probably give closer results.

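In addition, you are scoring on different data. You mention that you used train_test_split. In the for loop, you get the score on X_test, y_test. In GridSearchCV, you get the mean score over the validation folds taken from X_train, y_train. Your test set may also contain outliers.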
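To make the difference between the two numbers concrete, here is a minimal sketch. The question's X and y are not shown, so the data comes from make_regression as a synthetic stand-in, and the sample count, feature count, and random_state values are arbitrary assumptions. It prints the mean cross-validation score computed inside the training data next to the score of the refitted pipeline on the held-out test set.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; the question's data is not shown).
X, y = make_regression(n_samples=300, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

pipe = Pipeline([('pca', PCA()), ('regr', LinearRegression())])
param_grid = {'pca__n_components': list(range(2, X.shape[1]))}

gs = GridSearchCV(pipe, param_grid, cv=3)
gs.fit(X_train, y_train)

# Mean R^2 over the 3 validation folds, all taken from X_train.
cv_score_on_train = gs.best_score_
# R^2 of the best pipeline, refitted on all of X_train, on the test set.
test_score = gs.score(X_test, y_test)

print('best params:', gs.best_params_)
print('mean CV score (training data):', cv_score_on_train)
print('held-out test score:', test_score)
```

The two printed scores are generally not equal, because they are computed on different samples and with models trained on different amounts of data.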

I modified your code slightly and applied it to the Boston dataset:

from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release
import numpy as np
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in 0.20


boston = load_boston()
X = boston.data
y = boston.target

pca = PCA()
pipe_lr = Pipeline([('pca', pca),
                    ('regr', LinearRegression())])

param_grid = {'pca__n_components': np.arange(2, X.shape[1])}

gs = GridSearchCV(estimator=pipe_lr, 
                  param_grid=param_grid, 
                  cv=3)
gs = gs.fit(X, y)
print(gs.best_score_)
print(gs.best_params_)


all_scores = []
for i in range(2, X.shape[1]):
    pca.n_components = i
    scores = cross_val_score(pipe_lr,X,y,cv=3)
    all_scores.append(np.mean(scores))
    print(i,np.mean(scores))

# all_scores[0] belongs to n_components = 2, so add the offset back
best_idx = int(np.argmax(all_scores))
print('Best result:', best_idx + 2, all_scores[best_idx])
This gives:

0.35544286032
{'pca__n_components': 9}
2 -0.419093097857
3 -0.192078129541
4 -0.24988282122
5 -0.0909566048894
6 0.197185975618
7 0.173454370084
8 0.276509863992
9 0.355148081819
10 -17.2280089182
11 -0.291804450954
12 -0.281263153468
Best result: 9 0.355148081819
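
The per-candidate scores from the manual loop can also be read directly from the fitted grid search through its cv_results_ attribute. The sketch below uses synthetic data from make_regression as a stand-in, since load_boston is not available in recent scikit-learn releases. The data sizes and random_state are arbitrary assumptions. It checks that the cross_val_score loop and GridSearchCV agree when both use the same data and the same cv setting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption, not the Boston dataset).
X, y = make_regression(n_samples=210, n_features=8, n_informative=4,
                       noise=5.0, random_state=0)

pipe = Pipeline([('pca', PCA()), ('regr', LinearRegression())])
candidates = list(range(2, X.shape[1]))

gs = GridSearchCV(pipe, {'pca__n_components': candidates}, cv=3)
gs.fit(X, y)

# One mean validation score per candidate, in the order of cv_results_['params'].
grid_means = gs.cv_results_['mean_test_score']

# The same quantity computed by hand with cross_val_score.
loop_means = []
for n in candidates:
    pipe.set_params(pca__n_components=n)
    loop_means.append(np.mean(cross_val_score(pipe, X, y, cv=3)))

for params, g, m in zip(gs.cv_results_['params'], grid_means, loop_means):
    print(params['pca__n_components'], g, m)

best_n = candidates[int(np.argmax(loop_means))]
print('Best by loop:', best_n, 'Best by grid search:', gs.best_params_)
```

With identical data and folds, the two columns should match up to floating-point error, which suggests that the gap in the question comes from scoring on the test set rather than from GridSearchCV itself.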