Python: grid search and a manual loop disagree on the best PCA n_components
I am trying to use grid search to find the best value of n_components to use in PCA:
from sklearn.decomposition import PCA
# GridSearchCV now lives in sklearn.model_selection (sklearn.grid_search was removed in 0.20)
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pca = PCA()
pipe_lr = Pipeline([('pca', pca),
                    ('regr', LinearRegression())])

# Grid search: 3-fold cross-validation on the training set only
param_grid = [{'pca__n_components': list(range(2, X.shape[1]))}]
gs = GridSearchCV(estimator=pipe_lr,
                  param_grid=param_grid,
                  cv=3)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

# Manual loop: fit on the training set, score on the held-out test set
for i in range(2, X.shape[1]):
    pca.n_components = i
    pipe_lr = pipe_lr.fit(X_train, y_train)
    print(i, pipe_lr.score(X_test, y_test))
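However, the results I see are very strange: the numbers from the for loop are completely different from the ones the grid search reports.
According to the for loop, the best value of n_components should be around 28, but that is nowhere near the value I get from the grid search.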
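Note: I did not include the steps that set up the training and test sets, but I used sklearn's train_test_split.

GridSearchCV reports a cross-validation score. If you add cross-validation to the for loop, you will probably get results that are much closer.
You are also scoring on different data. You mention that you used train_test_split. In the for loop you get the score on X_test, y_test. In GridSearchCV you get the mean score over the held-out folds of X_train, y_train. Your test set may also contain outliers.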
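To make the point above concrete, here is a minimal sketch on synthetic data. The dataset from make_regression, the split sizes, and svd_solver='full' are assumptions chosen only for illustration, since the question's X and y are not shown. It suggests that a manual loop using cross_val_score on the training set reproduces the grid search scores, while scoring on the test set is a different measurement.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: the real X and y are not available).
X, y = make_regression(n_samples=300, n_features=12, n_informative=6,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# svd_solver='full' keeps PCA deterministic so both approaches are comparable.
pipe = Pipeline([('pca', PCA(svd_solver='full')),
                 ('regr', LinearRegression())])
candidates = list(range(2, X.shape[1]))

# Grid search: mean 3-fold CV score on the training set per candidate.
gs = GridSearchCV(pipe, {'pca__n_components': candidates}, cv=3)
gs.fit(X_train, y_train)
grid_cv = gs.cv_results_['mean_test_score']

manual_cv = []    # same measurement, done by hand
test_scores = []  # what the question's loop measured instead
for k in candidates:
    pipe.set_params(pca__n_components=k)
    manual_cv.append(cross_val_score(pipe, X_train, y_train, cv=3).mean())
    test_scores.append(pipe.fit(X_train, y_train).score(X_test, y_test))

for k, g, m, t in zip(candidates, grid_cv, manual_cv, test_scores):
    print(k, round(g, 4), round(m, 4), round(t, 4))
```

The second and third printed columns should agree, because both are 3-fold cross-validation means on X_train. The last column comes from a single fit scored on X_test, so it has no reason to match.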
I modified your code slightly and applied it to the Boston dataset:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# load_boston was removed in scikit-learn 1.2, so this script needs an older release
from sklearn.datasets import load_boston
import numpy as np
# sklearn.grid_search and sklearn.cross_validation were replaced by sklearn.model_selection
from sklearn.model_selection import GridSearchCV, cross_val_score

boston = load_boston()
X = boston.data
y = boston.target

pca = PCA()
pipe_lr = Pipeline([('pca', pca),
                    ('regr', LinearRegression())])

# Grid search with 3-fold cross-validation on the full dataset
param_grid = {'pca__n_components': np.arange(2, X.shape[1])}
gs = GridSearchCV(estimator=pipe_lr,
                  param_grid=param_grid,
                  cv=3)
gs = gs.fit(X, y)
print(gs.best_score_)
print(gs.best_params_)

# Manual loop using the same 3-fold cross-validation
all_scores = []
for i in range(2, X.shape[1]):
    pca.n_components = i
    scores = cross_val_score(pipe_lr, X, y, cv=3)
    all_scores.append(np.mean(scores))
    print(i, np.mean(scores))

# index() returns the position in all_scores; position p means n_components = p + 2
print('Best result:', all_scores.index(max(all_scores)), max(all_scores))
This gives:
0.35544286032
{'pca__n_components': 9}
2 -0.419093097857
3 -0.192078129541
4 -0.24988282122
5 -0.0909566048894
6 0.197185975618
7 0.173454370084
8 0.276509863992
9 0.355148081819
10 -17.2280089182
11 -0.291804450954
12 -0.281263153468
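Best result: 7 0.355148081819

The 7 in the last line is the position in all_scores, which starts at n_components = 2, so it corresponds to n_components = 9, the same value the grid search selects.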
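The per-candidate scores can also be read directly from the fitted grid search instead of recomputing them in a loop. The sketch below is an illustration under assumptions: it uses load_diabetes as a stand-in dataset (load_boston is not available in recent scikit-learn releases), so its numbers will not match the output above. It also maps the best list position back to its n_components value rather than printing the raw index.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Stand-in dataset (assumption): bundled with scikit-learn, no download needed.
X, y = load_diabetes(return_X_y=True)

pipe = Pipeline([('pca', PCA(svd_solver='full')),
                 ('regr', LinearRegression())])
candidates = list(range(2, X.shape[1]))

gs = GridSearchCV(pipe, {'pca__n_components': candidates}, cv=3)
gs.fit(X, y)

# One mean CV score per candidate, in the same order as `candidates`.
means = gs.cv_results_['mean_test_score']
for k, score in zip(candidates, means):
    print(k, score)

# Translate the position of the best score into the parameter value.
best_pos = int(np.argmax(means))
best_k = candidates[best_pos]
print('Best result:', best_k, means[best_pos])
```

Looking up candidates[best_pos] avoids the off-by-two confusion that arises when a list index is reported as if it were the number of components.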