Machine learning: multiple dimensionality reduction techniques with Pipeline and GridSearchCV

Tags: machine-learning, scikit-learn, grid-search, hyperparameters, dimensionality-reduction

We all know the usual way of defining a pipeline with a dimensionality reduction technique followed by a model for training and testing. We can then apply GridSearchCV for hyperparameter tuning:
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

grid = GridSearchCV(
    Pipeline([
        ('reduce_dim', PCA()),
        ('classify', RandomForestClassifier(n_jobs=-1))
    ]),
    param_grid=[
        {
            # range() only accepts integers, so fractional values
            # have to be given as an explicit list:
            'reduce_dim__n_components': [0.7, 0.8],
            'classify__n_estimators': range(10, 50, 5),
            'classify__max_features': ['auto', 0.2],
            'classify__min_samples_leaf': [40, 50, 60],
            'classify__criterion': ['gini', 'entropy']
        }
    ],
    cv=5, scoring='f1')
grid.fit(X, y)  # X, y: the training data
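A side note on that grid: the original range(0.7, 0.9, 0.1) raises a TypeError, because range() only takes integers, so the fractional values are listed explicitly above. A float n_components in (0, 1) is meaningful for PCA, though: it keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch of that behaviour (using load_digits purely as stand-in data):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)

# A float n_components in (0, 1) asks PCA for the smallest number of
# components that explains at least that fraction of the variance.
pca = PCA(n_components=0.8).fit(X)
print(pca.n_components_)                    # components kept for 80% variance
print(pca.explained_variance_ratio_.sum())  # cumulative ratio, >= 0.8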
I can understand the code above.

Now, going through the documentation today, I found a part of the code there that looks a bit strange to me:
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([
    # the reduce_dim stage is populated by the param_grid
    ('reduce_dim', 'passthrough'),  # How does this work??
    ('classify', LinearSVC(dual=False, max_iter=10000))
])
N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,  ### No PCA is used..??
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]
reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']
grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
X, y = load_digits(return_X_y=True)
grid.fit(X, y)
My confusion is with these lines:

('reduce_dim', 'passthrough'),

and

'reduce_dim': [PCA(iterated_power=7), NMF()],
'reduce_dim__n_components': N_FEATURES_OPTIONS,  # here

How does this work? The step is declared as 'passthrough' and no PCA appears in the pipeline itself, yet n_components is being set for it, and the reducer labels are ['PCA', 'NMF', 'KBest(chi2)'].
Edit: answered by seralouk (see the answer below). In case anyone wants more details, please refer to it.
As far as I know, the two forms are equivalent. Regarding

('reduce_dim', 'passthrough'),

in the documentation you have the following:
pipe = Pipeline([
    # the reduce_dim stage is populated by the param_grid
    ('reduce_dim', 'passthrough'),
    ('classify', LinearSVC(dual=False, max_iter=10000))
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]

param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]
Initially we have ('reduce_dim', 'passthrough'), and then 'reduce_dim': [PCA(iterated_power=7), NMF()]. The definition of the PCA is done in that second line: the param_grid entries fill in the 'passthrough' placeholder.
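Under the hood there is no magic: for each candidate, GridSearchCV effectively clones the pipeline and calls set_params, which both replaces the 'passthrough' placeholder and sets the nested parameter. A minimal sketch of a single candidate, reusing the pipe defined above:

from sklearn.base import clone

# One grid-search candidate: swap the placeholder for a real estimator,
# then set its nested n_components parameter.
candidate = clone(pipe)
candidate.set_params(reduce_dim=PCA(iterated_power=7), reduce_dim__n_components=4)
print(candidate.named_steps['reduce_dim'])  # PCA(iterated_power=7, n_components=4)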
Alternatively, you could define:
pipe = Pipeline([
    # the reduce_dim stage starts as PCA but can still be replaced by the param_grid
    ('reduce_dim', PCA(iterated_power=7)),
    ('classify', LinearSVC(dual=False, max_iter=10000))
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]

param_grid = [
    {
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]
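Either way, you can see exactly what the search will try by expanding the grid with ParameterGrid, the same helper GridSearchCV uses internally. A minimal sketch, assuming the alternative param_grid just defined (PCA baked into the pipeline, plus SelectKBest):

from sklearn.model_selection import ParameterGrid

candidates = list(ParameterGrid(param_grid))
print(len(candidates))  # 3 * 4 = 12 PCA candidates + 1 * 3 * 4 = 12 SelectKBest candidates
print(candidates[0])    # e.g. {'classify__C': 1, 'reduce_dim__n_components': 2}

Each candidate is one complete pipeline configuration; the reducers are never chained together.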
So later on, an actual object is assigned in place of 'passthrough'. But how does

'reduce_dim': [PCA(iterated_power=7), NMF()]

work? Will grid search try them one after the other?

It uses both: to reduce the dimensionality it will first fit the pipeline with PCA on the dataset, then with NMF, and then it also checks SelectKBest(chi2). Whichever scores higher is selected... right?

Exactly. The order tried is ['PCA', 'NMF', 'KBest(chi2)'].
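To verify, after the grid.fit(X, y) call from the documentation example you can inspect which reducer actually scored highest; a short sketch (pandas is used only for display):

import pandas as pd

# The winning combination, including which reducer was chosen:
print(grid.best_params_['reduce_dim'])
print(grid.best_estimator_.named_steps['reduce_dim'])

# Mean test score for every candidate, side by side with its reducer:
results = pd.DataFrame(grid.cv_results_)
print(results[['param_reduce_dim', 'mean_test_score']])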