Machine learning: multiple dimensionality-reduction techniques with Pipeline and grid search

Tags: machine-learning, scikit-learn, grid-search, hyperparameters, dimensionality-reduction

We all know the common way of defining a pipeline with a dimensionality-reduction technique followed by a model for training and testing, and then applying GridSearchCV for hyperparameter tuning:

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

grid = GridSearchCV(
    Pipeline([
        ('reduce_dim', PCA()),
        ('classify', RandomForestClassifier(n_jobs=-1))
    ]),
    param_grid=[
        {
            # range() only accepts integers; floats in (0, 1) make PCA
            # keep enough components to explain that fraction of variance
            'reduce_dim__n_components': [0.7, 0.8, 0.9],
            'classify__n_estimators': range(10, 50, 5),
            # 'auto' was removed in recent scikit-learn; 'sqrt' is equivalent
            'classify__max_features': ['sqrt', 0.2],
            'classify__min_samples_leaf': [40, 50, 60],
            'classify__criterion': ['gini', 'entropy']
        }
    ],
    cv=5, scoring='f1')
grid.fit(X, y)
I can understand the code above.
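For reference, once grid.fit(X, y) has run, the tuned model can be inspected like this (a minimal sketch, assuming X and y are your own training data):

# inspecting the fitted search from the snippet above
print(grid.best_score_)            # best mean cross-validated f1 score
print(grid.best_params_)           # the winning parameter combination
best_model = grid.best_estimator_  # the pipeline refitted on all of X, y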

Now, going through the documentation today, I found a piece of code that looks a bit strange to me:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([
    # the reduce_dim stage is populated by the param_grid
    ('reduce_dim', 'passthrough'),                        # How does this work??
    ('classify', LinearSVC(dual=False, max_iter=10000))
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,   ### No PCA is used..??
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]
reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']

grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
X, y = load_digits(return_X_y=True)
grid.fit(X, y)
  • First, when defining the pipeline, it uses the string 'passthrough' instead of an estimator object:

        ('reduce_dim', 'passthrough'),
    
  • Then, when defining the different dimensionality-reduction techniques for the grid search, a different strategy is used:
    [PCA(iterated_power=7), NMF()]
    How does this work?
            'reduce_dim': [PCA(iterated_power=7), NMF()],
            'reduce_dim__n_components': N_FEATURES_OPTIONS,  # here
    
  • Could someone please explain this code to me?

    SOLVED - in one line, the order is
    ['PCA', 'NMF', 'KBest(chi2)']

    - provided by seralouk (see the answer below)

    If anyone wants more details, please refer to the answer below.
    As far as I know, this is equivalent:

        ('reduce_dim', 'passthrough'),
    

    In the documentation, you have the following:

    pipe = Pipeline([
        # the reduce_dim stage is populated by the param_grid
        ('reduce_dim', 'passthrough'),
        ('classify', LinearSVC(dual=False, max_iter=10000))
    ])
    
    N_FEATURES_OPTIONS = [2, 4, 8]
    C_OPTIONS = [1, 10, 100, 1000]
    param_grid = [
        {
            'reduce_dim': [PCA(iterated_power=7), NMF()],
            'reduce_dim__n_components': N_FEATURES_OPTIONS,
            'classify__C': C_OPTIONS
        },
        {
            'reduce_dim': [SelectKBest(chi2)],
            'reduce_dim__k': N_FEATURES_OPTIONS,
            'classify__C': C_OPTIONS
        },
    ]
    
    Initially we have
    ('reduce_dim', 'passthrough'),
    and then
    'reduce_dim': [PCA(iterated_power=7), NMF()]

    The definition of PCA is done in the second line: the grid search substitutes each of these estimators in place of 'passthrough'.
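    To see why this works: a pipeline step is itself a parameter that can be reassigned with set_params, and GridSearchCV does exactly that (on a clone of the pipeline) for every candidate. A minimal sketch of what happens for one candidate from the grid above:

    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    pipe = Pipeline([
        ('reduce_dim', 'passthrough'),
        ('classify', LinearSVC(dual=False, max_iter=10000))
    ])

    # roughly what GridSearchCV does internally for the candidate
    # {'reduce_dim': PCA(iterated_power=7), 'reduce_dim__n_components': 2}
    pipe.set_params(reduce_dim=PCA(iterated_power=7),
                    reduce_dim__n_components=2)
    print(pipe.named_steps['reduce_dim'])  # PCA(iterated_power=7, n_components=2)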


    You could also define:

    pipe = Pipeline([
        # the reduce_dim stage is populated by the param_grid
        ('reduce_dim', PCA(iterated_power=7)),
        ('classify', LinearSVC(dual=False, max_iter=10000))
    ])
    
    N_FEATURES_OPTIONS = [2, 4, 8]
    C_OPTIONS = [1, 10, 100, 1000]
    param_grid = [
        {
            'reduce_dim__n_components': N_FEATURES_OPTIONS,
            'classify__C': C_OPTIONS
        },
        {
            'reduce_dim': [SelectKBest(chi2)],
            'reduce_dim__k': N_FEATURES_OPTIONS,
            'classify__C': C_OPTIONS
        },
    ]
    

    So later on it assigns an object in place of 'passthrough'. But how does
    'reduce_dim': [PCA(iterated_power=7), NMF()]
    work? Does the grid search try them one after the other, or does it use both at the same time, first applying PCA to the dataset and then NMF, and then checking with
    SelectKBest(chi2)
    ? Whichever scores higher, it selects... right?

    That's right. They are tried one by one; the order is
    ['PCA', 'NMF', 'KBest(chi2)']
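    To verify this yourself, you can group the cross-validated scores by reducer after fitting. A minimal sketch, assuming the grid, param_grid and reducer_labels defined above (matching candidates by class name here is only for illustration):

    results = grid.cv_results_
    # 'param_reduce_dim' records which reducer object each of the
    # 36 candidates (2*3*4 + 1*3*4) actually used, so the mean CV
    # scores can be grouped by reducer:
    for label in reducer_labels:
        name = label.split('(')[0]  # 'PCA', 'NMF', 'KBest'
        scores = [s for r, s in zip(results['param_reduce_dim'],
                                    results['mean_test_score'])
                  if name in type(r).__name__]
        print(label, 'best mean CV accuracy:', max(scores))

    print(grid.best_params_['reduce_dim'])  # the overall winner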