Python: How to avoid unnecessary recomputation in GridSearchCV?


python scikit-learn


I want to apply a grid search to determine how many features should be selected:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

data = load_breast_cancer()

parameters = {'select__k': range(1,11)}

p = Pipeline([('select', SelectKBest(chi2)), ('model', LogisticRegression())])
clf = GridSearchCV(p, parameters, cv=10, refit=False)
clf.fit(data.data, data.target)

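So for each fold, a feature ranking is computed. However, instead of computing the ranking once per fold, sklearn computes it (number of folds) x (number of parameter values) times. In this case that is 100 times instead of 10. Is there a way to give sklearn a hint so that it avoids the recomputation?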
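To make the count concrete, here is a minimal sketch that wraps `chi2` in a counting function and runs the same grid search. The names `calls` and `counting_chi2` are hypothetical helpers introduced only for this illustration, and `max_iter=1000` is an assumption added to reduce convergence warnings. With the default single-process execution, the counter should end at 100.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical counter, not part of scikit-learn.
calls = {"chi2": 0}


def counting_chi2(X, y):
    """Delegate to chi2 while counting how often the ranking is computed."""
    calls["chi2"] += 1
    return chi2(X, y)


X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("select", SelectKBest(counting_chi2)),
    ("model", LogisticRegression(max_iter=1000)),
])

# 10 folds x 10 values of k; refit=False avoids one extra fit at the end.
search = GridSearchCV(pipe, {"select__k": range(1, 11)}, cv=10, refit=False)
search.fit(X, y)  # default n_jobs=None keeps all fits in this process

print(calls["chi2"])
```

The counter only works because everything runs in one process. With `n_jobs` greater than 1, each worker would see its own copy of `calls`.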

Update:

I found a solution, but it is fairly hacky. So if you have a better idea, please let me know:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np

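# Module-level cache: maps a key derived from the training-fold rows to the
# scores computed on that fold. Only shared when the search runs in one process.
map_fold2ranking = {}

class WrapperSelection(SelectKBest):
    def __init__(self, selection, k=10):
        self.k = k
        self.selection = selection

    def fit(self, X, y=None):
        # Crude fold identifier: the sum of the DataFrame row indices.
        # Two different folds could in principle produce the same sum.
        hash_for_fold_ids = np.sum(X.index.values)

        # Reuse the ranking if this fold has been seen before.
        if hash_for_fold_ids in map_fold2ranking:
            self.scores_ = map_fold2ranking[hash_for_fold_ids]
            return self
        # Otherwise compute the ranking once and store it for later values of k.
        self.selection.fit(X,y)
        map_fold2ranking[hash_for_fold_ids] = self.selection.scores_
        self.scores_ = self.selection.scores_

        return self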

data = load_breast_cancer()

parameters = {'select__k': range(1, 11)}

p = Pipeline([('select', WrapperSelection(SelectKBest(chi2))), ('model', LogisticRegression())])
clf = GridSearchCV(p, parameters, cv=10, refit=False)
clf.fit(pd.DataFrame(data.data), data.target)

Thanks in advance.

Best regards, Felix

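"However, instead of computing the ranking only once, sklearn computes it number of folds times number of parameters times"

That is exactly the intent. The model is evaluated on every partition of the data, and for every parameter value on each partition.

From the documentation:

"Exhaustive search over specified parameter values for an estimator."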


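So you don't agree that we can save computation in this case?

The computations are not repeated. They are done on different partitions of the data. If you want to save some computing time, you can look into randomized search (RandomizedSearchCV). It works along the same lines as grid search but generally iterates over fewer parameter combinations.

The computations are repeated. It is enough to compute the ranking once per fold. That means that for 10-fold cross-validation we need to compute the ranking ten times. After that, we can apply the different values of k to those rankings.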
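The idea in the last comment can be sketched as a manual cross-validation loop: compute the chi2 scores once per training fold, then reuse that ranking for every k. This is only an illustration, not a scikit-learn feature. The variable names, the `max_iter=1000` setting, and the use of `StratifiedKFold(n_splits=10)` to mirror `cv=10` for a classifier are assumptions, and ties between equal scores may be broken differently than in `SelectKBest`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
ks = range(1, 11)
cv = StratifiedKFold(n_splits=10)

fold_scores = {k: [] for k in ks}
n_rankings = 0  # how many times the ranking is actually computed

for train_idx, test_idx in cv.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # One ranking per fold, computed on the training part only.
    chi2_scores, _ = chi2(X_train, y_train)
    n_rankings += 1
    order = np.argsort(chi2_scores)[::-1]  # best feature first

    # Reuse the same ranking for every candidate k.
    for k in ks:
        cols = order[:k]
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[:, cols], y_train)
        fold_scores[k].append(model.score(X_test[:, cols], y_test))

mean_scores = {k: float(np.mean(v)) for k, v in fold_scores.items()}
best_k = max(mean_scores, key=mean_scores.get)
print(n_rankings, best_k)
```

The model is still fitted 100 times, since each k yields a different feature subset, but the ranking itself is computed only 10 times.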