"Parallel" pipeline to get the best model using grid search

Tags: python, machine-learning, scikit-learn, grid-search

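In sklearn, a serial pipeline can be defined to find the best combination of hyperparameters across all consecutive parts of the pipeline. A serial pipeline can be implemented as follows: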

from sklearn.svm import SVC
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
X_train = digits.data
y_train = digits.target

# Use Principal Component Analysis to reduce dimensionality
# and improve generalization
pca = decomposition.PCA()
# Support vector classifier (the kernel is chosen by the grid below)
svm = SVC()
# Combine PCA and SVC into a pipeline
pipe = Pipeline(steps=[('pca', pca), ('svm', svm)])
# Candidate numbers of PCA components
n_components = [20, 40, 64]
params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca__n_components': n_components,
}
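For reference, a grid like the one above can be handed straight to GridSearchCV. The following is a minimal sketch, not part of the original question: it uses a reduced grid and `cv=3` (both assumptions, chosen only to keep the run short) on the same digits data.

```python
from sklearn import datasets, decomposition
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train, y_train = digits.data, digits.target

pipe = Pipeline(steps=[('pca', decomposition.PCA()), ('svm', SVC())])

# Reduced grid (assumption): fewer values than in the question so it runs quickly
params_grid = {
    'svm__C': [1, 10],
    'svm__kernel': ['rbf'],
    'svm__gamma': [0.001],
    'pca__n_components': [20, 40],
}

# Every combination of the values above is cross-validated
grd = GridSearchCV(pipe, param_grid=params_grid, cv=3)
grd.fit(X_train, y_train)

print(grd.best_params_)
print(round(grd.best_score_, 3))
```

The keys follow the `<step name>__<parameter>` convention, which is what lets one grid reach into every step of the pipeline.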
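But what if I want to try different algorithms for each step of the pipeline? How can I grid search over, for example:

Principal Component Analysis OR Singular Value Decomposition, AND Support Vector Machines OR Random Forest

This would require some kind of second-level or "meta grid search", since the type of model would be one of the hyperparameters. Is that possible in sklearn?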

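Pipeline supports None in its steps (the list of estimators), which lets you switch off a part of the pipeline.

You can set one of the pipeline's named_steps to None, so that the estimator is not used, by setting it in the parameters passed to GridSearchCV.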
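A quick way to see the effect of switching a step off is to compare the output width of a small pipeline before and after. This is a sketch under assumptions: the `StandardScaler` final step and `n_components=5` are illustrative choices, not part of the original answer. Recent scikit-learn versions also accept the string `'passthrough'` in place of None.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)  # 64 features per sample

# Illustrative two-step pipeline (assumption): PCA followed by scaling
pipe = Pipeline(steps=[('pca', PCA(n_components=5)),
                       ('scale', StandardScaler())])
with_pca = pipe.fit_transform(X).shape[1]

# Setting the step to None disables it; the data skips straight to 'scale'
pipe.set_params(pca=None)
without_pca = pipe.fit_transform(X).shape[1]

print(with_pca, without_pca)
```

With the PCA step active the output should have 5 columns; with it set to None the original 64 features pass through unchanged.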

Let's assume you want to use PCA and TruncatedSVD.

svd = decomposition.TruncatedSVD()

Add svd to the pipeline:

pipe = Pipeline(steps=[('pca', pca), ('svd', svd), ('svm', svm)])

# Change params_grid -> instead of a dict, make it a list of dicts
# In the first element pass `svd = None`, and in the second `pca = None`
params_grid = [{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca__n_components': n_components,
    'svd': [None]
},
{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca': [None],
    'svd__n_components': n_components,
    'svd__algorithm': ['randomized']
}]
Now just pass the pipeline object to GridSearchCV:

grd = GridSearchCV(pipe, param_grid = params_grid)

Calling grd.fit() will search over both elements of the params_grid list, combining all the values from one element at a time.
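To check what a list-of-dicts grid expands to without fitting anything, it can be enumerated with `ParameterGrid` from `sklearn.model_selection`. The sketch below rebuilds the grid from the answer; the `svm_grid` helper dict is only a shorthand introduced here.

```python
from sklearn.model_selection import ParameterGrid

n_components = [20, 40, 64]

# Shorthand (introduced here) for the SVM part shared by both dicts
svm_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
}

params_grid = [
    # PCA active, SVD switched off
    {**svm_grid, 'pca__n_components': n_components, 'svd': [None]},
    # SVD active, PCA switched off
    {**svm_grid, 'pca': [None], 'svd__n_components': n_components,
     'svd__algorithm': ['randomized']},
]

# Each dict is expanded on its own; values are never mixed across dicts
candidates = list(ParameterGrid(params_grid))
print(len(candidates))  # 48 from the first dict + 48 from the second = 96
print(candidates[0])
```

Because the dicts are expanded separately, no candidate ever has both reducers active, or both switched off.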

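Simplification if the parameters have the same names

If both estimators in the "OR" have the same parameter names, as in this case where PCA and TruncatedSVD both have n_components (or if you only want to search over this parameter), this can be simplified to: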

# Here the step has been renamed to `preprocessor`
pipe = Pipeline(steps=[('preprocessor', pca), ('svm', svm)])

# Now assign both estimators to `preprocessor` as below:
params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'preprocessor': [pca, svd],
    'preprocessor__n_components': n_components,
}
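After fitting, the winning estimator for the swapped step is available in `best_params_` under the step name. A minimal sketch, assuming a reduced grid, `cv=3`, and a fixed `random_state` for TruncatedSVD (all chosen here for a short, repeatable run):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

pca = PCA()
svd = TruncatedSVD(random_state=0)
pipe = Pipeline(steps=[('preprocessor', pca), ('svm', SVC())])

# Reduced grid (assumption); the estimator itself is one of the searched values
params_grid = {
    'svm__C': [1, 10],
    'svm__gamma': [0.001],
    'preprocessor': [pca, svd],
    'preprocessor__n_components': [20, 40],
}

grd = GridSearchCV(pipe, param_grid=params_grid, cv=3)
grd.fit(X, y)

# The best value for 'preprocessor' is an estimator object
winner = type(grd.best_params_['preprocessor']).__name__
print(winner, grd.best_params_['preprocessor__n_components'])
```

This only works when every candidate estimator accepts the shared parameter name; otherwise the list-of-dicts form is needed.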
Generalization of this scheme

We can write a function that automatically populates the param_grid to be supplied to GridSearchCV with the appropriate values:

import itertools

def make_param_grids(steps, param_grids):

    final_params = []

    # itertools.product builds every combination, so that
    # (pca OR svd) AND (svm OR rf) becomes ->
    # (pca, svm), (pca, rf), (svd, svm), (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}

        # step_name and estimator_name correspond positionally,
        # e.g. 'preprocessor' is filled from 'pca' or 'select'.
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids[estimator_name].items():
                if param == 'object':
                    # Set the actual estimator for this pipeline step
                    current_grid[step_name] = [value]
                else:
                    # Set the parameters belonging to that estimator
                    current_grid[step_name + '__' + param] = value
        # Append this dictionary to the final params
        final_params.append(current_grid)

    return final_params
And use this function on any number of transformers and estimators:

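from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest

# Put all the estimators you want to "OR" under a single key:
# OR between `pca` and `select`,
# OR between `svm` and `rf`.
# Different keys are evaluated as serial steps in the pipeline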
pipeline_steps = {'preprocessor':['pca', 'select'],
                  'classifier':['svm', 'rf']}

# fill parameters to be searched in this dict
all_param_grids = {'svm':{'object':SVC(), 
                          'C':[0.1,0.2]
                         }, 

                   'rf':{'object':RandomForestClassifier(),
                         'n_estimators':[10,20]
                        },

                   'pca':{'object':PCA(),
                          'n_components':[10,20]
                         },

                   'select':{'object':SelectKBest(),
                             'k':[5,10]
                            }
                  }  


# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
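It can help to look at what the generated list contains before fitting. The sketch below restates the function in compact Python 3 form so that it runs by itself (the compact body is a rewrite made here, not the answer's exact code) and prints which estimator pair each grid covers.

```python
import itertools
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC

def make_param_grids(steps, param_grids):
    """Compact restatement of the answer's helper (one grid per estimator combo)."""
    final_params = []
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}
        for step_name, estimator_name in zip(steps, estimator_names):
            for param, value in param_grids[estimator_name].items():
                if param == 'object':
                    current_grid[step_name] = [value]
                else:
                    current_grid[step_name + '__' + param] = value
        final_params.append(current_grid)
    return final_params

pipeline_steps = {'preprocessor': ['pca', 'select'],
                  'classifier': ['svm', 'rf']}
all_param_grids = {
    'svm': {'object': SVC(), 'C': [0.1, 0.2]},
    'rf': {'object': RandomForestClassifier(), 'n_estimators': [10, 20]},
    'pca': {'object': PCA(), 'n_components': [10, 20]},
    'select': {'object': SelectKBest(), 'k': [5, 10]},
}
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)

# One (preprocessor, classifier) class-name pair per generated grid
combos = [(type(g['preprocessor'][0]).__name__,
           type(g['classifier'][0]).__name__) for g in param_grids_list]
for combo, grid in zip(combos, param_grids_list):
    print(combo, sorted(k for k in grid if '__' in k))
```

Two steps with two alternatives each give four grids, and each grid carries only the parameters of the estimators it selects.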
Now initialize a pipeline object with the step names used in pipeline_steps above:

# The PCA() and SVC() used here are just to initialize the pipeline,
# actual estimators will be used from our `param_grids_list`
pipe = Pipeline(steps=[('preprocessor',PCA()), ('classifier', SVC())])  
Finally, set up the GridSearchCV object and fit the data:

grd = GridSearchCV(pipe, param_grid = param_grids_list)
# X_train and y_train are the digits data loaded in the question
grd.fit(X_train, y_train)
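Once such a search has been fitted, `cv_results_` can be grouped by estimator type to see how each alternative fared. A minimal sketch, assuming a hand-written two-dict grid that swaps only the classifier, a fixed `PCA(n_components=20)`, `cv=3`, and `random_state=0` (all illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline(steps=[('preprocessor', PCA(n_components=20)),
                       ('classifier', SVC())])

# One dict per classifier, each with its own parameters
param_grids_list = [
    {'classifier': [SVC()], 'classifier__C': [0.1, 0.2]},
    {'classifier': [RandomForestClassifier(random_state=0)],
     'classifier__n_estimators': [10, 20]},
]

grd = GridSearchCV(pipe, param_grid=param_grids_list, cv=3)
grd.fit(X, y)

# Best mean cross-validation score reached by each classifier type
best_by_type = {}
for params, score in zip(grd.cv_results_['params'],
                         grd.cv_results_['mean_test_score']):
    name = type(params['classifier']).__name__
    best_by_type[name] = max(score, best_by_type.get(name, 0.0))

print(best_by_type)
print(type(grd.best_estimator_.named_steps['classifier']).__name__)
```

`best_estimator_` holds the refitted pipeline with the winning classifier already in place, so it can be used for prediction directly.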

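You could add both types of estimators to the pipeline and set them to None in GridSearchCV.

Sounds like a practical solution. Could you integrate it into the example code above and post it as an answer?

I see you don't need an "OR" between SVC and random forest. I may edit this answer to make it a bit more general, so that it handles more estimators.

Thanks, I get the idea.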