Grid search in scikit-learn (Python)


python machine-learning scikit-learn


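I am working on a supervised machine learning algorithm, and it seems to behave strangely. Let me start from the beginning.

I have a function to which I pass different classifiers, their parameters, the training data, and its labels: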

from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV


def HT(targets, train_new, algorithm, parameters):
    # creating my scorer
    scorer = make_scorer(f1_score)
    # creating the grid search object with the parameters of the function
    grid_search = GridSearchCV(algorithm, param_grid=parameters,
                               scoring=scorer, cv=5)
    # fit the grid_search object to the data
    grid_search.fit(train_new, targets.ravel())
    # print the name of the classifier, the best score and best parameters
    print(algorithm.__class__.__name__)
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
    # assign the best estimator to the pipeline variable
    pipeline = grid_search.best_estimator_
    # predict the results for the training set
    results = pipeline.predict(train_new).astype(int)
    print(results)
    return pipeline

I pass parameters like the following to this function:

clf_param.append({'C': np.array([0.001, 0.01, 0.1, 1, 10]),
                  'kernel': ['linear', 'rbf'],
                  'decision_function_shape': ['ovr']})
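As a rough illustration of what a grid like this expands to, the sketch below enumerates the candidates with `ParameterGrid`. The variable names `param_grid` and `combinations` are made up for the example and are not part of the question's code.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as the grid in the question; the variable name is illustrative.
param_grid = {
    'C': np.array([0.001, 0.01, 0.1, 1, 10]),
    'kernel': ['linear', 'rbf'],
    'decision_function_shape': ['ovr'],
}

# ParameterGrid yields one dict per combination of parameter values.
combinations = list(ParameterGrid(param_grid))
print(len(combinations))  # 5 values of C * 2 kernels * 1 shape
for params in combinations[:3]:
    print(params)
```

With `cv=5`, each of these candidates is fitted and scored once per fold, so the search runs 50 fits before the final refit.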

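Well, this is where things start to get strange. The function reports an F1 score, but it differs from the score I calculate manually with the formula F1 = 2 * (precision * recall) / (precision + recall).

The difference is large (0.68 versus 0.89).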
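For reference, a minimal sketch of the manual calculation with the formula above, next to `f1_score` with its default settings. The labels are invented for illustration and are not the data from the question.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical binary labels, invented purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
manual_f1 = 2 * (precision * recall) / (precision + recall)

# On the same labels and predictions, the formula and f1_score agree.
print(manual_f1, f1_score(y_true, y_pred))
```

When both numbers come from the same labels and predictions, they match, so a gap such as the one described here suggests the two scores were computed on different predictions or with different averaging.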

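What am I doing wrong in the function? Shouldn't the score computed by the grid search (grid_search.best_score_) be the same as the score on the whole training set (from grid_search.best_estimator_.predict(train_new))?

Thanks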

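The score you are calculating manually takes into account the global true positives and negatives across all classes. In scikit-learn, however, the default for f1_score is the binary average, i.e. it is computed for the positive class only.

So, to get the same score, define the scorer with f1_score as follows: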

scorer=make_scorer(f1_score, average='micro')

Or simply, in GridSearchCV, use:

scoring = 'f1_micro'
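To make the difference between the two averaging modes concrete, here is a small sketch on made-up binary labels (not the data from the question). It also shows that, for a single-label problem, the micro-averaged F1 coincides with accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical binary labels, invented purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

# Default average='binary': only the positive class (label 1) is scored.
f1_binary = f1_score(y_true, y_pred)

# average='micro': TP, FP and FN are pooled over both classes.
f1_micro = f1_score(y_true, y_pred, average='micro')

acc = accuracy_score(y_true, y_pred)
print(f1_binary, f1_micro, acc)
```

Here the binary F1 is 0.4 while the micro F1 and the accuracy are both 0.625, so the choice of `average` alone can move the reported score considerably.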

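For more information on how score averaging is done, see:

You may also want to look at the following answer, which describes score calculation in scikit-learn in detail:

Edit: Changed macro to micro. As stated in the documentation:

'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.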
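The 'micro' definition quoted above can be reproduced by hand. The sketch below pools the counts over every class on made-up three-class labels (an assumption for illustration) and compares the result with `f1_score(..., average='micro')`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical three-class labels, invented purely for illustration.
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 2, 2, 1, 1, 0, 2, 0])

# Pool true positives, false positives and false negatives over all classes.
tp = fp = fn = 0
for label in np.unique(y_true):
    tp += np.sum((y_pred == label) & (y_true == label))
    fp += np.sum((y_pred == label) & (y_true != label))
    fn += np.sum((y_pred != label) & (y_true == label))

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(micro_f1, f1_score(y_true, y_pred, average='micro'))
```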

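Please specify how you are calculating the score manually. Is this a binary or a multi-label classification problem? Also, change the question title to something more appropriate that relates to the score difference. The current title has little to do with your actual question.

Thank you for your answer, Vivek. My problem is a binary classification problem. I know the training data and the labels, and I am applying the formula. Also, after running the grid search, do I need to fit the model again on the whole training set with the best parameters before making predictions? I assume that a grid search with cross-validation only returns a classifier fitted on part of the training set.

@Vlad No. The GridSearchCV estimator refits on the whole training data with the best parameters. You can check the documentation. Its constructor has a parameter called refit, which is True by default, so it refits with the best parameters on all the data supplied to it.

Thanks, Vivek. That was a big help.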
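The refit behaviour described in the comments can be checked with a short sketch. It assumes synthetic data from `make_classification` and a small `SVC` grid, neither of which comes from the question. It also shows that `best_score_` is the mean score over the held-out folds, whereas a score computed from `best_estimator_.predict` on the training set uses data the refitted model has already seen, so the two numbers need not agree.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary data, used here only as a stand-in for the real training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid_search = GridSearchCV(
    SVC(),
    param_grid={'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    scoring=make_scorer(f1_score),
    cv=5,
)  # refit=True is the default
grid_search.fit(X, y)

# best_score_ is the mean held-out-fold score of the best parameter setting.
mean_cv = grid_search.cv_results_['mean_test_score'][grid_search.best_index_]

# best_estimator_ has been refit on all of X, so this is a training-set score.
train_f1 = f1_score(y, grid_search.best_estimator_.predict(X))

print(grid_search.best_score_, mean_cv, train_f1)
```

Because the refit happens automatically, calling `grid_search.predict` directly gives the same predictions as `grid_search.best_estimator_.predict`.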