Python: Why is the accuracy of the GridSearchCV approach lower than that of the standard approach?
I use train_test_split (random_state=0) and a decision tree to model my data, and I ran it about 50 times to get the best accuracy.
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the first sheet of the workbook.
data = pd.read_excel(r"D:\Laptop.xlsx", sheet_name=0)

# No random_state is passed here, so every run produces a different split.
train, test = train_test_split(data, test_size=0.15)
print("Training size: {}; Test size: {}".format(len(train), len(test)))

features = ["Brand", "Size", "CPU", "RAM", "Resolution", "Class"]
x_train = train[features]
y_train = train["K=20"]
x_test = test[features]
y_test = test["K=20"]

# Fit a decision tree with default hyperparameters.
c = DecisionTreeClassifier()
c.fit(x_train, y_train)
y_pred = c.predict(x_test)

score = accuracy_score(y_test, y_pred) * 100
print("Accuracy using Decision Tree:", round(score, 1), "%")
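The procedure described above (about 50 runs, keeping the best accuracy) can be sketched as a loop. This is only an illustration under assumptions: the data is a synthetic stand-in generated with make_classification rather than the original Laptop.xlsx, and the helper name run_once is made up for this sketch. It shows how much the test accuracy of the same default tree can move from one random split to the next, which is worth keeping in mind when the best of many runs is compared against a single fixed split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the laptop data (an assumption, not the original file).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)


def run_once(seed):
    """Train a default tree on one random 85/15 split and return test accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size=0.15, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
    return accuracy_score(y_test, model.predict(x_test))


# 50 runs, each with a different split, as in the description above.
scores = [run_once(seed) for seed in range(50)]
best, worst = max(scores), min(scores)
mean = sum(scores) / len(scores)
print(f"best={best:.3f} mean={mean:.3f} worst={worst:.3f}")
```

The best of the 50 scores is by construction at least as high as the average, so it tends to flatter the model compared with a score measured on one fixed split.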
In the second step, I decided to use GridSearchCV to set the tree parameters.
import pandas as pd
from scipy.stats import randint
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the first sheet of the workbook.
data = pd.read_excel(r"D:\Laptop.xlsx", sheet_name=0)

# The split is fixed this time (random_state=0).
train, test = train_test_split(data, test_size=0.15, random_state=0)
print("Training size: {}; Test size: {}".format(len(train), len(test)))

features = ["Brand", "Size", "CPU", "RAM", "Resolution", "Class"]
x_train = train[features]
y_train = train["K=20"]
x_test = test[features]
y_test = test["K=20"]

# Note: this is a randomized search, not an exhaustive grid search.
# randint(10, 60) samples integers from 10 to 59.
param_dist = {
    "max_depth": list(range(10, 21)),
    "min_samples_leaf": randint(10, 60),
}

# Named dt so it does not shadow the sklearn.tree module.
dt = DecisionTreeClassifier()
tree_cv = RandomizedSearchCV(dt, param_dist, cv=5)
tree_cv.fit(x_train, y_train)
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is: {}".format(tree_cv.best_score_))

# Evaluate the refitted best estimator on the held-out test set.
y_pred = tree_cv.predict(x_test)
score = accuracy_score(y_test, y_pred) * 100
print("Accuracy using Decision Tree:", round(score, 1), "%")
The best accuracy from my first approach is better than the accuracy from the GridSearchCV approach.
Why is that?
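Do you know the best way to get the most accurate tree?

Why does this happen?
Without seeing your code, I can only speculate. It may come down to the granularity of the grid. If you are trying 50 combinations but there are billions of possible combinations, you are barely sampling the search space. Is there a way to narrow down the parameters you are searching over?

Do you know the best way to get the most accurate tree?
This is a hard question to answer, because you need to define accuracy. You can build a model that overfits your test data. Technically, the way to get the best tree is to try every possible combination of hyperparameters, but for any reasonable number of parameters that would take forever. In general, the best approach is to search the hyperparameter space with a Bayesian method, but you will get back a distribution for each parameter. My suggestion is to start with random search rather than grid search. If you are a big fan of Skopt, you can use BayesSearchCV. I recommend reading the code, because I think it is poorly documented.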
import pandas as pd
import numpy as np
import xgboost as xgb
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold

# SETTINGS - CHANGE THESE TO GET SOMETHING MEANINGFUL
ITERATIONS = 10  # 1000
TRAINING_SIZE = 100000  # 20000000
TEST_SIZE = 25000

# Classifier
bayes_cv_tuner = BayesSearchCV(
    estimator=xgb.XGBClassifier(
        n_jobs=1,
        objective='binary:logistic',
        eval_metric='auc',
        verbosity=0,  # replaces the deprecated silent=1
        tree_method='approx'
    ),
    search_spaces={
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'max_depth': (0, 50),
        'max_delta_step': (0, 20),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'colsample_bylevel': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'reg_alpha': (1e-9, 1.0, 'log-uniform'),
        'gamma': (1e-9, 0.5, 'log-uniform'),
        # The key appeared twice; the later value (0, 5) was the one in effect.
        'min_child_weight': (0, 5),
        'n_estimators': (50, 100),
        'scale_pos_weight': (1e-6, 500, 'log-uniform')
    },
    scoring='roc_auc',
    cv=StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=42
    ),
    n_jobs=3,
    n_iter=ITERATIONS,
    verbose=0,
    refit=True,
    random_state=42
)

# X and y are assumed to be a pandas DataFrame and Series defined elsewhere.
result = bayes_cv_tuner.fit(X.values, y.values)
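To make the granularity point concrete for the search in the question, here is a rough count of how much of that search space actually gets tried. It is a sketch under assumptions: the question does not set n_iter, so the RandomizedSearchCV default of 10 is assumed, and ParameterSampler is used only to show what such a draw of candidates looks like.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Search space copied from the question.
max_depth_values = list(range(10, 21))  # 11 values: 10..20
leaf_low, leaf_high = 10, 60            # randint(10, 60) draws integers 10..59
param_dist = {
    "max_depth": max_depth_values,
    "min_samples_leaf": randint(leaf_low, leaf_high),
}

# Size of the full space versus the number of candidates actually tried.
n_combinations = len(max_depth_values) * (leaf_high - leaf_low)
n_iter = 10  # assumed: the RandomizedSearchCV default, not set in the question
coverage = n_iter / n_combinations
print(f"{n_combinations} combinations, {n_iter} sampled ({coverage:.1%})")

# Draw the candidates the way a randomized search would.
samples = list(ParameterSampler(param_dist, n_iter=n_iter, random_state=0))
for candidate in samples:
    print(candidate)
```

With 11 depths and 50 leaf sizes there are 550 combinations, so ten candidates cover under 2% of them. Raising n_iter or narrowing the ranges are two ways to cover more of the space.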
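It depends on the parameter limits you specified for GridSearchCV.
The default parameter values of a decision tree created without any arguments are not within the ranges you specified manually. Choose a better set of parameters and try GridSearchCV again.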
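One way to follow this advice is to put the defaults of DecisionTreeClassifier (max_depth=None, min_samples_leaf=1) into the grid, so the search can never do worse in cross-validation than the untuned tree. The following is a minimal sketch, assuming synthetic data from make_classification in place of the original spreadsheet; the other grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the laptop data (an assumption, not the original file).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# The grid includes the defaults: max_depth=None and min_samples_leaf=1.
param_grid = {
    "max_depth": [None, 5, 10, 15, 20],
    "min_samples_leaf": [1, 5, 10, 20, 40],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

# Cross-validated accuracy of the untuned tree on the same folds.
default_cv = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5
).mean()

print("Best parameters:", search.best_params_)
print(f"Best CV score: {search.best_score_:.3f}")
print(f"Default tree CV score: {default_cv:.3f}")
```

Because the default configuration is one of the candidates, the best cross-validated score is at least the default tree's score on the same folds. The test-set accuracy can still differ, since it is measured on different data.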
Comments:
Please add your code, and preferably part of your data. Share your work. Make sure your search space contains the default hyperparameters of the first approach.
I added my code, @MaximilianPeters
I added my code, @SıddıkAçıl
The best tree parameters from the first approach are included in the parameter range of the second approach, @amalik2205