Python: How can I use GridSearchCV to test for overfitting in cross-validated regression?

python machine-learning scikit-learn

I am fitting a regression model on a set of continuous features and a continuous target. Here is my code:

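from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline

def run_RandomForest(xTrain, yTrain, xTest, yTest):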
  cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

  # define the pipeline to evaluate
  model = RandomForestRegressor()
  fs = SelectKBest(score_func=mutual_info_regression)
  pipeline = Pipeline(steps=[('sel',fs), ('rf', model)])

  # define the grid
  # (the number of selected features, sel__k, is tuned together with the forest)
  grid = {
      'sel__k': list(range(1, xTrain.shape[1] + 1)),
      'rf__bootstrap': [True, False],
      'rf__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
      # 1.0 = use all features; the old 'auto' alias was removed in scikit-learn 1.3
      'rf__max_features': [1.0, 'sqrt'],
      'rf__min_samples_leaf': [1, 2, 4],
      'rf__min_samples_split': [2, 5, 10],
      'rf__n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
  }
  search = GridSearchCV(
        pipeline,
        param_grid=grid,
        scoring='neg_mean_squared_error',
        return_train_score=True,
        verbose=1,
        cv=cv,  # use the RepeatedKFold splitter defined above
        n_jobs=-1)

  # perform the fitting
  results = search.fit(xTrain, yTrain)

  # predict prices of X_test
  y_pred = results.predict(xTest)
  return results, y_pred

results, y_pred = run_RandomForest(x_train, y_train, x_test, y_test)

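I would like to know whether this model is overfitting. I have read that incorporating cross-validation is an effective way to check this.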

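As you can see, I have incorporated cv into the code above. However, I am completely stuck on the next step. Could someone show me code that takes the cross-validation results and produces a plot or a set of statistics I can use to analyze overfitting? I know there are a couple of similar questions already, but I don't see how either of them translates to my case: in both examples they simply initialize a model and fit it, whereas mine involves GridSearchCV.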

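You can, of course, tune the hyperparameter that controls how many features are randomly selected when growing each tree from the bootstrapped data. This is typically done with k-fold cross-validation: choose the value of the tuning parameter that minimizes the prediction error on the held-out folds. In addition, growing a larger forest will improve predictive accuracy, although the returns usually diminish once you have grown several hundred trees.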
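One way to turn this into an overfitting check with GridSearchCV is to compare the training and validation scores it stores in cv_results_. The following is only a minimal sketch, assuming synthetic data from make_regression and a deliberately tiny grid (both are stand-ins, not the asker's data or grid); the key ingredient is return_train_score=True, which the question's code already sets.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption: replace with your own xTrain / yTrain).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 5, None], "min_samples_leaf": [1, 4]},
    scoring="neg_mean_squared_error",
    return_train_score=True,  # required for the mean_train_score entries
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

res = search.cv_results_
# Scores are negated MSE, so flip the sign to get errors.
train_mse = -res["mean_train_score"]
val_mse = -res["mean_test_score"]
gap = val_mse - train_mse  # a large positive gap hints at overfitting

for params, tr, va, g in zip(res["params"], train_mse, val_mse, gap):
    print(f"{params}: train MSE={tr:.1f}, validation MSE={va:.1f}, gap={g:.1f}")

best = search.best_index_
print("best candidate:", res["params"][best], "gap:", round(float(gap[best]), 1))
```

A candidate whose training error is far below its validation error is fitting the training folds much better than unseen data. The same train_mse and val_mse arrays could be passed to a plotting library if a figure is preferred.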

Try this sample code:

from pprint import pprint

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42)

# Look at parameters used by our current forest
pprint(rf.get_params())

Result:

{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}

And also:

from pprint import pprint

import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
# (note: 'auto' was removed in scikit-learn 1.3; use 1.0 there instead)
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)

Result:

{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

See this link for more information.
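A grid like the one above is typically passed to RandomizedSearchCV, which samples a fixed number of combinations rather than trying all of them. Here is a rough sketch of that step; the synthetic data, the shrunken small_grid, and the n_iter and cv values are all assumptions chosen so the example runs in seconds, and they are not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: replace with your own training set).
X, y = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=0)

# Deliberately small grid (assumption) with the same keys as a full random grid.
small_grid = {
    "n_estimators": [20, 50],
    "max_features": [1.0, "sqrt"],
    "max_depth": [3, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

rf_random = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=small_grid,
    n_iter=5,                 # number of sampled parameter combinations
    cv=3,                     # 3-fold cross-validation per combination
    scoring="neg_mean_squared_error",
    return_train_score=True,  # keep training scores for an overfitting check
    random_state=42,
)
rf_random.fit(X, y)

print(rf_random.best_params_)
print("validation MSE of best candidate:", -rf_random.best_score_)
```

Because return_train_score=True is set, rf_random.cv_results_ also holds mean_train_score, so the training and validation errors of each sampled candidate can be compared side by side.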

Here is some sample code for running cross-validation:

# import cross_validate, the iris data, and a random forest classifier
from sklearn.model_selection import cross_validate
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

# get iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target


model = RandomForestClassifier(random_state=1)
cv = cross_validate(model, X, y, cv=5)
print(cv)
print(cv['test_score'])
print(cv['test_score'].mean())

Result:

{'fit_time': array([0.18350697, 0.14461398, 0.14261866, 0.13116884, 0.15478826]), 'score_time': array([0.01496148, 0.00997281, 0.00897574, 0.00797844, 0.01396227]), 'test_score': array([0.96666667, 0.96666667, 0.93333333, 0.96666667, 1.        ])}
[0.96666667 0.96666667 0.93333333 0.96666667 1.        ]
0.9666666666666668
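The example above is a classifier on iris, while the question is about regression. A comparable sketch for a regressor might look like the following, assuming synthetic data and R^2 as the metric (both are my choices for illustration). Passing return_train_score=True makes cross_validate report the score on the training folds as well, which is what an overfitting check needs.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: replace with your own X and y).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=1)

model = RandomForestRegressor(n_estimators=50, random_state=1)
cv = cross_validate(model, X, y, cv=5, scoring="r2", return_train_score=True)

# Per-fold scores on the training folds and on the held-out fold.
print("train R^2 per fold:", cv["train_score"])
print("test  R^2 per fold:", cv["test_score"])

# A training score far above the test score suggests overfitting.
mean_gap = cv["train_score"].mean() - cv["test_score"].mean()
print("mean train/test gap:", mean_gap)
```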

How cross-validation works internally:

Shuffle the dataset to remove any ordering.
Split the data into K folds. K = 5 or 10 works for most cases.
Keep one fold for testing and use all the remaining folds for training.
Train (fit) the model on the training set, test (evaluate) it on the test set, and note down the result for that split.
Repeat this process for all the folds, each time choosing a different fold as the test data.
This way, in every iteration the model is trained and tested on different sets of data.
At the end, sum up the scores from each split and divide by K to get the mean score.
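The steps above can be sketched as an explicit loop. This is only an illustration of the idea, assuming synthetic data, a small random forest, and mean squared error as the score; in practice cross_validate or GridSearchCV runs this loop for you.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: replace with your own X and y).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Shuffle the data and split it into K = 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mse = []
for train_idx, test_idx in kf.split(X):
    # Train on K-1 folds, evaluate on the one held-out fold.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_mse.append(mean_squared_error(y[test_idx], pred))

# Average the per-fold scores to get the cross-validated estimate.
mean_mse = float(np.mean(fold_mse))
print("MSE per fold:", fold_mse)
print("mean MSE:", mean_mse)
```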

Thank you very much for your help, I really appreciate it!