Machine learning: High variance of a random forest learner

machine-learning scikit-learn

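I am using a random forest regressor to fit a 10-dimensional regression problem with roughly 300,000 samples. Although it is not necessary for random forests, I first put the data on the same scale (using sklearn's preprocessing) and then ran a randomized search over the following parameter space: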

    n_estimators = [int(x) for x in linspace(start=100, stop=2000, num=11)]
    max_features = auto, sqrt
    max_depth = from 1 to 150 with step = 11
    min_samples_split = 2, 5, 10, 12
    min_samples_leaf = 1, 2, 4, 6
    bootstrap = True or False

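After obtaining the best parameters, I also ran a second search over a narrower range. Even though I use a 10-fold cross-validation scheme together with the randomized search, I still have a severe overfitting problem! I also tried checking for outliers with the DBSCAN algorithm, but after excluding some parts of the dataset I got even worse results. Should I include other random forest parameters in the randomized search? Or should I apply more preprocessing to the dataset before fitting?

For convenience, here is the implementation I wrote: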

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit

    # Candidate values for each hyperparameter.
    n_estimators = [int(x) for x in np.linspace(start=1, stop=15, num=15)]
    # 'auto' meant "all features" for regressors and was removed in
    # scikit-learn 1.3; 1.0 is the equivalent setting.
    max_features = [1.0, 'sqrt']
    max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
    max_depth.append(None)  # None puts no limit on tree depth
    min_samples_split = [2, 5, 10, 12]
    min_samples_leaf = [1, 2, 4, 6]
    bootstrap = [True, False]

    # 10 random splits, each holding out 1% of the training data for validation.
    cv = ShuffleSplit(n_splits=10, test_size=0.01, random_state=0)

    random_grid = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf,
                   'bootstrap': bootstrap}

    rf = RandomForestRegressor()
    # Try 50 random combinations from the grid, using 32 parallel jobs.
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                                   n_iter=50, cv=cv, verbose=2, random_state=42,
                                   n_jobs=32)
    # x_train, y_train: training features and targets, defined elsewhere.
    rf_random.fit(x_train, y_train)

The best parameters returned by the randomized search: bootstrap: False, min_samples_leaf = 2, n_estimators = 1647, max_features: sqrt, min_samples_split = 3, max_depth: None.

The target ranges from 0 to 10000 [unit]. The model achieves an RMSE of 6.98 [unit] on the training set and an average RMSE of 67.54 [unit] on the test set.
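To make the train/test gap concrete, here is a minimal sketch of how the two RMSE values can be computed and put side by side. The data is a synthetic stand-in generated with `make_regression` (an assumption, since the real dataset is not shown), and the forest settings are placeholders rather than the tuned ones.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 10 features, noisy target.
X, y = make_regression(n_samples=1000, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Placeholder forest, not the tuned model from the post.
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


train_rmse = rmse(y_train, rf.predict(X_train))
test_rmse = rmse(y_test, rf.predict(X_test))
target_range = float(y.max() - y.min())

print(f"train RMSE: {train_rmse:.2f}")
print(f"test RMSE:  {test_rmse:.2f}")
print(f"test/train ratio: {test_rmse / train_rmse:.2f}")
print(f"test RMSE as share of target range: {test_rmse / target_range:.2%}")
```

With the numbers reported above, the same two ratios would be 67.54 / 6.98 ≈ 9.7 and 67.54 / 10000 ≈ 0.7% of the target range.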

    max_depth = from 1 to 150 with step = 11

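For a 10-feature problem, the optimal depth is somewhere below 10. Because of that, you are simply overfitting. Consider searching max_depth from 1 to 15 with a step of 1.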
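The narrower depth search suggested here could be sketched as follows. This is only an illustration on synthetic data (an assumption), with a small forest and 3-fold CV so it runs quickly. `return_train_score=True` is switched on so the train and validation RMSE can be compared at each depth.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (assumption).
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

# max_depth from 1 to 15 with step 1, as suggested in the answer.
param_grid = {"max_depth": list(range(1, 16))}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE
    return_train_score=True,
)
search.fit(X, y)

results = search.cv_results_
for depth, train_score, val_score in zip(
    results["param_max_depth"],
    results["mean_train_score"],
    results["mean_test_score"],
):
    # Flip the sign back so the printed values are plain RMSE.
    print(f"max_depth={int(depth):2d}  train RMSE={-train_score:8.2f}  "
          f"validation RMSE={-val_score:8.2f}")

print("best max_depth:", search.best_params_["max_depth"])
```

If the validation RMSE stops improving while the train RMSE keeps dropping as depth grows, the extra depth is mostly fitting noise.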

    min_samples_split = 2, 5, 10, 12
    min_samples_leaf = 1, 2, 4, 6

These will help reduce variance. However, a step of 11 for max_depth will kill all the effort you put into them.
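As a rough illustration of how min_samples_leaf trades training fit for lower variance, the sketch below fits one forest per leaf size from the question and prints the train and test RMSE for each. The data is synthetic (an assumption) and the forest is kept small, so the exact numbers mean nothing for the real problem.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption).
X, y = make_regression(n_samples=1000, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

leaf_sizes = [1, 2, 4, 6]  # the values listed in the question
results = {}
for leaf in leaf_sizes:
    model = RandomForestRegressor(
        n_estimators=50, min_samples_leaf=leaf, random_state=0
    )
    model.fit(X_train, y_train)
    train_rmse = float(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
    test_rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    results[leaf] = (train_rmse, test_rmse)
    print(f"min_samples_leaf={leaf}: train RMSE={train_rmse:.2f}, "
          f"test RMSE={test_rmse:.2f}")
```

Larger leaves average over more training samples per prediction, so the training error rises. Whether the test error improves depends on the data.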

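Thanks for your answer. I tried tuning again with the range you specified, but that did not produce a better model. It is still overfitting :/

Could you show the results and why you think you are overfitting? Your code for the 10-fold would be appreciated as well, because right now I don't know what the problem is.

The range of the target (the value I want to predict) is from 0 to 10000 [unit]. The model achieves an RMSE of 6.98 [unit] on the training set and an average RMSE of 67.54 [unit] on the test set. To implement the 10-fold I just used sklearn's ShuffleSplit. For convenience, I added my implementation and the best parameters to the post above.

Can you try rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=50, cv=10, verbose=2, random_state=42, n_jobs=32)?

Same result :/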
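For context on the cv=10 suggestion, this sketch compares the splits produced by the ShuffleSplit configuration from the question with a plain 10-fold KFold. It uses a dummy array of 1000 rows (an assumption, just to count indices) and fits no model.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

# Dummy data: only the number of rows matters for the split sizes.
X = np.zeros((1000, 10))

# Same settings as in the question: 10 random splits, 1% held out each time.
shuffle_cv = ShuffleSplit(n_splits=10, test_size=0.01, random_state=0)
# Standard 10-fold: every sample lands in exactly one validation fold.
kfold_cv = KFold(n_splits=10, shuffle=True, random_state=0)

shuffle_val = [val_idx for _, val_idx in shuffle_cv.split(X)]
kfold_val = [val_idx for _, val_idx in kfold_cv.split(X)]

shuffle_sizes = [len(v) for v in shuffle_val]
kfold_sizes = [len(v) for v in kfold_val]

# How many distinct samples are ever used for validation?
shuffle_covered = len(np.unique(np.concatenate(shuffle_val)))
kfold_covered = len(np.unique(np.concatenate(kfold_val)))

print("ShuffleSplit validation sizes:", shuffle_sizes)
print("KFold validation sizes:       ", kfold_sizes)
print("distinct validation samples, ShuffleSplit:", shuffle_covered)
print("distinct validation samples, KFold:       ", kfold_covered)
```

With test_size=0.01, each ShuffleSplit validation set is ten times smaller than a KFold fold, and the ten splits together touch at most 10% of the data, so the validation scores are noisier than with a true 10-fold scheme.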