sklearn random forest: .oob_score_ too low?


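I was searching for applications of random forests and found a knowledge competition on Kaggle. Following a post about it, I used sklearn to build a random forest with 500 trees.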

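The .oob_score_ was about 2%, but the score on the holdout set was about 75%. There are only seven classes to classify, so 2% is really low. When I cross-validated, my scores were also consistently close to 75%.

Can anyone explain the discrepancy between .oob_score_ and the holdout/cross-validated scores? I would expect them to be similar.

A similar question has been asked before.

Edit: I think this might also be a bug.

The code is given by the original poster in the second link I posted. The only change is that you have to set oob_score=True when you build the random forest.

I did not save the cross-validation tests I ran, but I can redo them if people need to see them.

Q: Can anyone explain the discrepancy...

A: The sklearn.ensemble.RandomForestClassifier object and the .oob_score_ value observed on it are not a bug-related issue.

First, RandomForest-based predictors { Classifier | Regressor } belong to a rather specific corner of the so-called ensemble methods, so be aware that typical approaches, including cross-validation, do not work the same way as for other AI/ML learners.

The RandomForest inner logic relies heavily on a random process. The samples (dataset X) with known y == { labels (for a Classifier) | targets (for a Regressor) } are split throughout the forest generation: each tree is bootstrapped from a random sample of the dataset, so there is a part the tree can see and a part the tree will never see (which forms its inner out-of-bag subset).

Besides other effects on sensitivity to overfitting, the RandomForest ensemble arguably does not need to be cross-validated, because it is designed to resist overfitting. Many papers, as well as empirical evidence from Berkeley, support this statement, as they show that a cross-validated predictor will have about the same .oob_score_.

    import sklearn.ensemble

    # Parameter names and defaults as in scikit-learn 0.16; later releases
    # renamed or removed some of them (for example criterion = 'mse' and
    # max_features = 'auto').
    aRF_PREDICTOR = sklearn.ensemble.RandomForestRegressor(
        n_estimators             = 10,      # The number of trees in the forest.
        criterion                = 'mse',   # { Regressor: 'mse' | Classifier: 'gini' }
        max_depth                = None,
        min_samples_split        = 2,
        min_samples_leaf         = 1,
        min_weight_fraction_leaf = 0.0,
        max_features             = 'auto',
        max_leaf_nodes           = None,
        bootstrap                = True,
        oob_score                = False,   # SET True to get the inner, cross-validation-like
                                            # .oob_score_ attribute computed during the
                                            # training phase on the whole dataset
        n_jobs                   = 1,       # { 1 | n-cores | -1 == all-cores }
        random_state             = None,
        verbose                  = 0,
        warm_start               = False
        )

    # Attributes available after fitting:
    aRF_PREDICTOR.estimators_               # list of <DecisionTreeRegressor>: the collection of fitted sub-estimators.
    aRF_PREDICTOR.feature_importances_      # array of shape = [n_features]: the feature importances (the higher, the more important the feature).
    aRF_PREDICTOR.oob_score_                # float: score of the training dataset obtained using an out-of-bag estimate.
    aRF_PREDICTOR.oob_prediction_           # array of shape = [n_samples]: prediction computed with the out-of-bag estimate on the training set.

    # Methods (documentation-style signatures, square brackets mark optional arguments):
    aRF_PREDICTOR.apply(         X )                    # Apply trees in the forest to X, return leaf indices.
    aRF_PREDICTOR.fit(           X, y[, sample_weight] )  # Build a forest of trees from the training set (X, y).
    aRF_PREDICTOR.fit_transform( X[, y] )               # Fit to data, then transform it.
    aRF_PREDICTOR.get_params(    [deep] )               # Get parameters for this estimator.
    aRF_PREDICTOR.predict(       X )                    # Predict regression target for X.
    aRF_PREDICTOR.score(         X, y[, sample_weight] )  # Returns the coefficient of determination R^2 of the prediction.
    aRF_PREDICTOR.set_params(    **params )             # Set the parameters of this estimator.
    aRF_PREDICTOR.transform(     X[, threshold] )       # Reduce X to its most important features.

Comments:

This question appears to be off-topic because it is about statistics, not programming.

Hmm, this sounds a bit like a bug :-/. Could you post your code somewhere?

Thanks. My point was that the oob_score_ is very low relative to the CV or holdout scores. In any case, I can no longer reproduce the problem. All the scores are about the same now.

By the way, which scikit-learn version are you using? Versions before 0.15 had rather different inner workings. Given the recent 0.16.1-stable (and hopefully also 0.17.0-wet-ink), your .oob_score_ results for forests with a few hundred trees should be reasonable.

I do not remember which version I was using when I wrote this. I was probably close to the latest version at the time. I reran the test using 0.16.1.

Ok. Glad to know. How many examples are there in the dataset?

RandomForest learners are a very notable exception: their .oob_score_ is computed over the whole dataset. This makes it possible to use the entire dataset in the training phase, while the inner mechanics handle the OOB samples through the random splits of the dataset used to grow the trees. For more on this, see the academic papers and the scikit-learn documentation. I am still experimenting with the tree depth and maximum-number parameters, as the Kaggle article on keeping the trees "diverse" points out.

"If the .oob_score_ is between 0.51 and 0.53, then your ensemble is 1%-3% better than random guessing." He said this is a seven-class classification problem. Random guessing is about 0.14. An oob_score_ of 0.02 is far worse than random.

    aRF_PREDICTOR.oob_score_
    Out[79]: 0.638801                       # n_estimators = 10

    aRF_PREDICTOR.oob_score_
    Out[89]: 0.789612                       # n_estimators = 100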
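One way to check whether .oob_score_ really diverges from the holdout and cross-validated scores is to compute all three side by side. The following is only a minimal sketch, not the code from the thread: it assumes a recent scikit-learn release (one that provides sklearn.model_selection) and uses a synthetic seven-class dataset from make_classification as a stand-in for the Kaggle data. The variable names and the dataset sizes are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic seven-class problem (an assumption; NOT the Kaggle dataset).
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=10,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=0,
)

# Keep a holdout set that the forest never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# oob_score=True asks the forest to score each training sample using
# only the trees whose bootstrap sample did not contain that sample.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
holdout_accuracy = forest.score(X_test, y_test)

# 5-fold cross-validated accuracy of an identically configured forest.
cv_accuracy = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_train,
    y_train,
    cv=5,
).mean()

chance_level = 1.0 / 7  # uniform guessing over seven balanced classes

print(f"OOB accuracy:     {oob_accuracy:.3f}")
print(f"Holdout accuracy: {holdout_accuracy:.3f}")
print(f"5-fold CV mean:   {cv_accuracy:.3f}")
print(f"Chance level:     {chance_level:.3f}")
```

On data like this the three estimates would be expected to land close to each other and well above the chance level. An OOB score far below chance, as reported in the question, therefore points to something other than ordinary estimator noise, for instance the library version in use.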