
Python: predict_proba equivalent for DecisionTreeRegressor

Tags: python, scikit-learn, regression, prediction, decision-tree

scikit-learn's DecisionTreeClassifier supports predicting the probability of each class via the predict_proba() function. No such method exists on DecisionTreeRegressor:

AttributeError: 'DecisionTreeRegressor' object has no attribute 'predict_proba'

My understanding is that the underlying mechanics of decision tree classifiers and regressors are quite similar, the main difference being that a regressor's prediction is computed as the mean of the values in the leaf it lands in. So I would expect it to be possible to extract a probability for each value.


Is there another way to simulate this, e.g. by processing the tree structure? The code behind DecisionTreeClassifier's predict_proba is not directly transferable.

You can get this data from the tree structure:

import sklearn
import numpy as np
import pandas as pd
import graphviz
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# Generate a simple dataset
X, y = make_regression(n_features=2, n_informative=2, random_state=0)
clf = DecisionTreeRegressor(random_state=0, max_depth=2)
clf.fit(X, y)
# Visualize the tree
graphviz.Source(sklearn.tree.export_graphviz(clf)).view()

If you call clf.apply(X), you get the id of the node each instance falls into:

array([6, 5, 6, 3, 2, 5, 5, 3, 6, ... 5, 5, 6, 3, 2, 2, 5, 2, 2], dtype=int64)
Merge this together with the target variable:

df = pd.DataFrame(np.vstack([y, clf.apply(X)]), index=['y','node_id']).T
    y           node_id
0   190.370562  6.0
1   13.339570   5.0
2   141.772669  6.0
3   -3.069627   3.0
4   -26.062465  2.0
5   54.922541   5.0
6   25.952881   5.0
       ...
Now if you do a groupby on node_id followed by mean, you get the same values as clf.predict(X), which are the values stored in the leaves of our tree:

>>> clf.tree_.value[6]
array([[184.00566679]])
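
You can check this claim directly; a quick sketch, reusing df, clf, and X from above:

# Sketch: the per-leaf mean of y reproduces clf.predict(X)
leaf_means = df.groupby('node_id').y.mean()
reconstructed = leaf_means.loc[df['node_id']].to_numpy()
assert np.allclose(reconstructed, clf.predict(X))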
To get the node ids for a new dataset, you need to call

clf.decision_path(X[:5]).toarray()

which shows an array like this:

array([[1, 0, 0, 0, 1, 0, 1],
       [1, 0, 0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1, 0, 1],
       [1, 1, 0, 1, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0]], dtype=int64)
You need to take the position of the last nonzero element in each row, which is the leaf.

So if instead of predicting the mean you wanted to predict the median, you would do:

>>> pd.DataFrame(clf.decision_path(X[:5]).toarray()).apply(lambda x: x.nonzero()[0].max(
    ), axis=1).to_frame(name='node_id').join(df.groupby('node_id').median(), on='node_id')['y']
0    181.381106
1     54.053170
2    181.381106
3    -28.591188
4    -93.891889
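
The same trick generalizes to any per-leaf statistic. For instance (a sketch of my own, not from the original answer; the 5%/95% levels are arbitrary), per-leaf quantiles give a crude prediction interval:

# Sketch: per-leaf 5th and 95th percentiles as a crude prediction interval
leaf_ids = pd.DataFrame(clf.decision_path(X[:5]).toarray()).apply(
    lambda x: x.to_numpy().nonzero()[0].max(), axis=1).to_frame(name='node_id')
intervals = df.groupby('node_id').y.quantile([0.05, 0.95]).unstack()
print(leaf_ids.join(intervals, on='node_id'))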

This function adapts the code above to provide the probability of each outcome:

from sklearn.tree import DecisionTreeRegressor
import pandas as pd

def decision_tree_regressor_predict_proba(X_train, y_train, X_test, **kwargs):
    """Trains DecisionTreeRegressor model and predicts probabilities of each y.

    Args:
        X_train: Training features.
        y_train: Training labels.
        X_test: New data to predict on.
        **kwargs: Other arguments passed to DecisionTreeRegressor.

    Returns:
        DataFrame with columns for record_id (row of X_test), y
        (predicted value), and prob (of that y value).
        The sum of prob equals 1 for each record_id.
    """
    # Train model.
    m = DecisionTreeRegressor(**kwargs).fit(X_train, y_train)
    # Get y values corresponding to each node.
    node_ys = pd.DataFrame({'node_id': m.apply(X_train), 'y': y_train})
    # Calculate probability as 1 / number of y values per node.
    node_ys['prob'] = 1 / node_ys.groupby('node_id').y.transform('count')
    # Aggregate per node-y, in case of multiple training records with the same y.
    node_ys_dedup = node_ys.groupby(['node_id', 'y']).prob.sum().to_frame()\
        .reset_index()
    # Extract predicted leaf node for each new observation.
    leaf = pd.DataFrame(m.decision_path(X_test).toarray()).apply(
        lambda x: x.to_numpy().nonzero()[0].max(), axis=1).to_frame(
            name='node_id')
    leaf['record_id'] = leaf.index
    # Merge with y values and drop node_id.
    return leaf.merge(node_ys_dedup, on='node_id').drop(
        'node_id', axis=1).sort_values(['record_id', 'y'])
Example (using the Boston housing dataset):
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
# Note: load_boston was removed in scikit-learn 1.2; run this with an older
# version or substitute another regression dataset.
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Works better with min_samples_leaf > 1.
res = decision_tree_regressor_predict_proba(X_train, y_train, X_test,
                                            random_state=0, min_samples_leaf=5)
res[res.record_id == 2]
#      record_id       y        prob
#   25         2    20.6    0.166667
#   26         2    22.3    0.166667
#   27         2    22.7    0.166667
#   28         2    23.8    0.333333
#   29         2    25.0    0.166667
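
As a quick consistency check (my addition, not part of the original answer): the probabilities sum to 1 for each record, and the probability-weighted mean of y reproduces the tree's ordinary point prediction:

# Probabilities sum to 1 for each record (up to floating-point rounding).
assert (res.groupby('record_id').prob.sum().round(8) == 1).all()
# The probability-weighted mean of y per record equals the mean of the
# training targets in the predicted leaf, i.e. the usual point prediction.
point_pred = (res.y * res.prob).groupby(res.record_id).sum()
print(point_pred.head())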

Comments:

By the way, I'm aware of this; what I want is to quantify the difference between a random forest and its trees in order to produce prediction intervals.

Thanks, this code was really useful. I adapted it to get the probability of each value.

nonzero() has since been deprecated; the fix is to change it to .to_numpy().nonzero().