How to interpret the base value of a GBT classifier when using SHAP?


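I recently discovered the shap library. I decided to build a simple xgboost classifier using sklearn and draw a `force_plot`.

To help read the plot, the library says:

"The above explanation shows features each contributing to push the model output from the base value (the average model output over the training dataset we passed) to the model output. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue (these force plots are introduced in our Nature BME paper)."

So it looks to me as if the base value should be the same as `clf.predict(X_train).mean()`, which equals 0.637. However, that is not what the plot shows: the number is not even within [0,1]. I tried taking the log in different bases (10, e, 2), assuming it was some kind of monotonic transformation, but still no luck. How can I arrive at this base value?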

!pip install shap

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train.values.ravel())  # pass 1-D labels to avoid a DataConversionWarning

print(clf.predict(X_train).mean())

# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:])

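To get the `base_value` in raw space (when `link="identity"`), you need to unwind class labels --> to probabilities --> to raw scores. Note that the default loss is "deviance", so the raw score is the inverse sigmoid (logit) of the probability: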

import numpy as np

# probabilities of class 1
y = clf.predict_proba(X_train)[:,1]
# raw scores (log-odds), matching the default link="identity"
y_raw = np.log(y/(1-y))
# expected raw score
print(np.mean(y_raw))
print(np.isclose(explainer.expected_value, np.mean(y_raw), 1e-12))

Output:

2.065861773054686
[ True]
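The raw scores can also be read straight from the model instead of being recovered from probabilities. The sketch below assumes that, for a binary `GradientBoostingClassifier`, `decision_function` returns the log-odds that `predict_proba` passes through the sigmoid, and it checks that assumption numerically. It refits the same model on 1-D labels and does not need shap; the names `p`, `raw_direct`, and `p_from_raw` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)

# Probability of class 1 for every training row.
p = clf.predict_proba(X_train)[:, 1]

# Raw scores (log-odds) taken directly from the model.
raw_direct = clf.decision_function(X_train)

# Applying the sigmoid to the raw scores should reproduce the probabilities.
p_from_raw = 1.0 / (1.0 + np.exp(-raw_direct))
print(np.allclose(p_from_raw, p))

# The mean raw score is the quantity compared with explainer.expected_value above.
print(raw_direct.mean())
```

Reading the scores from `decision_function` avoids the division in `y/(1-y)`, which becomes numerically fragile when a predicted probability is very close to 0 or 1.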

Relevant plot for the 0th data point in raw space:

shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="identity")

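To switch to sigmoid probability space, pass `link="logit"` to `force_plot` (as in the complete example below); the plot for the 0th data point is then drawn in probability space.

Note that the probability `base_value` from shap's perspective, which they call the baseline probability when no data is available, is not what a reasonable person would define as the probability in the absence of independent variables (0.6373626373 in this case).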
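To make the gap between the two baselines concrete, the sketch below takes the raw base value reported in this answer, maps it through the sigmoid, and compares it with the mean predicted label from the question. The contributions in `shap_row` are made-up numbers, used only to illustrate how a raw-space sum is displayed on a probability axis; that `link="logit"` amounts to applying the sigmoid to raw values is my reading of the plots, not something the library text above states.

```python
import math


def expit(z):
    """Sigmoid: map a raw log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


base_value_raw = 2.065861773054686  # raw base value reported in the answer
mean_label = 0.6373626373           # clf.predict(X_train).mean() from the question

# Base value as it appears on the probability axis.
base_value_prob = expit(base_value_raw)
print(base_value_prob)

# The two candidate baselines do not coincide.
print(base_value_prob - mean_label)

# Hypothetical SHAP values for one row (illustrative, not from the model).
shap_row = [0.8, -0.3, 1.2]

# In raw space the explanation is additive: base value plus contributions.
raw_output = base_value_raw + sum(shap_row)

# On a probability axis the same point is shown after the sigmoid.
print(expit(raw_output))
```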

Complete reproducible example:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
print(shap.__version__)

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train.values.ravel())

# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(clf, model_output="raw")
shap_values = explainer.shap_values(X_train)

from scipy.special import expit, logit
# probabilities of class 1
y = clf.predict_proba(X_train)[:,1]
# expected raw base value (mean log-odds)
y_raw = logit(y).mean()
print("Expected raw score (before sigmoid):", y_raw)
# expected probability, i.e. base value in probability space
print("Expected probability:", expit(y_raw))

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")

Output:

0.36.0
Expected raw score (before sigmoid): 2.065861773054686
Expected probability: 0.8875405774316522

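Great! This makes much more sense now. Could you add a link about the deviance loss to your answer? I would like to understand the actual formula xgboost uses and why the sigmoid is inverted.

@G.Macia You keep referring to xgboost, while your question is about the GBT classifier in scikit-learn (I edited your title), whose default setting is `loss='deviance'`.

On whether the plot should use `shap_values[0,:]` or `shap_values[1,:]`: my understanding is that the `0` or `1` is the row index of the data point of interest, and you are right, because here you kept only the `[:,1]` element (i.e. the probability of class 1). As for `expected_value`, it should be the model's average prediction over the underlying dataset (straightforward for regression, perhaps less so here), rather than a value for when no data is available. Still, I agree that this is not the baseline most people would have in mind (excellent answer, by the way; sorry I can't upvote twice).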
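The point about averaging can be illustrated without any model. The sketch below uses a handful of made-up class-1 probabilities (not taken from the breast cancer data) and shows that averaging in raw log-odds space and then applying the sigmoid gives a different number from averaging the probabilities directly. The helper names `logit` and `expit` mirror the scipy functions used in the answer but are defined locally here.

```python
import math


def logit(p):
    """Inverse sigmoid: map a probability in (0, 1) to a raw log-odds score."""
    return math.log(p / (1.0 - p))


def expit(z):
    """Sigmoid: map a raw log-odds score back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical predicted probabilities for class 1 (illustrative values only).
probs = [0.99, 0.95, 0.90, 0.10, 0.02]

# Average taken directly in probability space.
mean_prob = sum(probs) / len(probs)

# Average taken in raw (log-odds) space, the way the raw base value is formed.
mean_raw = sum(logit(p) for p in probs) / len(probs)

# Mapping the raw average back through the sigmoid does not recover mean_prob.
base_value_prob = expit(mean_raw)

print(mean_prob)
print(mean_raw)
print(base_value_prob)
```

Because the sigmoid is nonlinear, the two averages agree only in special cases, so a base value computed in raw space generally will not match an average computed over labels or probabilities.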
