scikit-learn: How to extract feature importances from an sklearn Pipeline
I built a pipeline in scikit-learn with two steps: the first constructs features and the second is a RandomForestClassifier. I can save the pipeline and inspect the individual steps and the parameters set on each of them, but I would also like to examine the feature importances of the resulting model. Is that possible?
Ah, yes, it is. You can list the steps to find the estimator you want to inspect. For example:
pipeline.steps[1]
returns:
('predictor',
 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                        max_depth=None, max_features='auto', max_leaf_nodes=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=2,
                        oob_score=False, random_state=None, verbose=0,
                        warm_start=False))
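You can then access the model step directly:
pipeline.steps[1][1].feature_importances_
I wrote an article on how to do this. In general, for a Pipeline you can access the named_steps attribute, which gives you each transformer in the pipeline. For example, for this pipeline: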
model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("transformer", TfidfTransformer()),
    ("classifier", classifier),
])
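As a concrete, runnable version of a pipeline like this one, the sketch below fits it on a tiny made-up corpus and pairs each column name with the importance the classifier assigned to it. The corpus, the labels, and the choice of RandomForestClassifier for `classifier` are assumptions for illustration only. The names are read from the vectorizer's `vocabulary_` mapping, which avoids depending on which `get_feature_names*` method the installed scikit-learn version provides.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Tiny made-up corpus and labels, purely for illustration.
docs = ["best film ever", "worst film ever", "awful plot", "best plot twist"]
labels = [1, 0, 0, 1]

model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("transformer", TfidfTransformer()),
    # Assumed classifier; the original leaves `classifier` unspecified.
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
model.fit(docs, labels)

# vocabulary_ maps each term to its column index; sorting by that index
# yields the terms in column order.
vectorizer = model.named_steps["vectorizer"]
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# TfidfTransformer rescales columns but does not reorder them, so the
# classifier's importances line up with the vectorizer's columns.
importances = model.named_steps["classifier"].feature_importances_

ranked = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

On a corpus this small the scores themselves are not meaningful; the point is only the alignment between names and importances.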
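We can access an individual feature step with model.named_steps["transformer"].get_feature_names()
This returns the list of feature names from the TfidfTransformer. (scikit-learn 1.0 renamed this method to get_feature_names_out(), and get_feature_names() was removed in 1.2.) That is all well and good, but it does not cover many use cases, because we usually want to combine several features. Take this model, for example: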
model = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        ("h1", TfidfVectorizer(vocabulary={"worst": 0})),
        ("h2", TfidfVectorizer(vocabulary={"best": 0})),
        ("h3", TfidfVectorizer(vocabulary={"awful": 0})),
        ("tfidf_cls", Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", TfidfTransformer()),
        ])),
    ])),
    ("classifier", classifier),
])
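Here we combine several features using a FeatureUnion and a sub-pipeline. To access these features, we have to call each named step explicitly, in order. For example, to get the TF-IDF features from the inner pipeline, we would have to write:
model.named_steps["union"].transformer_list[3][1].named_steps["transformer"].get_feature_names()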
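To make the chain of lookups less opaque, here is a minimal sketch of reaching a nested step inside a FeatureUnion. It uses a shortened union with only two branches, so the inner pipeline sits at index 1 rather than 3, and the corpus is made up. Since `transformer_list` is a list of `(name, transformer)` tuples, a branch can be picked by position or, perhaps more readably, by name through a dict.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import FeatureUnion, Pipeline

# Made-up corpus, purely for illustration.
docs = ["best film ever", "worst film ever", "awful plot"]

# Shortened version of the union above: one fixed-vocabulary branch
# plus the nested TF-IDF pipeline.
union = FeatureUnion(transformer_list=[
    ("h1", TfidfVectorizer(vocabulary={"worst": 0})),
    ("tfidf_cls", Pipeline([
        ("vectorizer", CountVectorizer()),
        ("transformer", TfidfTransformer()),
    ])),
])
union.fit(docs)

# Look the inner pipeline up by position...
inner_by_position = union.transformer_list[1][1]
# ...or by name, which does not break if branches are reordered.
inner_by_name = dict(union.transformer_list)["tfidf_cls"]
assert inner_by_position is inner_by_name

# The terms of the inner branch, in column order.
inner_vectorizer = inner_by_name.named_steps["vectorizer"]
inner_terms = sorted(inner_vectorizer.vocabulary_, key=inner_vectorizer.vocabulary_.get)
print(inner_terms)
```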
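That is a bit of a headache, but it is doable. What I usually do is use a variation of the following snippet. The code below simply treats the set of Pipelines and FeatureUnions as a tree and performs a DFS, combining the feature names as it goes.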
from typing import List

from sklearn.pipeline import FeatureUnion, Pipeline

def get_feature_names(model, names: List[str], name: str) -> List[str]:
    """This method extracts the feature names in order from a Sklearn Pipeline

    This method only works with composed Pipelines and FeatureUnions. It will
    pull out all names using DFS from a model.

    Args:
        model: The model we are interested in
        names: The list of names of final featurization steps
        name: The current name of the step we want to evaluate.

    Returns:
        feature_names: The list of feature names extracted from the pipeline.
    """
    # Check if the name is one of our feature steps. This is the base case.
    if name in names:
        # If it has the named_steps attribute it's a pipeline and we need to access the features
        if hasattr(model, "named_steps"):
            return extract_feature_names(model.named_steps[name], name)
        # Otherwise get the feature directly
        else:
            return extract_feature_names(model, name)
    elif type(model) is Pipeline:
        feature_names = []
        # Recurse into each step of the pipeline, in order.
        for step_name, step in model.named_steps.items():
            feature_names += get_feature_names(step, names, step_name)
        return feature_names
    elif type(model) is FeatureUnion:
        feature_names = []
        # Recurse into each branch of the union, in order.
        for branch_name, branch in model.transformer_list:
            feature_names += get_feature_names(branch, names, branch_name)
        return feature_names
    # If it is none of the above do not add it.
    else:
        return []
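You will also need this method. It operates on a single transformer, such as a TfidfVectorizer, to get its names. scikit-learn has no universal get_feature_names, so you have to fudge it for different cases. This is my attempt at doing something reasonable for most use cases.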
def extract_feature_names(model, name) -> List[str]:
    """Extracts the feature names from arbitrary sklearn models

    Args:
        model: The Sklearn model, transformer, clustering algorithm, etc. which we want to get named features for.
        name: The name of the current step in the pipeline we are at.

    Returns:
        The list of feature names. If the model does not have named features it constructs feature names
        by appending an index to the provided name.
    """
    # scikit-learn >= 1.0 exposes get_feature_names_out(); older releases use get_feature_names().
    if hasattr(model, "get_feature_names_out"):
        return list(model.get_feature_names_out())
    elif hasattr(model, "get_feature_names"):
        return list(model.get_feature_names())
    elif hasattr(model, "n_clusters"):
        return [f"{name}_{x}" for x in range(model.n_clusters)]
    elif hasattr(model, "n_components"):
        return [f"{name}_{x}" for x in range(model.n_components)]
    elif hasattr(model, "components_"):
        n_components = model.components_.shape[0]
        return [f"{name}_{x}" for x in range(n_components)]
    elif hasattr(model, "classes_"):
        # Fixed: the original returned the undefined name `classes_`.
        return list(model.classes_)
    else:
        return [name]
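To get the names of the features, look at pipe.steps[0][1].get_feature_names()
This is an incomplete answer. Preprocessing and feature engineering are usually part of the pipeline, so you need to take that into account. If there are multiple steps, one approach is to go through named_steps. For the OP's case, this would be
pipeline.named_steps['predictor'].feature_importances_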
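A minimal sketch of the OP's two-step setup shows that access by position and access by name reach the same fitted estimator. The synthetic data and the use of PolynomialFeatures as the feature-construction step are assumptions; the original does not say what the first step is, and the step name "features" is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

pipeline = Pipeline([
    # Hypothetical stand-in for the OP's feature-construction step.
    ("features", PolynomialFeatures(degree=2, include_bias=False)),
    ("predictor", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)

# Both expressions return the importances of the same fitted forest.
by_index = pipeline.steps[1][1].feature_importances_
by_name = pipeline.named_steps["predictor"].feature_importances_
assert np.array_equal(by_index, by_name)

# The importances refer to the constructed features, not the raw inputs,
# so there is one value per output column of the first step.
n_constructed = pipeline.named_steps["features"].n_output_features_
print(n_constructed, len(by_name))
```

Access by name is usually the safer of the two, since it keeps working if a step is later inserted before the predictor.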
How do I change the feature importance type?