Python: How can I get the names of the features selected by feature elimination in an sklearn pipeline?
I am using recursive feature elimination in an sklearn pipeline. The pipeline looks like this:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ('custom_features', CustomFeatures())])),
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
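How can I get the names of the features that RFE selected? RFE should select the best 500 features, but I really need to see which features were chosen.
Edit:
I have a complex pipeline that consists of multiple pipelines and feature unions, percentile feature selection, and recursive feature elimination at the end: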
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=90)
fs_vect = feature_selection.SelectPercentile(feature_selection.chi2, percentile=80)
f5 = feature_selection.RFE(estimator=svc, n_features_to_select=600, step=3)
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word', sublinear_tf=True, use_idf = True, min_df=2, max_df=0.85, lowercase = True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features= 1000, analyzer=u'word', min_df=2, max_df=0.85, sublinear_tf=True, use_idf = True, lowercase = False)
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('vectorized_pipeline', Pipeline([
                ('union_vectorizer', FeatureUnion([
                    ('stem_text', Pipeline([
                        ('selector', ItemSelector(key='stem_text')),
                        ('stem_tfidf', countVecWord)
                    ])),
                    ('pos_text', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_tfidf', countVecWord_tags)
                    ])),
                ])),
                ('percentile_feature_selection', fs_vect)
            ])),
            ('custom_pipeline', Pipeline([
                ('custom_features', FeatureUnion([
                    ('pos_cluster', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_cluster_inner', pos_cluster)
                    ])),
                    ('stylistic_features', Pipeline([
                        ('selector', ItemSelector(key='raw_text')),
                        ('stylistic_features_inner', stylistic_features)
                    ])),
                ])),
                ('percentile_feature_selection', fs),
                ('inner_scale', inner_scaler)
            ])),
        ],
        # weight components in FeatureUnion
        # n_jobs=6,
        transformer_weights={
            'vectorized_pipeline': 0.8,  # 0.8,
            'custom_pipeline': 1.0  # 1.0
        },
    )),
    ('rfe_feature_selection', f5),
    ('clf', classifier),
])
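I'll try to explain the steps. The first pipeline consists of vectorizers and is called 'vectorized_pipeline'; all of them have a get_feature_names function. The second pipeline consists of my own features, which I have also implemented with fit, transform and get_feature_names functions. When I use the suggestion by @Kevin, I get an error saying that 'union' (the name of the top-level element in my pipeline) has no get_feature_names function: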
support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union'].get_feature_names()
print(np.array(feature_names)[support])
Also, when I try to get the feature names from an individual FeatureUnion, like this:
support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
print(np.array(feature_names)[support])
I get a KeyError:
feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
KeyError: 'union_vectorizer'
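You can access each step of the Pipeline with the named_steps attribute. Here is an example on the iris dataset that selects only 2 features, but the solution will scale: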
from sklearn import datasets
from sklearn import feature_selection
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X = iris.data
y = iris.target
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=2, step=1)
pipeline = Pipeline([
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1)
])
pipeline.fit(X, y)
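With named_steps you can access the attributes and methods of the transformer objects in the pipeline. The attribute support_ (or the method get_support()) returns a boolean mask of the selected features: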
support = pipeline.named_steps['rfe_feature_selection'].support_
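Now support is an array that you can use to efficiently extract the names of the selected features (columns). Make sure your feature names are in a numpy array, not a Python list: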
import numpy as np
feature_names = np.array(iris.feature_names) # transformed list to array
feature_names[support]
array(['sepal width (cm)', 'petal width (cm)'],
dtype='|S17')
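As a small sketch of the point above, the snippet below fits RFE on iris on its own (outside a pipeline, purely to keep it short) and compares support_, get_support(), and get_support(indices=True). The random_state and max_iter values are assumptions added for reproducibility and are not part of the original answer; ranking_ is shown as an extra way to see the order in which features were eliminated.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

# random_state / max_iter are assumptions, added only for a stable, quiet run
rfe = RFE(estimator=LinearSVC(C=0.1, random_state=0, max_iter=10000),
          n_features_to_select=2, step=1).fit(X, y)

mask = rfe.support_                       # boolean mask, one entry per input column
same_mask = rfe.get_support()             # method form of the same mask
indices = rfe.get_support(indices=True)   # integer column indices instead of booleans

feature_names = np.array(iris.feature_names)
selected = feature_names[mask]
print(selected)

# ranking_ is 1 for every selected feature; larger ranks were eliminated earlier
for name, rank in zip(feature_names, rfe.ranking_):
    print(rank, name)
```

Indexing with the boolean mask or with the integer indices gives the same names; the integer form can be handy when the names live in a plain list.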
Edit
Per my comments above, here is your example with the CustomFeatures() function removed:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import numpy as np
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000))])),
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
# boolean mask of the features kept by RFE
support = pipeline.named_steps['rfe_feature_selection'].support_
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
feature_names = pipeline.named_steps['features'].get_feature_names()
np.array(feature_names)[support]
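The KeyError in the question comes from asking named_steps for a step that is not at the top level of the pipeline. The sketch below reuses the two-sentence toy data from the question and shows which keys named_steps actually holds, plus two ways to reach a nested transformer. The choice of n_features_to_select=2 is an assumption made here so that RFE really removes something from the six TF-IDF features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import RFE
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
    ])),
    # n_features_to_select=2 is an assumption for this tiny vocabulary
    ('rfe_feature_selection', RFE(LinearSVC(C=0.1), n_features_to_select=2)),
    ('clf', LinearSVC(C=0.1)),
])
pipeline.fit(X, Y)

# named_steps only contains the top-level step names
top_level = list(pipeline.named_steps)
print(top_level)

# A nested name is not a key, so named_steps['tfidf'] would raise KeyError
print('tfidf' in pipeline.named_steps)

# Option 1: walk down one level at a time
union = pipeline.named_steps['features']
tfidf = dict(union.transformer_list)['tfidf']

# Option 2: use the double-underscore path exposed by get_params()
same_tfidf = pipeline.get_params()['features__tfidf']
print(tfidf is same_tfidf)
```

Both options return the same fitted vectorizer object, so either can be used to call its own feature-name method.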
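Sorry, my answer doesn't really address how to extract the features in your specific example, since you are creating those features inside the pipeline. I don't know what CustomFeatures() is, but you can access the other steps in the pipeline in the same way, using named_steps to extract the list of feature names.
Hi! pipeline.named_steps is just a dictionary, and it has 3 keys: 'union', 'rfe_feature_selection' and 'clf'. Could you post the exact error you get with pipeline.named_steps['union'].get_feature_names()? You mention "I get an error saying that 'union' (the name of the top-level element in my pipeline) has no get_feature_names function", but I don't believe that is correct ;). I think the problem is that get_feature_names is only a method on FeatureUnion (not on Pipeline), and FeatureUnion requires all of its transformers to have such a method.
@ivan_bilan Can you provide an example of the CustomFeatures() function above? I am working on a sentiment analysis project and I am trying to add a dataframe feature using an sklearn pipeline; your code would help me understand how to do this.
@StamTiniakos Sure, you can find the full code example at [link].
I have added more information to my question; your suggestion doesn't seem to work on my pipeline.
This solution doesn't seem to work in the case of nested pipelines, because get_feature_names doesn't seem to be defined.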