The simplest way to get feature names after running SelectKBest in Scikit Learn


python pandas scikit-learn


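I want to do supervised learning.

So far I know how to do supervised learning on all features.

However, I would also like to experiment with the K best features.

I read the documentation and found that Scikit Learn has a SelectKBest method.

Unfortunately, I am not sure how to create a new DataFrame after finding those best features.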


Let's assume I want to experiment with the 5 best features:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# fit_transform returns a NumPy array of the 5 selected columns, without column names
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit_transform(features_dataframe, targeted_class)

Now, if I add the following line:

dataframe = pd.DataFrame(select_k_best_classifier)

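I get a new DataFrame without feature names (only integer column labels from 0 to 4).

I should replace it with:

dataframe = pd.DataFrame(fit_transformed_features, columns=features_names)

My question is how to create the features_names list.

I know that I should use: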

 select_k_best_classifier.get_support()

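which returns an array of boolean values.

A True value in the array marks the position of a selected column.

How should I use this boolean array together with the array of all feature names, which I can get via: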

feature_names = list(features_dataframe.columns.values)

You can do the following:

from sklearn.feature_selection import SelectKBest, f_classif

# Keep the selector object itself; fit_transform only returns a NumPy array
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5)
fit_transformed_features = select_k_best_classifier.fit_transform(features_dataframe, targeted_class)

mask = select_k_best_classifier.get_support()  # boolean array, True for each selected column
new_features = []  # names of the K best features

for is_selected, feature in zip(mask, feature_names):
    if is_selected:
        new_features.append(feature)

Then set the feature names as the column labels:

dataframe = pd.DataFrame(fit_transformed_features, columns=new_features)
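If you would rather not write the loop by hand, the standard library's itertools.compress does the same filtering. This is a minimal sketch: the feature names and the hand-written mask below are made up for illustration and stand in for the DataFrame columns and the output of get_support().

```python
from itertools import compress

# Hypothetical column names and mask; in practice the mask would come from
# the fitted selector's get_support() and the names from the DataFrame columns.
feature_names = ["age", "height", "weight", "income", "score"]
mask = [True, False, True, False, True]

# compress() keeps each name whose corresponding mask entry is truthy
new_features = list(compress(feature_names, mask))
print(new_features)
```

The result is a plain list, so it can be passed directly as the columns argument of pd.DataFrame.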

This doesn't require a loop:

from sklearn.feature_selection import SelectKBest, f_classif

# Create and fit the selector
selector = SelectKBest(f_classif, k=5)
selector.fit(features_df, target)
# Get the integer indices of the columns to keep and build a new DataFrame from them
cols = selector.get_support(indices=True)
features_df_new = features_df.iloc[:, cols]
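To see the index-based approach run end to end, here is a small sketch on the iris dataset bundled with scikit-learn. The dataset, k=2 and the variable names are choices made for this illustration only, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load iris as a DataFrame so the columns carry feature names
iris = load_iris(as_frame=True)
features_df, target = iris.data, iris.target

# Fit the selector and keep the object itself (not the transformed array)
selector = SelectKBest(f_classif, k=2)
selector.fit(features_df, target)

# Integer positions of the selected columns
cols = selector.get_support(indices=True)

# Positional indexing keeps the original column names and row index
features_df_new = features_df.iloc[:, cols]
print(list(features_df_new.columns))
```

Because the new DataFrame is sliced from the original one, there is no need to rebuild the column labels separately.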

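The code below finds the K features with the highest F-scores. Let X be a pandas DataFrame whose columns are all the features, and let y be the list of class labels.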

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
# Select the 5 features with the highest ANOVA F-scores
selector = SelectKBest(f_classif, k=5)
# fit_transform returns the selected columns for later use in the classifier;
# fit() alone is enough if you only want the feature names and their scores
X_new = selector.fit_transform(X, y)
names = X.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
names_scores = list(zip(names, scores))
ns_df = pd.DataFrame(data = names_scores, columns=['Feat_names', 'F_Scores'])
#Sort the dataframe for better visualization
ns_df_sorted = ns_df.sort_values(['F_Scores', 'Feat_names'], ascending = [False, True])
print(ns_df_sorted)

For me, this code works fine and is more "pythonic":

mask = select_k_best_classifier.get_support()
new_features = features_dataframe.columns[mask]

There is another alternative, although it is not as fast as the solutions above.

# Use the selector to retrieve the best features
X_new = select_k_best_classifier.fit_transform(train[feature_cols],train['is_attributed'])

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(select_k_best_classifier.inverse_transform(X_new),
                            index=train.index,
                            columns= feature_cols)
selected_columns = selected_features.columns[selected_features.var() !=0]

Select the 10 best features according to chi2:

from sklearn.feature_selection import SelectKBest, chi2

KBest = SelectKBest(chi2, k=10).fit(X, y)

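Get the selected features with get_support():

f = KBest.get_support(indices=True)  # integer indices of the selected columns

Create a new DataFrame called X_new:

X_new = X[X.columns[f]]  # final features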

A small correction: features_df_new = features_df.iloc[:, cols]

@volody Thanks, I have updated the answer. Perhaps the previous syntax used to work? I'm not sure.

Note that .get_support() must be called on SelectKBest(score_func=f_classif, k=5) (an instance of class 'sklearn.feature_selection.univariate_selection.SelectKBest'), not on SelectKBest(score_func=f_classif, k=5).fit_transform(X, Y) (a NumPy array).