python: How to get the real feature names from feature_importances_


I am using Python's sklearn random forest (ensemble.RandomForestClassifier) for classification, and I am using feature_importances_ to find the features that matter most to the classifier. My code so far is:

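from collections import Counter

import numpy as np
from sklearn import ensemble
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder

for trip in database:
    venue_feature_start.append(Counter(trip['POI']))
# Counter(trip['POI']) looks like Counter({'school': 1, 'hospital': 1, 'bus station': 2}); each key is a feature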

feat_loc_vectorizer = DictVectorizer()
feat_loc_vectorizer.fit(venue_feature_start)
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start)

orig_tfidf = TfidfTransformer()
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr())

# DictVectorizer() and TfidfTransformer() build the feature matrix; each instance has 580 features, i.e. there are 580 venue types

data = orig_ven_feat.tocsr()

le = LabelEncoder() 
labels = le.fit_transform(labels_raw)
if "Unlabelled" in labels_raw:
    unlabelled_int = int(le.transform(["Unlabelled"])[0])
else:
    unlabelled_int = -1

valid_rows_idx = np.where(labels!=unlabelled_int)[0]  
labels = labels[valid_rows_idx]
user_ids = np.asarray(user_ids_raw)
# user_ids is for cross validation, labels is for classification 

clf = ensemble.RandomForestClassifier(n_estimators = 50)
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds = 10)                      
data = data[valid_rows_idx,:].toarray()
for train_ind, test_ind in cv_indices:
    train_data = data[train_ind,:]
    test_data = data[test_ind,:]
    labels_train = labels[train_ind]
    labels_test = labels[test_ind]

    print ("Training classifier...")
    clf.fit(train_data,labels_train)
    importances = clf.feature_importances_
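The problem is that feature_importances_ gives me an array of length 580 (the same as the feature dimension), and I want to know the top 20 important features (the top 20 important venues).

I think the least I need is the indices of the 20 largest values in importances, but I don't know:

  • how to get the top 20 indices from importances

  • how to match those indices to the real venue names ('school', 'home', and so on), since I used DictVectorizer and TfidfTransformer

Is there any way to do this? Thank you very much.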

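    The attribute feature_importances_ returns the relative importance values in the same order in which the features were fed to the algorithm. So, to get the top 20 features, you need to sort the values by importance, for example: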

    importances = forest.feature_importances_
    indices = numpy.argsort(importances)[-20:]
    

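    The [-20:] slice is used because you need the last 20 elements of the array, since argsort sorts in ascending order. The resulting indices therefore run from least to most important; append [::-1] to list the most important first.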
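A minimal sketch of the argsort step on a small array, to make the ordering concrete. The importance values and k = 3 are made up for illustration; the question would use the 580-element array and k = 20.

```python
import numpy as np

# Hypothetical importances for six features (illustrative numbers only).
importances = np.array([0.05, 0.30, 0.08, 0.25, 0.20, 0.12])
k = 3

# argsort returns indices in ascending order of value, so the last k
# indices point at the k largest values; [::-1] puts the largest first.
top_k = np.argsort(importances)[-k:][::-1]

print(top_k)               # indices of the k most important features
print(importances[top_k])  # their importances, largest first
```

Indexing the importances with top_k gives the matching values in the same order, which is handy when printing indices and scores side by side.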

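    To get the importance for each feature name, just iterate over the column names and the feature importances together (they map to each other):

    # df is assumed to be a DataFrame whose columns are the training features
    for feat, importance in zip(df.columns, clf.feature_importances_):
        print('feature: {f}, importance: {i}'.format(f=feat, i=importance))

    Thanks a lot, but do you know how to match the indices to the real feature names?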
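
The question's pipeline has no DataFrame, so df.columns is not available there. Below is a minimal sketch of the same idea for a DictVectorizer pipeline. It assumes the fitted vectorizer's feature_names_ list follows the column order of its output and that TfidfTransformer only reweights values without reordering columns. The toy trips and labels are made up for illustration and are not the asker's data.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Toy stand-in for the question's data: one Counter of venue types per trip.
trips = [
    Counter({"school": 1, "hospital": 1, "bus station": 2}),
    Counter({"home": 3, "school": 1}),
    Counter({"hospital": 2, "home": 1}),
    Counter({"bus station": 1, "home": 2}),
]
labels = np.array([0, 1, 0, 1])  # made-up class labels

vectorizer = DictVectorizer()
counts = vectorizer.fit_transform(trips)
# TfidfTransformer rescales the counts; the columns stay in the same order.
data = TfidfTransformer().fit_transform(counts).toarray()

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(data, labels)

# feature_names_[i] is the name of column i in the vectorizer's output.
names = np.asarray(vectorizer.feature_names_)
# Indices of up to 20 features, most important first.
top = np.argsort(clf.feature_importances_)[::-1][:20]
for name, importance in zip(names[top], clf.feature_importances_[top]):
    print("feature: {f}, importance: {i:.3f}".format(f=name, i=importance))
```

In the question's code the same lookup would use feat_loc_vectorizer in place of the hypothetical vectorizer above.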