How do I get a plot with the k most important features of my model in scikit-learn?

Tags: scikit-learn, random-forest

Hi, I am using a random forest with several matrices, and I want to get the k best features of my model.

What I mean is that only 3, 4, or in general k features are really relevant in my model, and I tried the following.

The problem with this approach, however, is that I get a plot of all my features, since I compute so many of them, and that is not as interpretable as I would like. So I would appreciate support in modifying the code below to get a plot with a fixed number of features, a number I would like to set as a parameter.

import random

import numpy as np
import matplotlib.pyplot as plt

train_matrix = np.concatenate([state_matrix,company_matrix,seg,complete_work,sub_rep,b_tec,time1,time2,time3,time4,time5,len1], axis=1)

#Performing a shuffle of my data
index_list = list(range(train_matrix.shape[0]))
random.shuffle(index_list)
train_matrix= train_matrix[index_list]
labels_list= labels_list[index_list]

print('times shape: ', time_matrix.shape)
print('cities shape: ', cities.shape)
print('labels1 shape: ', labels1.shape)
print('state shape: ', state_matrix.shape)
print('work type shape: ', work_type.shape)
print('train matrix shape', train_matrix.shape)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

X_train, X_test, y_train, y_test = train_test_split(
    train_matrix, labels_list.tolist(), test_size=0.1, random_state=47)

clf2 = RandomForestClassifier(n_estimators=100,n_jobs=4)

print("vectorization completed")
print("begining training")
import timeit
start_time = timeit.default_timer()

clf2 = clf2.fit(X_train, y_train)
elapsed = timeit.default_timer() - start_time

print('Matrix time shape: '+str(train_matrix.shape)+' Time Seconds: ',elapsed)

#with open('random_forest.pickle','wb') as idxf:
#    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)
print("finishing training")

y_pred = clf2.predict(X_test)
Here is the part I would like to modify in order to get only the k best features of my model:

importances = clf2.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf2.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

#Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), indices)
plt.xlim([-1, X_train.shape[1]])
plt.savefig('fig1.png', dpi = 600)

plt.show()
This is another part of the code:

print("PREDICTION REPORT")
# importing confusion matrix and precision/recall metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support

print(precision_recall_fscore_support(y_test, y_pred, average='macro'))
print(confusion_matrix(y_test, y_pred))

# to print unique values
print(set(y_test))
print(set(y_pred))

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))


Output:

Feature ranking:
1. feature 660 (0.403711)
2. feature 655 (0.139531)
3. feature 659 (0.058074)
4. feature 658 (0.057855)
5. feature 321 (0.015031)
6. feature 322 (0.012731)
7. feature 324 (0.011937)
8. feature 336 (0.011728)
9. feature 650 (0.011174)
10. feature 656 (0.010441)
11. feature 657 (0.009340)
12. feature 337 (0.007385)
13. feature 509 (0.005184)
14. feature 330 (0.005056)
15. feature 325 (0.004927)
16. feature 344 (0.004891)
17. feature 326 (0.004495)
18. feature 334 (0.004349)
19. feature 333 (0.004291)
20. feature 352 (0.004284)
21. feature 338 (0.004164)
22. feature 285 (0.003909)
23. feature 345 (0.003631)
24. feature 652 (0.003341)
25. feature 329 (0.003168)
26. feature 651 (0.002890)
27. feature 388 (0.002680)
28. feature 146 (0.002650)
29. feature 332 (0.002482)
30. feature 217 (0.002475)
31. feature 513 (0.002363)
32. feature 216 (0.002309)
33. feature 116 (0.002223)
34. feature 323 (0.002107)
35. feature 213 (0.002104)
36. feature 328 (0.002101)
37. feature 102 (0.002088)
38. feature 315 (0.002083)
39. feature 307 (0.002079)
40. feature 427 (0.002043)
41. feature 351 (0.001925)
42. feature 259 (0.001888)
43. feature 171 (0.001878)
44. feature 243 (0.001863)
45. feature 78 (0.001862)
46. feature 490 (0.001815)
47. feature 339 (0.001770)
48. feature 103 (0.001767)
49. feature 591 (0.001741)
50. feature 55 (0.001734)
51. feature 502 (0.001665)
52. feature 194 (0.001632)
53. feature 491 (0.001625)
54. feature 50 (0.001591)
55. feature 193 (0.001590)
56. feature 97 (0.001549)
57. feature 510 (0.001514)
58. feature 245 (0.001504)
59. feature 434 (0.001497)
60. feature 8 (0.001468)
61. feature 241 (0.001457)
62. feature 108 (0.001454)
63. feature 232 (0.001453)
64. feature 292 (0.001443)
65. feature 96 (0.001434)
66. feature 99 (0.001381)
67. feature 11 (0.001367)
68. feature 106 (0.001360)
69. feature 592 (0.001335)
70. feature 60 (0.001334)
71. feature 523 (0.001327)
72. feature 72 (0.001324)
73. feature 236 (0.001323)
74. feature 128 (0.001320)
75. feature 144 (0.001318)
76. feature 288 (0.001300)
77. feature 238 (0.001292)
78. feature 654 (0.001287)
79. feature 499 (0.001285)
80. feature 223 (0.001283)
81. feature 593 (0.001275)
82. feature 33 (0.001264)
83. feature 289 (0.001240)
84. feature 94 (0.001236)
85. feature 433 (0.001233)
86. feature 129 (0.001227)
87. feature 437 (0.001226)
88. feature 113 (0.001221)
89. feature 54 (0.001220)
90. feature 271 (0.001213)
91. feature 107 (0.001186)
92. feature 562 (0.001165)
93. feature 488 (0.001144)
94. feature 521 (0.001128)
95. feature 269 (0.001110)
96. feature 313 (0.001102)
97. feature 13 (0.001063)
98. feature 59 (0.001059)
99. feature 529 (0.001059)
100. feature 278 (0.001055)
101. feature 68 (0.001053)
102. feature 189 (0.001038)
103. feature 176 (0.001001)
104. feature 367 (0.001000)
105. feature 32 (0.001000)
106. feature 18 (0.000984)
107. feature 135 (0.000957)
108. feature 127 (0.000933)
109. feature 39 (0.000924)
110. feature 391 (0.000921)
111. feature 156 (0.000919)
112. feature 316 (0.000904)
113. feature 389 (0.000895)
114. feature 522 (0.000885)
115. feature 449 (0.000874)
116. feature 4 (0.000872)
117. feature 258 (0.000840)
118. feature 489 (0.000828)
119. feature 347 (0.000823)
120. feature 264 (0.000790)
After getting feedback here, I tried the following:

importances = clf2.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf2.estimators_],
             axis=0)

indices = np.argsort(importances)[::-1]
top_k = 10
new_indices = indices[:top_k]
#So you just need to change this part accordingly (just change top_k to your desired value):

# Print the feature ranking
print("Feature ranking:")

for f in range(top_k):
    print("%d. feature %d (%f)" % (f + 1, new_indices[f], importances[new_indices[f]]))
#Same here for plotting the graph:

#Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(top_k), importances[new_indices],
       color="r", yerr=std[new_indices], align="center")

plt.xticks(range(new_indices), new_indices)
plt.xlim([-1, new_indices])
plt.savefig('fig1.png', dpi = 600)
plt.show()
However, I ran into the following error, so I would very much appreciate your support in getting past it:

Feature ranking:
1. feature 660 (0.405876)
2. feature 655 (0.138400)
3. feature 659 (0.056848)
4. feature 658 (0.056631)
5. feature 321 (0.014537)
6. feature 336 (0.013202)
7. feature 324 (0.012455)
8. feature 322 (0.011517)
9. feature 656 (0.011493)
10. feature 650 (0.010850)
Traceback (most recent call last):
  File "random_forest.py", line 234, in <module>
    plt.xticks(range(new_indices), new_indices)
TypeError: only integer scalar arrays can be converted to a scalar index
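
For context on the traceback: range() only accepts integers, while new_indices is a NumPy array, which is what triggers this TypeError. A minimal sketch reproducing the problem with toy numbers (not taken from the model above):

import numpy as np

importances = np.array([0.1, 0.5, 0.2, 0.4])      # toy importances, for illustration only
new_indices = np.argsort(importances)[::-1][:3]   # array([1, 3, 2])

# range(new_indices) raises "only integer scalar arrays can be converted
# to a scalar index" because range() needs an int, not an array:
positions = range(len(new_indices))               # works: 0, 1, 2 (same as range(top_k))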

This is where the indices of the important features are sorted in descending order. That means slicing with indices[:10] gives you the top 10 features:

indices = np.argsort(importances)[::-1]
top_k = 10
new_indices = indices[:top_k]
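
As a quick illustration of what this slice does (toy numbers, not the real importances):

import numpy as np

importances = np.array([0.05, 0.40, 0.15, 0.30, 0.10])
indices = np.argsort(importances)[::-1]   # descending: array([1, 3, 2, 4, 0])
top_k = 3
new_indices = indices[:top_k]             # array([1, 3, 2]) -> the 3 largest importances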
So you just need to change this part accordingly (just change top_k to your desired value).

The same applies here for plotting the graph:

#Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(top_k), importances[new_indices],
       color="r", yerr=std[new_indices], align="center")

#Edited here (put top_k in range)
plt.xticks(range(top_k), new_indices)
#Edited here (put top_k)
plt.xlim([-1, top_k])
plt.show()
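
Putting both fixes together, here is a self-contained sketch that wraps everything in a helper with k as a parameter; the name plot_top_k_importances is mine (not part of scikit-learn), and it assumes clf is an already-fitted forest:

import numpy as np
import matplotlib.pyplot as plt

def plot_top_k_importances(clf, top_k, filename=None):
    """Plot the top_k feature importances of a fitted forest clf."""
    importances = clf.feature_importances_
    # Spread of each feature's importance across the individual trees
    std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
    # Indices of the top_k features, sorted by decreasing importance
    new_indices = np.argsort(importances)[::-1][:top_k]

    plt.figure()
    plt.title("Feature importances (top %d)" % top_k)
    plt.bar(range(top_k), importances[new_indices],
            color="r", yerr=std[new_indices], align="center")
    plt.xticks(range(top_k), new_indices)
    plt.xlim([-1, top_k])
    if filename is not None:
        plt.savefig(filename, dpi=600)
    plt.show()

# For example: plot_top_k_importances(clf2, top_k=10, filename='fig1.png')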


You can use slicing in the middle code section to get the change you want. @neo33 I have edited my code to fix this. It was a mistake on my part: I used the array instead of an integer. I have fixed it now and marked in the code what I changed.