Machine learning: getting the feature names (words) of a random forest classifier - text classification with spaCy


machine-learning nlp


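I am trying to classify patents as medical-technology-related or not, based on their concatenated text, and the accuracy I get looks suspicious. For this reason, I want to see which words are most important for the classification.

I followed a tutorial for the spaCy model, but used RandomForestClassifier instead of LinearSVC, because LinearSVC does not support predict_proba, which is more relevant to my problem. Here is my code: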

def printNMostInformative(vectorizer, clf, N):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    topClass1 = coefs_with_fns[:N]
    topClass2 = coefs_with_fns[:-(N + 1):-1]
    print("Class 1 best: ")
    for feat in topClass1:
        print(feat)
    print("Class 2 best: ")
    for feat in topClass2:
        print(feat)

class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        self.coef_ = self.feature_importances_

vectorizer = CountVectorizer(tokenizer=tokenizeText, ngram_range=(1,1))
clf = RandomForestClassifierWithCoef(n_estimators=1000, random_state=0)
pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer), ('clf', clf)])

# data
train1 = train['Whole_text'].tolist()
labelsTrain1 = train['Med_area'].tolist()

test1 = test['Whole_text'].tolist()
labelsTest1 = test['Med_area'].tolist()
# train
pipe.fit(train1, labelsTrain1)

# test
preds = pipe.predict(test1)
print("accuracy:", accuracy_score(labelsTest1, preds))
print("Top 10 features used to predict: ")
printNMostInformative(vectorizer, clf, 10)

pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer)])
transform = pipe.fit_transform(train1, labelsTrain1)
vocab = vectorizer.get_feature_names()

for i in range(len(train1)):
    s = ""
    indexIntoVocab = transform.indices[transform.indptr[i]:transform.indptr[i+1]]
    numOccurences = transform.data[transform.indptr[i]:transform.indptr[i+1]]
    for idx, num in zip(indexIntoVocab, numOccurences):
        s += str((vocab[idx], num))

I keep getting this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-23-4e74698a75fc> in <module>
     33 print("accuracy:", accuracy_score(labelsTest1, preds))
     34 print("Top 10 features used to predict: ")
---> 35 printNMostInformative(vectorizer, clf, 10)
     36 
     37 pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer)])

<ipython-input-23-4e74698a75fc> in printNMostInformative(vectorizer, clf, N)
      1 def printNMostInformative(vectorizer, clf, N):
      2     feature_names = vectorizer.get_feature_names()
----> 3     coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
      4     topClass1 = coefs_with_fns[:N]
      5     topClass2 = coefs_with_fns[:-(N + 1):-1]

**TypeError: zip argument #1 must support iteration**
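
For what it's worth, the error can probably be reproduced without the pipeline. With LinearSVC, coef_ is a 2-D array, so coef_[0] is a row of coefficients. The subclass above sets coef_ to feature_importances_, which is 1-D, so coef_[0] is a single float and zip cannot iterate over it. A minimal sketch, where the importance values and words are made up for illustration:

```python
import numpy as np

# Stand-in for RandomForestClassifier.feature_importances_: a 1-D array
# with one value per feature (the numbers and words are invented).
importances = np.array([0.5, 0.3, 0.2])
feature_names = ["implant", "engine", "catheter"]

first = importances[0]  # a single NumPy float, not a row of coefficients
try:
    list(zip(first, feature_names))
    error_message = None
except TypeError as exc:
    # zip() refuses a non-iterable first argument; the exact wording of
    # the message depends on the Python version.
    error_message = str(exc)

# Zipping the whole 1-D array with the names works as intended.
pairs = sorted(zip(importances, feature_names), reverse=True)
print(error_message)
print(pairs)
```

So dropping the [0] in printNMostInformative should be enough to get past the TypeError.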


I have two questions:

How can I fix this and see the most important words (features) for each class?

Is there any way to see this if I use predict_proba and roc_auc_score?
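
A possible approach to both questions, as a sketch rather than a definitive fix: rank the vocabulary by feature_importances_ directly, and compute roc_auc_score from the predict_proba column of the positive class. The toy texts, the labels, and the helper name top_features are assumptions for illustration; the question's CleanTextTransformer and tokenizeText are left out so the example stands alone. Note that random forest importances are global and unsigned, so they rank words overall and do not give a separate list per class the way signed LinearSVC coefficients do.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline


def top_features(vectorizer, clf, n):
    """Return the n (importance, word) pairs with the highest importance."""
    # get_feature_names_out() exists in scikit-learn >= 1.0; older
    # releases only provide get_feature_names().
    if hasattr(vectorizer, "get_feature_names_out"):
        names = vectorizer.get_feature_names_out()
    else:
        names = vectorizer.get_feature_names()
    importances = clf.feature_importances_  # 1-D, one value per word
    order = np.argsort(importances)[::-1][:n]
    return [(float(importances[i]), str(names[i])) for i in order]


# Toy data (invented): 1 = medical technology, 0 = other.
texts = [
    "catheter implant for cardiac surgery",
    "surgical implant with stent coating",
    "cardiac stent delivery catheter",
    "engine piston for diesel motor",
    "gearbox and motor housing",
    "diesel engine fuel injector",
]
labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipe.fit(texts, labels)

top = top_features(pipe.named_steps["vectorizer"], pipe.named_steps["clf"], 5)
for importance, word in top:
    print(f"{word}: {importance:.3f}")

# predict_proba columns follow clf.classes_, so look up the positive class.
positive_col = list(pipe.named_steps["clf"].classes_).index(1)
scores = pipe.predict_proba(texts)[:, positive_col]
# Scored on the training texts only to keep the sketch short; a real
# evaluation would use held-out data.
auc = roc_auc_score(labels, scores)
print("ROC AUC:", auc)
```

The importances come from the fitted forest, so they stay the same whether the model is scored with accuracy_score or roc_auc_score; only the evaluation number changes.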

Welcome to SO! Please consider asking one question at a time.