Python: How can I get the highest-frequency terms from the TF-IDF vectors for each file in scikit-learn?

python parsing machine-learning scikit-learn

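I am trying to get the highest-frequency terms out of the vectors in scikit-learn. The example shows how to do this for each category, but I want it for each file inside the categories.

I want to do this for each file in the test dataset rather than for each category. Where should I be looking?

Thanks.

Edit: s/discritive/highest frequency/g (sorry for the confusion)

It seems nobody knows. I am answering here because other people face the same problem. I have found where to look, but I have not fully implemented it yet.

It lies deep inside the CountVectorizer in sklearn.feature_extraction.text: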

def transform(self, raw_documents):
    """Extract token counts out of raw text documents using the vocabulary
    fitted with fit or the one provided in the constructor.

    Parameters
    ----------
    raw_documents: iterable
        an iterable which yields either str, unicode or file objects

    Returns
    -------
    vectors: sparse matrix, [n_samples, n_features]
    """
    if not hasattr(self, 'vocabulary_') or len(self.vocabulary_) == 0:
        raise ValueError("Vocabulary wasn't fitted or is empty!")

    # raw_documents can be an iterable so we don't know its size in
    # advance

    # XXX @larsmans tried to parallelize the following loop with joblib.
    # The result was some 20% slower than the serial version.
    analyze = self.build_analyzer()
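    # Added here: keep a copy of the per-document term counters on the
    # vectorizer so they can be inspected after transform().
    # deepcopy must be imported (from copy import deepcopy).
    term_counts_per_doc = [Counter(analyze(doc)) for doc in raw_documents]
    self.test_term_counts_per_doc = deepcopy(term_counts_per_doc)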
    return self._term_count_dicts_to_matrix(term_counts_per_doc)

This is my fork; I have also submitted a pull request:

If there is a better way, please let me know.
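One possible way to get similar per-document counters without patching the library is to call the vectorizer's public build_analyzer() method yourself. The following is only a sketch under stated assumptions: the sample documents are made up, and filtering tokens by vocabulary_ is a choice made here so that the counters only contain terms the fitted vectorizer knows about.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample data, for illustration only.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["cat cat cat dog", "a bird sat on the dog dog"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)

# build_analyzer() returns the callable the vectorizer uses to turn a raw
# document into a list of tokens (preprocessing, tokenizing, n-grams).
analyze = vectorizer.build_analyzer()
vocabulary = vectorizer.vocabulary_

# One Counter per test document, restricted to terms seen during fit.
term_counts_per_doc = [
    Counter(token for token in analyze(doc) if token in vocabulary)
    for doc in test_docs
]

for doc, counts in zip(test_docs, term_counts_per_doc):
    print(doc, "->", counts.most_common(2))
```

Because the counters live outside the vectorizer, nothing in scikit-learn has to be modified, at the cost of tokenizing the test documents a second time.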

You can use the result of transform together with get_feature_names to obtain the term counts for a given document:

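import numpy as np

X = vectorizer.transform(docs)  # vectorizer must already be fitted
terms = np.array(vectorizer.get_feature_names())
# pair every vocabulary term with its value in the first document's row
terms_for_first_doc = zip(terms, X.toarray()[0])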

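Can't you just transform the test data with the vectorizer you used when parsing the training data? The vectorizer stores the vocabulary after fit is called, and transform uses that vocabulary to filter whatever data is passed in (according to the docs).

The vocabulary does not store any information about which document (or array/list index) a term came from. It is just arbitrary, as you will see if you look at the scikit-learn source.

Tested and corrected. I was about to post almost the same answer :)

By get_feature_names, do you mean vectorizer.get_feature_names()?

terms = np.array(vectorizer.get_feature_names())
first_top = zip(terms, X_test.toarray()[0])

This does not work yet. It retrieves all the available terms, argh.

@V3ss0n: these are not discriminative terms, just high-frequency terms. Use sorted, heapq.nlargest or whatever Python trick you like to get the terms you want out of terms_for_first_doc.

Why the -1? It is a valid solution (but it requires modifying scikit-learn).

Enjoy your downvote, troll.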
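As a rough illustration of the sorted / heapq.nlargest suggestion, the sketch below picks the two highest-weighted terms from a list of (term, weight) pairs. The pairs are made-up stand-ins for what zip(terms, X.toarray()[0]) would produce; they are not real output.

```python
import heapq

# Hypothetical (term, weight) pairs standing in for terms_for_first_doc.
terms_for_first_doc = [
    ("apple", 0.0),
    ("banana", 0.58),
    ("cherry", 0.81),
    ("date", 0.12),
]

# heapq.nlargest avoids sorting the whole vocabulary when only a few
# terms are wanted.
top_two = heapq.nlargest(2, terms_for_first_doc, key=lambda pair: pair[1])

# Equivalent result with sorted, which orders every pair first.
top_two_sorted = sorted(
    terms_for_first_doc, key=lambda pair: pair[1], reverse=True
)[:2]

print(top_two)
```

Both calls accept any iterable, so they also work on the iterator that zip returns in Python 3.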

import os
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer

# recursive_load_files, trainer_path and tester_path are defined elsewhere
# in the author's script and are not shown here.
load_files = recursive_load_files
trainer_path = os.path.realpath(trainer_path)
tester_path = os.path.realpath(tester_path)
data_train = load_files(trainer_path, load_content=True, shuffle=False)
data_test = load_files(tester_path, load_content=True, shuffle=False)
print('data loaded')

categories = None    # for case categories == None

print("%d documents (training set)" % len(data_train.data))
print("%d documents (testing set)" % len(data_test.data))
#print("%d categories" % len(categories))
print('')

# split a training set and a test set

print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()
# charset_error is the old name of this option; later scikit-learn
# releases call it decode_error
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.7,
                             stop_words='english', charset_error="ignore")

X_train = vectorizer.fit_transform(data_train.data)

print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X_train.shape)
print('')

print("Extracting features from the test dataset using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(data_test.data)
print("Test printing terms per document")
# test_term_counts_per_doc exists only in the patched CountVectorizer above
for counter in vectorizer.test_term_counts_per_doc:
    print(counter)
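
The script above fits the vectorizer on the training files and then reuses it on the test files. A small sketch of what that reuse means in practice, using made-up documents and a stock (unpatched) TfidfVectorizer: the test matrix has the same columns as the training matrix, and words that were never seen during fit are simply dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample data, for illustration only.
train_docs = ["red apple", "green apple", "green pear"]
test_docs = ["red pear", "purple plum"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # reuses it unchanged

print(sorted(vectorizer.vocabulary_))
print(X_train.shape, X_test.shape)

# "purple plum" contains only unseen words, so its row has no entries.
known_row_nnz = X_test[0].nnz
unseen_row_nnz = X_test[1].nnz
print(known_row_nnz, unseen_row_nnz)
```

This is why per-file term lists taken from X_test can only ever contain terms from the training vocabulary.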
