Python 3.x: How to use scikit-learn's CountVectorizer?

python-3.x scikit-learn

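I have a list of words, and I need to check whether they occur in a set of documents:

WordList = [w1, w2, ..., wn]

I also have a list of documents in which I need to look for these words.

How can I use scikit-learn's CountVectorizer so that the features of the term-document matrix are only the words in WordList, with each row representing one document and each column holding the number of times the corresponding word from the list occurs in that document?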

OK, I figured it out. The code is as follows:

from sklearn.feature_extraction.text import CountVectorizer
# Count how many times each word (unigram) appears in each document.
vectorizer = CountVectorizer(input='content', binary=False, ngram_range=(1, 1))
# First learn the vocabulary from the word list.
vectorizer = vectorizer.fit(WordList)
# Now transform the text of each document; Document_List is a list of strings.
tfMatrix = vectorizer.transform(Document_List).toarray()

This outputs a term-document matrix whose features come only from the word list.
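As an alternative sketch, the word list can be passed through the `vocabulary` parameter of `CountVectorizer`, which pins one column per word in list order. The sample `word_list`, `document_list`, and `messy_list` below are made up for illustration. The sketch assumes the words are lowercase and unique, since documents are lowercased by default and duplicate vocabulary entries are rejected. The second half shows why calling `fit()` on a word list can yield fewer columns than list entries: the entries are tokenized like documents, so case variants merge and one-character tokens are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data, for illustration only.
word_list = ["cat", "dog", "garden"]
document_list = [
    "The cat chased the dog.",
    "The garden is quiet; the cat sleeps in the garden.",
    "No pets here.",
]

# A fixed vocabulary pins one column per word, in list order.
vectorizer = CountVectorizer(vocabulary=word_list)
tf_matrix = vectorizer.fit_transform(document_list).toarray()
print(tf_matrix)

# By contrast, fit() tokenizes the list entries like documents:
# "Cat" and "cat" merge after lowercasing, and the one-character
# token "a" is dropped by the default token pattern.
messy_list = ["Cat", "cat", "a", "dog"]
fitted = CountVectorizer().fit(messy_list)
print(len(messy_list), len(fitted.vocabulary_))
```

With a fixed vocabulary the matrix always has `len(word_list)` columns, even for words that never occur in any document.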

For your own documents, you can use the CountVectorizer approach:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()  # create a CountVectorizer object
corpus = [
    'This is a cat.',
    'It likes to roam in the garden',
    'It is black in color',
    'The cat does not like the dog.',
]
X = vectorizer.fit_transform(corpus)
# print(X) shows the count assigned to each word

# Learned vocabulary: lowercased and sorted; one-letter tokens such as 'a' are dropped.
# On scikit-learn versions before 1.0, use vectorizer.get_feature_names() instead.
vectorizer.get_feature_names_out()
# ['black', 'cat', 'color', 'does', 'dog', 'garden', 'in', 'is',
#  'it', 'like', 'likes', 'not', 'roam', 'the', 'this', 'to']

X.toarray()
# converts the sparse matrix X into a NumPy array

vectorizer.transform(['A new cat.']).toarray()
# check the counts for a new document
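To make the result for a new document easier to read, the counts can be paired with the words they belong to. The following is a minimal sketch using the same corpus; it rebuilds the column order from the fitted `vocabulary_` mapping (word to column index) so that it does not depend on which feature-name method the installed scikit-learn version offers. The names `feature_names` and `nonzero` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is a cat.",
    "It likes to roam in the garden",
    "It is black in color",
    "The cat does not like the dog.",
]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# vocabulary_ maps each word to its column index; sort words by that index.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Count the known words in a new document and keep only those that occur.
new_counts = vectorizer.transform(["A new cat."]).toarray()[0]
nonzero = {
    word: int(count)
    for word, count in zip(feature_names, new_counts)
    if count > 0
}
print(nonzero)
```

Words that were not seen during fitting, such as "new", are ignored by `transform`.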

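You can also use other vectorizers, such as TfidfVectorizer. TfidfVectorizer is often a better choice, because it reflects not only how many times a word occurs in a particular document but also how important that word is.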
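A minimal usage sketch of `TfidfVectorizer` with its default settings on the same four-sentence corpus might look like this. The variable names are illustrative. By default each row is L2-normalized, so the weights are relative within a document rather than raw counts.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is a cat.",
    "It likes to roam in the garden",
    "It is black in color",
    "The cat does not like the dog.",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus).toarray()

# One row per document, one column per vocabulary word.
print(weights.shape)

# With the default norm='l2', every non-empty row has unit length.
row_norms = np.linalg.norm(weights, axis=1)
print(row_norms)

# In the first document, "this" (found in one document) is weighted
# higher than "is" (found in two documents).
col = tfidf.vocabulary_
print(weights[0, col["this"]], weights[0, col["is"]])
```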

TF-IDF is computed from TF (term frequency) and IDF (inverse document frequency).

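Term frequency is the number of times a word appears in a particular document, while IDF is calculated from the context of the documents. For example, if the documents are about football, the word "the" gives no insight, but the word "messi" tells you what the document is about. It is calculated by taking the log of the number of occurrences. For example, with tf("the") = 10 and tf("messi") = 5:

idf("the") = log(10) = 0
idf("messi") = log(5) = 0.52

tfidf("the") = tf("the") * idf("the") = 10 * 0 = 0
tfidf("messi") = 5 * 0.52 = 2.6

These weights help the algorithm identify the important words in a document, which in turn helps derive semantics from the documents.

The code above seems to work fine, but my word list has 500 elements, so tfMatrix[0], tfMatrix[1], ..., tfMatrix[n] should each have 500 elements, yet they only have 464. What could be the reason?

Please correct the small mistakes in the answer: log(10) == 1, and idf != log(tf). It is actually idf(d, t) = log[n / df(d, t)] + 1 (if smooth_idf=False), where n is the total number of documents and df(d, t) is the document frequency.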
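The IDF formula quoted in the comments can be checked numerically. The sketch below fits a `TfidfVectorizer` with `smooth_idf=False` on four made-up football sentences (the `docs` list is hypothetical) and compares the fitted `idf_` values with log(n / df) + 1, using the natural logarithm. A word that occurs in every document, such as "the", gets the minimum IDF of 1 rather than 0.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents: "the" occurs in all four, "messi" in one.
docs = [
    "the match was won by messi",
    "the referee stopped the match",
    "the crowd left the stadium",
    "the coach praised the team",
]

vectorizer = TfidfVectorizer(smooth_idf=False)
vectorizer.fit(docs)

n = len(docs)
col = vectorizer.vocabulary_  # word -> column index
idf_the = vectorizer.idf_[col["the"]]
idf_messi = vectorizer.idf_[col["messi"]]

# Expected values from idf = ln(n / df) + 1.
expected_the = math.log(n / 4) + 1    # df = 4
expected_messi = math.log(n / 1) + 1  # df = 1

print(idf_the, expected_the)
print(idf_messi, expected_messi)
```

Note that IDF depends on how many documents contain a word, not on how often the word occurs inside one document.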