How to find the frequency of a specific word in a corpus (containing many documents) in Python using TF-IDF
How can I use TF-IDF to find the frequency of a single word in a corpus? Below is my sample code. I now want to print the frequency of one word. How can I do that?
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']
X = vectorizer.fit_transform(corpus)
X
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
X.toarray()
vectorizer.vocabulary_.get('document')
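Note that vectorizer.vocabulary_ does not hold word counts. It maps each word to its column index in X:
print(vectorizer.vocabulary_)
{'this': 8,
'is': 3,
'the': 6,
'first': 2,
'document': 1,
'second': 5,
'and': 0,
'third': 7,
'one': 4}
The counts themselves are in X, so the word frequency is each column's sum divided by the total number of tokens:
import numpy as np
vocab = vectorizer.vocabulary_
counts = np.asarray(X.sum(axis=0)).ravel()  # total occurrences of each word across all documents
tot = counts.sum()  # total number of tokens in the corpus
frequency = {w: counts[i] / tot for w, i in vocab.items()}  # word -> relative frequency
print(frequency['document'])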
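Since the question asks about TF-IDF specifically, the following is a minimal sketch of how one word's raw counts and TF-IDF weights could be read out side by side. It assumes the default tokenization of CountVectorizer and TfidfVectorizer. The helper name word_stats is hypothetical and not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]


def word_stats(word, docs):
    """Hypothetical helper: per-document counts, corpus frequency and
    per-document TF-IDF weights for a single word.

    Returns None if the word is not in the fitted vocabulary.
    """
    word = word.lower()  # both vectorizers lowercase by default

    # Raw counts: one row per document, one column per vocabulary word.
    count_vec = CountVectorizer()
    counts = count_vec.fit_transform(docs)
    col = count_vec.vocabulary_.get(word)  # column index, not a count
    if col is None:
        return None
    per_doc_counts = counts[:, col].toarray().ravel().tolist()
    corpus_freq = sum(per_doc_counts) / int(counts.sum())

    # TF-IDF weights: same layout, but each row is weighted and normalized.
    tfidf_vec = TfidfVectorizer()
    weights = tfidf_vec.fit_transform(docs)
    tfidf_col = tfidf_vec.vocabulary_[word]
    per_doc_tfidf = weights[:, tfidf_col].toarray().ravel().tolist()

    return per_doc_counts, corpus_freq, per_doc_tfidf


result = word_stats('second', corpus)
print(result)
```

The TF-IDF values are weights rather than frequencies, so if a plain frequency is what is needed, the count-based numbers are the ones to use.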