Python: computing TF-IDF with scikit-learn's feature extraction module


Please read this post in full before flagging it. I have searched all over the internet trying to figure this out.

I am simply trying to work through this, but I am having trouble replicating the results. In particular, I cannot reproduce mathematically the numbers I am producing. Everything is clear up until I try to generate the tf-idf values, as shown in the comments in the code below.

Specifically, am I generating the tf-idf values below correctly? If so, how can I reproduce them mathematically? As far as I understand, it should be tf * idf, where both are simple calculations, as described in the comments below.

Thanks in advance.

from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", 
    "We can see the shining sun, the bright sun.")

vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(train_set)
print(vectorizer.vocabulary_)
# Vocabulary: {u'blue': 0, u'sun': 3, u'bright': 1, u'sky': 2}
freq_term_matrix = vectorizer.transform(test_set)
print(freq_term_matrix.todense())
# [[0 1 1 1]
# [0 1 0 2]]

from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
# The arguments as they are passed into TfidfTransformer:
# TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

print "IDF:", tfidf.idf_
# This is where the confusion begins. What are these numbers?
# IDF: [ 2.09861229  1.          1.40546511  1.        ]


tf_idf_matrix = tfidf.transform(freq_term_matrix)
print(tf_idf_matrix.todense())
# It's my understanding that these are simply tf * idf where 
# tf = (number of times a word appears in a doc) / (number of words in document)
# idf = log((number of documents) / (number of docs the word appears in))
# [[ 0.          0.50154891  0.70490949  0.50154891]
# [ 0.          0.4472136   0.          0.89442719]]

You can look at the description in detail. – @VivekKumar

@VivekKumar Hi! Thanks for the reply. I still see some problems: when I set smooth_idf to False, I get an inf value back in the idf.
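Regarding the inf in the follow-up comment: with smooth_idf=False the idf becomes ln(n / df) + 1 without the +1 smoothing of the counts, and 'blue' never occurs in the test set the transformer was fit on (df = 0), so the division by zero yields inf. A quick sketch of that unsmoothed computation:

```python
import numpy as np

# Unsmoothed idf, ln(n / df) + 1, which is what smooth_idf=False computes.
# 'blue' (column 0) appears in zero test documents, so df = 0 and the
# division by zero produces inf -- consistent with the value reported above.
counts = np.array([[0, 1, 1, 1],
                   [0, 1, 0, 2]], dtype=float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
with np.errstate(divide='ignore'):
    idf = np.log(n_docs / df) + 1
print(idf)  # idf -> [inf, 1.0, 1.69314718, 1.0]
```

Fitting on the training set instead (where every vocabulary term has df > 0), or leaving smooth_idf=True, avoids the inf.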