Python: computing TF-IDF with scikit-learn's feature extraction module


Please read this post in full before flagging it. I have searched all over the internet trying to figure this out.

I am simply trying to work through this, but I am having trouble replicating the results. In particular, I cannot reproduce mathematically the numbers I am producing. Everything is clear up until I try to generate the tf-idf values, as shown in the comments in the code below.

Specifically, am I generating the tf-idf values below correctly? If so, how can I reproduce them mathematically? As far as I understand, it should be tf * idf, where both are simple calculations, as described in the comments below.

Thanks in advance.

from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", 
    "We can see the shining sun, the bright sun.")

vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(train_set)
print(vectorizer.vocabulary_)
# Vocabulary: {u'blue': 0, u'sun': 3, u'bright': 1, u'sky': 2}
freq_term_matrix = vectorizer.transform(test_set)
print(freq_term_matrix.todense())
# [[0 1 1 1]
# [0 1 0 2]]

from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
# The arguments as they are passed into TfidfTransformer:
# TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

print "IDF:", tfidf.idf_
# This is where the confusion begins. What are these numbers?
# IDF: [ 2.09861229  1.          1.40546511  1.        ]


tf_idf_matrix = tfidf.transform(freq_term_matrix)
print(tf_idf_matrix.todense())
# It's my understanding that these are simply tf * idf where 
# tf = (number of times a word appears in a doc) / (number of words in document)
# idf = log((number of documents) / (number of docs the word appears in))
# [[ 0.          0.50154891  0.70490949  0.50154891]
# [ 0.          0.4472136   0.          0.89442719]]

You can look at the description in detail. – @VivekKumar

@VivekKumar Hi! Thanks for the reply. I still see some problems: when I set smooth_idf to False, I get an inf value back in the idf.
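Regarding the inf in the follow-up comment: with smooth_idf=False the idf becomes ln(n / df) + 1 without the +1 smoothing of the counts, and 'blue' never occurs in the test set the transformer was fit on (df = 0), so the division by zero yields inf. A quick sketch of that unsmoothed computation:

```python
import numpy as np

# Unsmoothed idf, ln(n / df) + 1, which is what smooth_idf=False computes.
# 'blue' (column 0) appears in zero test documents, so df = 0 and the
# division by zero produces inf -- consistent with the value reported above.
counts = np.array([[0, 1, 1, 1],
                   [0, 1, 0, 2]], dtype=float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
with np.errstate(divide='ignore'):
    idf = np.log(n_docs / df) + 1
print(idf)  # idf -> [inf, 1.0, 1.69314718, 1.0]
```

Fitting on the training set instead (where every vocabulary term has df > 0), or leaving smooth_idf=True, avoids the inf.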