NLP: how does scikit-learn's TfidfVectorizer compute TF-IDF?

I run the following code to convert a list of texts into a TF-IDF matrix:

text = ['This is a string','This is another string','TFIDF computation calculation','TfIDF is the product of TF and IDF']

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english',norm = None)

X = vectorizer.fit_transform(text)
X_vovab = vectorizer.get_feature_names()
X_mat = X.todense()
X_idf = vectorizer.idf_
I get the following output.

X_vovab=

[u'calculation',
 u'computation',
 u'idf',
 u'product',
 u'string',
 u'tf',
 u'tfidf']
and X_mat=

  ([[ 0.        ,  0.        ,  0.        ,  0.        ,  1.51082562,
      0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ,  1.51082562,
      0.        ,  0.        ],
    [ 1.91629073,  1.91629073,  0.        ,  0.        ,  0.        ,
      0.        ,  1.51082562],
    [ 0.        ,  0.        ,  1.91629073,  1.91629073,  0.        ,
      1.91629073,  1.51082562]])

Now I don't understand how these scores are computed. My thinking is that for text[0] only "string" should get a score, and indeed there is a value in the 5th column. But since TF-IDF is the product of the term frequency, which is 2, and the IDF, which is log(4/2), the value should be 1.39 and not 1.51 as shown in the matrix. How does scikit-learn compute the TF-IDF score?

The exact formula used is described in the scikit-learn documentation:

The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf, instead of tf * idf.

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once.


This means that the value 1.51082562 is obtained as

    1.51082562 = 1 + ln((4 + 1) / (2 + 1))
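
As a quick sanity check, here is a minimal sketch (my own illustration, using numpy and the corpus from the question; the variable names are assumptions) that reproduces the smoothed idf and compares it against the vectorizer's idf_ attribute:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    text = ['This is a string', 'This is another string',
            'TFIDF computation calculation',
            'TfIDF is the product of TF and IDF']

    vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english', norm=None)
    vectorizer.fit(text)

    # 'string' appears in 2 of the 4 documents
    n_docs, df_string = 4, 2

    # smoothed idf: 1 + ln((n + 1) / (df + 1))
    idf_manual = 1 + np.log((n_docs + 1) / (df_string + 1))
    idf_sklearn = vectorizer.idf_[vectorizer.vocabulary_['string']]

    print(idf_manual, idf_sklearn)   # both ~1.5108256237659907
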
TF-IDF is computed by scikit-learn's TfidfVectorizer in several steps: internally it uses TfidfTransformer and inherits from CountVectorizer.

Let me summarize the steps to make it simpler:

  • the term frequencies (tf) are computed by CountVectorizer's fit_transform()
  • the idf is computed by TfidfTransformer's fit()
  • the tf-idf is computed by TfidfTransformer's transform()
  • you can check the source code for the details (a sketch of this two-stage pipeline follows below)
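
As an illustration of that pipeline, here is a minimal sketch on the question's corpus (not the library source, just a demonstration) showing that CountVectorizer followed by TfidfTransformer produces the same matrix as TfidfVectorizer:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

    text = ['This is a string', 'This is another string',
            'TFIDF computation calculation',
            'TfIDF is the product of TF and IDF']

    # step 1: raw term counts (tf)
    counts = CountVectorizer(stop_words='english').fit_transform(text)

    # steps 2-3: fit() learns the idf vector, transform() multiplies tf by idf
    transformer = TfidfTransformer(norm=None)   # norm=None to match the question
    tfidf_two_stage = transformer.fit_transform(counts)

    # single-stage equivalent
    tfidf_one_stage = TfidfVectorizer(stop_words='english', norm=None).fit_transform(text)

    print(np.allclose(tfidf_two_stage.toarray(), tfidf_one_stage.toarray()))   # True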

Back to your example. Here is how the tf-idf weight is computed for the 5th term of the vocabulary in the 1st document (X_mat[0, 4]):

First, the tf of "string" in the first document:

    tf = 1

Second, the idf of "string", with smoothing enabled (the default behaviour):

    idf = 1 + ln((4 + 1) / (2 + 1)) = 1.5108256238

Finally, the tf-idf weight of (document 0, feature 4):

    tfidf(0, 4) = tf * idf = 1 * 1.5108256238 = 1.5108256238
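
That value can be checked directly against the matrix from the question; the sketch below (my own, simply re-running your code) inspects that single cell:

    from sklearn.feature_extraction.text import TfidfVectorizer

    text = ['This is a string', 'This is another string',
            'TFIDF computation calculation',
            'TfIDF is the product of TF and IDF']

    vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english', norm=None)
    X = vectorizer.fit_transform(text)

    # column 4 corresponds to 'string'; document 0 is 'This is a string'
    print(X[0, vectorizer.vocabulary_['string']])   # 1.5108256237659907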

I notice you chose not to normalize the tf-idf matrix. Keep in mind that normalizing it is a common and usually recommended approach, since most models will require the feature matrix (or design matrix) to be normalized.


As the last step of the computation, TfidfVectorizer L2-normalizes the output matrix by default. Normalizing means each row is divided by its Euclidean norm, so all the weights end up between 0 and 1; a sketch of this applied to your example follows below.
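
Here is a minimal sketch (my own illustration, not part of the original answer) showing that dividing each row of the un-normalized matrix by its L2 norm reproduces the default norm='l2' output:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    text = ['This is a string', 'This is another string',
            'TFIDF computation calculation',
            'TfIDF is the product of TF and IDF']

    # un-normalized tf-idf, as in the question
    raw = TfidfVectorizer(stop_words='english', norm=None).fit_transform(text).toarray()

    # L2-normalize each row by hand: divide by the row's Euclidean norm
    manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

    # default behaviour (norm='l2') gives the same matrix
    default = TfidfVectorizer(stop_words='english').fit_transform(text).toarray()
    print(np.allclose(manual, default))   # True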

Comments:

  • So 1.51 only represents the IDF score, not the TF-IDF score? I would guess the TF-IDF score should then be 2 * 1.51 = 3.02?
  • The term frequency is only 1, isn't it? That's why we have 1 * 1.51.
  • Now I get it. Thanks, this is a really good answer!! I spent a whole day trying to understand this.
  • @Rabbit can you demonstrate how to apply normalization in this example?
  • Great explanation. One short hint: the logarithm sklearn uses here is the natural logarithm, so if you derive this by hand (or with a calculator), use ln rather than log base 10.
  • @cemsazara I fixed the part where I mistakenly wrote "log" instead of "ln", thanks.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    corpus = [
         'This is the first document.',
         'This document is the second document.',
         'And this is the third one.',
         'Is this the first document?',
     ]
    print(corpus)
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    print(vectorizer.get_feature_names())
    
    z=X.toarray()
    #term frequency is printed
    print(z)
    
    vectorizer1 = TfidfVectorizer(min_df=1)
    X1 = vectorizer1.fit_transform(corpus)
    idf = vectorizer1.idf_
    # print the learned idf value for each feature
    print(dict(zip(vectorizer1.get_feature_names(), idf)))
    # print the tf-idf matrix (L2-normalized by default)
    print(X1.toarray())
    
    # formula (smoothed idf) for a term that appears in df = 2 of the N = 4
    # documents, e.g. 'first' in this corpus:
    # idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238
    #
    # un-normalized tf-idf for such a term in document 0:
    # tfidf = tf * idf = 1 * 1.5108256238 = 1.5108256238
    # (the values printed by X1.toarray() differ because TfidfVectorizer
    # additionally L2-normalizes each row by default)
    