NLP: how does scikit-learn's TfidfVectorizer compute TF-IDF?

I run the following code to convert a list of texts into a TF-IDF matrix:

text = ['This is a string','This is another string','TFIDF computation calculation','TfIDF is the product of TF and IDF']

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english',norm = None)

X = vectorizer.fit_transform(text)
X_vovab = vectorizer.get_feature_names()
X_mat = X.todense()
X_idf = vectorizer.idf_
I get the following output.

X_vovab=

[u'calculation',
 u'computation',
 u'idf',
 u'product',
 u'string',
 u'tf',
 u'tfidf']
and X_mat=

  ([[ 0.        ,  0.        ,  0.        ,  0.        ,  1.51082562,
      0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ,  1.51082562,
      0.        ,  0.        ],
    [ 1.91629073,  1.91629073,  0.        ,  0.        ,  0.        ,
      0.        ,  1.51082562],
    [ 0.        ,  0.        ,  1.91629073,  1.91629073,  0.        ,
      1.91629073,  1.51082562]])

Now I don't understand how these scores are computed. My thinking is that for text[0] only "string" should get a score, and indeed there is a value in the 5th column. But since TF-IDF is the product of the term frequency, which is 2, and the IDF, which is log(4/2), the value should be 1.39 and not 1.51 as shown in the matrix. How does scikit-learn compute the TF-IDF score?

The exact formula used is described in the scikit-learn documentation:

The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf, instead of tf * idf.

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once.


This means that the value 1.51082562 is obtained as

    1.51082562 = 1 + ln((4 + 1) / (2 + 1))
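
As a quick sanity check, here is a minimal sketch (my own illustration, using numpy and the corpus from the question; the variable names are assumptions) that reproduces the smoothed idf and compares it against the vectorizer's idf_ attribute:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    text = ['This is a string', 'This is another string',
            'TFIDF computation calculation',
            'TfIDF is the product of TF and IDF']

    vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english', norm=None)
    vectorizer.fit(text)

    # 'string' appears in 2 of the 4 documents
    n_docs, df_string = 4, 2

    # smoothed idf: 1 + ln((n + 1) / (df + 1))
    idf_manual = 1 + np.log((n_docs + 1) / (df_string + 1))
    idf_sklearn = vectorizer.idf_[vectorizer.vocabulary_['string']]

    print(idf_manual, idf_sklearn)   # both ~1.5108256237659907
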
TF-IDF is computed by scikit-learn's TfidfVectorizer in several steps: internally it uses TfidfTransformer and inherits from CountVectorizer.

Let me summarize the steps to make it simpler:

  • the term frequencies (tf) are computed by CountVectorizer's fit_transform()
  • the idf is computed by TfidfTransformer's fit()
  • the tf-idf is computed by TfidfTransformer's transform()
  • you can check the source code for the details (a sketch of this two-stage pipeline follows below)
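
As an illustration of that pipeline, here is a minimal sketch on the question's corpus (not the library source, just a demonstration) showing that CountVectorizer followed by TfidfTransformer produces the same matrix as TfidfVectorizer:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

    text = ['This is a string', 'This is another string',
            'TFIDF computation calculation',
            'TfIDF is the product of TF and IDF']

    # step 1: raw term counts (tf)
    counts = CountVectorizer(stop_words='english').fit_transform(text)

    # steps 2-3: fit() learns the idf vector, transform() multiplies tf by idf
    transformer = TfidfTransformer(norm=None)   # norm=None to match the question
    tfidf_two_stage = transformer.fit_transform(counts)

    # single-stage equivalent
    tfidf_one_stage = TfidfVectorizer(stop_words='english', norm=None).fit_transform(text)

    print(np.allclose(tfidf_two_stage.toarray(), tfidf_one_stage.toarray()))   # True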

Back to your example. Here is how the tf-idf weight is computed for the 5th term of the vocabulary in the 1st document (X_mat[0, 4]):

First, the tf of "string" in the first document:

    tf = 1

Second, the idf of "string", with smoothing enabled (the default behaviour):

    idf = 1 + ln((4 + 1) / (2 + 1)) = 1.5108256238

Finally, the tf-idf weight of (document 0, feature 4):

    tfidf(0, 4) = tf * idf = 1 * 1.5108256238 = 1.5108256238
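
That value can be checked directly against the matrix from the question; the sketch below (my own, simply re-running your code) inspects that single cell:

    from sklearn.feature_extraction.text import TfidfVectorizer

    text = ['This is a string', 'This is another string',
            'TFIDF computation calculation',
            'TfIDF is the product of TF and IDF']

    vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english', norm=None)
    X = vectorizer.fit_transform(text)

    # column 4 corresponds to 'string'; document 0 is 'This is a string'
    print(X[0, vectorizer.vocabulary_['string']])   # 1.5108256237659907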

I notice you chose not to normalize the tf-idf matrix. Keep in mind that normalizing it is a common and usually recommended approach, since most models will require the feature matrix (or design matrix) to be normalized.


As the last step of the computation, TfidfVectorizer L2-normalizes the output matrix by default. Normalizing means each row is divided by its Euclidean norm, so all the weights end up between 0 and 1; a sketch of this applied to your example follows below.
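
Here is a minimal sketch (my own illustration, not part of the original answer) showing that dividing each row of the un-normalized matrix by its L2 norm reproduces the default norm='l2' output:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    text = ['This is a string', 'This is another string',
            'TFIDF computation calculation',
            'TfIDF is the product of TF and IDF']

    # un-normalized tf-idf, as in the question
    raw = TfidfVectorizer(stop_words='english', norm=None).fit_transform(text).toarray()

    # L2-normalize each row by hand: divide by the row's Euclidean norm
    manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

    # default behaviour (norm='l2') gives the same matrix
    default = TfidfVectorizer(stop_words='english').fit_transform(text).toarray()
    print(np.allclose(manual, default))   # True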

Comments:

  • So 1.51 only represents the IDF score, not the TF-IDF score? I would guess the TF-IDF score should then be 2 * 1.51 = 3.02?
  • The term frequency is only 1, isn't it? That's why we have 1 * 1.51.
  • Now I get it. Thanks, this is a really good answer!! I spent a whole day trying to understand this.
  • @Rabbit can you demonstrate how to apply normalization in this example?
  • Great explanation. One short hint: the logarithm sklearn uses here is the natural logarithm, so if you derive this by hand (or with a calculator), use ln rather than log base 10.
  • @cemsazara I fixed the part where I mistakenly wrote "log" instead of "ln", thanks.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    corpus = [
         'This is the first document.',
         'This document is the second document.',
         'And this is the third one.',
         'Is this the first document?',
     ]
    print(corpus)
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    print(vectorizer.get_feature_names())
    
    z=X.toarray()
    #term frequency is printed
    print(z)
    
    vectorizer1 = TfidfVectorizer(min_df=1)
    X1 = vectorizer1.fit_transform(corpus)
    idf = vectorizer1.idf_
    # print the learned idf value for each feature
    print(dict(zip(vectorizer1.get_feature_names(), idf)))
    # print the tf-idf matrix (L2-normalized by default)
    print(X1.toarray())
    
    # formula (smoothed idf) for a term that appears in df = 2 of the N = 4
    # documents, e.g. 'first' in this corpus:
    # idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238
    #
    # un-normalized tf-idf for such a term in document 0:
    # tfidf = tf * idf = 1 * 1.5108256238 = 1.5108256238
    # (the values printed by X1.toarray() differ because TfidfVectorizer
    # additionally L2-normalizes each row by default)
    