TF-IDF calculation in Python (TextBlob)


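I have come across a few methods for calculating TF-IDF scores of the words in a document using Python. I chose to use TextBlob.

I am getting output, but the scores are negative. I understand that this is not correct (a non-negative quantity (tf) divided by the log of a positive quantity (df) does not yield a negative value).

I have looked at a related question posted here, but it did not help.

How I am calculating the scores: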

    import math

    def tf(word, blob):
        # Relative frequency of the word within one document
        return blob.words.count(word) / len(blob.words)

    def n_containing(word, bloblist):
        # Number of documents in which the word occurs
        return sum(1 for blob in bloblist if word in blob)

    def idf(word, bloblist):
        return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

    def tfidf(word, blob, bloblist):
        return tf(word, blob) * idf(word, bloblist)

Then I simply print out the words and their scores:

    "hello, this is a test. a test is always good."


   Top words in document
   Word: good, TF-IDF: -0.06931
   Word: this, TF-IDF: -0.06931
   Word: always, TF-IDF: -0.06931
   Word: hello, TF-IDF: -0.06931
   Word: a, TF-IDF: -0.13863
   Word: is, TF-IDF: -0.13863
   Word: test, TF-IDF: -0.13863

As far as I know and can see, the IDF calculation may be what is incorrect.

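Any help would be much appreciated. Thanks.

Without input/output examples it is hard to pinpoint the cause. One possibility is the idf() method, which returns a negative value whenever word occurs in every blob. In that case the ratio becomes len(bloblist) / (len(bloblist) + 1), which is below 1, so its logarithm is negative. This happens because of the +1 in the denominator, which I assume is there to avoid division by zero.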
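To make the arithmetic concrete, here is a minimal sketch that reproduces the scores from the question without TextBlob. It assumes the corpus holds only the single sample sentence, and tokenize is a hypothetical regex-based stand-in for blob.words, not part of the original code.

```python
import math
import re

def tokenize(text):
    # Hypothetical stand-in for TextBlob's blob.words: lowercase word tokens.
    return re.findall(r"[a-z]+", text.lower())

def tf(word, words):
    return words.count(word) / len(words)

def idf_with_plus_one(word, docs):
    # Same formula as in the question: the +1 sits in the denominator only.
    n_containing = sum(1 for words in docs if word in words)
    return math.log(len(docs) / (1 + n_containing))

# Assumption: the corpus contains just this one document.
docs = [tokenize("hello, this is a test. a test is always good.")]

for word in ("good", "test"):
    score = tf(word, docs[0]) * idf_with_plus_one(word, docs)
    print(word, round(score, 5))
# good is 0.1 * log(1/2) and test is 0.2 * log(1/2); both are negative
```

With one document, every word has a document frequency of 1, so the logarithm is always taken of 1/2.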
One possible fix is to check for zero explicitly:

def idf(word, bloblist):
    x = n_containing(word, bloblist)
    return math.log(len(bloblist) / (x if x else 1))

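Note: in this case, a word that appears in exactly one blob and a word that appears in no blob at all return the same value. There are other solutions that may suit your needs; just remember not to take the log of a fraction smaller than 1.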
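One such alternative, sketched here on the assumption that smoothing is acceptable for your use case, adds 1 to both the number of documents and the document frequency and then adds 1 to the logarithm. This is the formula scikit-learn documents for smooth_idf=True; the function name smoothed_idf and its parameters are made up for illustration.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # log((1 + N) / (1 + df)) + 1: the ratio is >= 1 whenever df <= N,
    # so the result is always >= 1 and never negative.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

print(smoothed_idf(3, 3))  # word occurs in every document
print(smoothed_idf(3, 1))  # word occurs in one document
print(smoothed_idf(3, 0))  # word occurs in no document; no division by zero
```

Unlike the explicit zero check, this variant keeps a small positive weight for words that occur in every document and gives distinct values for a document frequency of 0 and 1.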
IDF scores should be non-negative. The problem is in the implementation of the idf function.
Try this:
from __future__ import division, print_function
from textblob import TextBlob
import math

def tf(word, blob):
    # Relative frequency of the word within one document
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # Document frequency plus 1 (smoothing). Testing against blob.words
    # matches whole tokens; "word in blob" would match substrings.
    return 1 + sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    # (1 + N) / (1 + df) is never below 1, so the log is never negative
    return math.log((1 + len(bloblist)) / n_containing(word, bloblist))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

text = 'tf–idf, short for term frequency–inverse document frequency'
text2 = 'is a numerical statistic that is intended to reflect how important'
text3 = 'a word is to a document in a collection or corpus'

blob = TextBlob(text)
blob2 = TextBlob(text2)
blob3 = TextBlob(text3)
bloblist = [blob, blob2, blob3]
tf_score = tf('short', blob)
idf_score = idf('short', bloblist)
tfidf_score = tfidf('short', blob, bloblist)
print(tf_score, idf_score, tfidf_score)

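Comments:

log of x-if 0

@yurib These values cannot be negative, since the words exist in the document...

I agree that tf-idf scores should not be negative. What I am pointing out is that, technically, your implementation can return negative results. For example, if a word appears in all blobs, idf() returns log(len(bloblist) / (len(bloblist) + 1)), which is negative.

@yurib How do I avoid this? Because I am not sure whether it is "correct".

@user47467 Then that is exactly the problem I described: you have only one blob, so every word appears in "all" blobs, and you are taking the log of a fraction...

Your method produces a score of 0 for every word :/

@user47467 That is the correct tf-idf score for the example, because you have only one document. tf-idf only makes sense for multiple documents.

Even after adding another document, the scores are not right. "test" appears in both documents and gets a score of 0, while the other words get higher scores?

@user47467 That is actually expected; you should read up on tf-idf. It measures how important a word is to a document. If a word appears in all documents, it is not important to any particular one.

Please explain tf_score = tf('movie', blob) etc.

OK. Term frequency counts the occurrences of a word in a document (here divided by the document's length). So it basically measures how often the word movie appears in the document called blob. IDF, on the other hand, is used to penalize words that are very common across all documents, for example "the", "a", etc.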
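The behaviour discussed in the comments can be checked with a small sketch. It uses the idf variant with the explicit zero check and plain token lists instead of TextBlob objects; the second document is made up for illustration, since the one used in the comments is not shown.

```python
import math

def tf(word, words):
    return words.count(word) / len(words)

def idf(word, docs):
    # Explicit zero check instead of +1 in the denominator.
    x = sum(1 for words in docs if word in words)
    return math.log(len(docs) / (x if x else 1))

def tfidf(word, words, docs):
    return tf(word, words) * idf(word, docs)

# Two pre-tokenized documents; doc2 is a hypothetical example.
doc1 = "hello this is a test a test is always good".split()
doc2 = "another test sentence".split()
docs = [doc1, doc2]

for word in sorted(set(doc1), key=lambda w: tfidf(w, doc1, docs), reverse=True):
    print(word, round(tfidf(word, doc1, docs), 5))
# "test" occurs in both documents, so its idf is log(2/2) and its score is 0.0;
# every other word of doc1 occurs only there and gets a positive score.
```

A score of 0 for "test" is the intended outcome here: a word shared by every document does not help to tell the documents apart.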