Python: Is this the correct tf-idf?

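I am trying to get tf-idf values from a set of documents, but I don't think it is giving me the correct values, or I may be doing something wrong. Please advise. The code and output are below: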

from sklearn.feature_extraction.text import TfidfVectorizer
books = ["Hello there this is first book to be read by wordcount script.", "This is second book to be read by wordcount script. It has some additionl information.", "just third book."]
vectorizer = TfidfVectorizer()
response = vectorizer.fit_transform(books)
feature_names = vectorizer.get_feature_names()
for col in response.nonzero()[1]:
   print feature_names[col], '-', response[0, col]

Update 1 (following juanpa.arrivillaga's suggestion) is shown further down.

Output of the code above:

script - 0.269290317245
wordcount - 0.269290317245
by - 0.269290317245
read - 0.269290317245
be - 0.269290317245
to - 0.269290317245
book - 0.209127954024
first - 0.354084405732
is - 0.269290317245
this - 0.269290317245
there - 0.354084405732
hello - 0.354084405732
information - 0.0
...

Output after update 1:

script - 0.256536760895
wordcount - 0.256536760895
by - 0.256536760895
read - 0.256536760895
be - 0.256536760895
to - 0.256536760895
book - 0.182528018244
first - 0.383055542114
is - 0.256536760895
this - 0.256536760895
there - 0.383055542114
hello - 0.383055542114
information - 0.0
...

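As I understand it, tf-idf equals tf * idf. Here is how I calculated it by hand, as an example:

Document 1: "Hello there this is first book to be read by wordcount script." Document 2: "This is second book to be read by wordcount script. It has some additionl information." Document 3: "just third book."

Tf-idf for "Hello":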

tf= 1/12(total terms in document 1)= 0.08333333333
idf= log(3(total documents)/1(no. of document with term in it))= 0.47712125472
0.08333333333*0.47712125472= 0.03976008865

This differs from the code output (hello - 0.354084405732).

Manual calculation after update 1:

tf = 1
idf= log(nd/df) +1 = log (3/1) +1= 0.47712125472 + 1= 1.47712 
tfidf = tf*idf = 1* 1.47712= 1.47712

(This differs from the code output after the idf-smoothing update, "hello - 0.383055542114".)

Any help in understanding what is going on is much appreciated.

Here is the output with no smoothing or normalization:

In [2]: from sklearn.feature_extraction.text import TfidfVectorizer
   ...: books = ["Hello there this is first book to be read by wordcount script.", "This is second book to be read by wordcount script. It has some additionl information.", "just third book."]
   ...: vectorizer = TfidfVectorizer(smooth_idf=False, norm=None)
   ...: response = vectorizer.fit_transform(books)
   ...: feature_names = vectorizer.get_feature_names()
   ...: for col in response.nonzero()[1]:
   ...:    print(feature_names[col], '-', response[0, col])
   ...:
hello - 2.09861228867
there - 2.09861228867
this - 1.40546510811
is - 1.40546510811
first - 2.09861228867
book - 1.0
to - 1.40546510811
be - 1.40546510811
read - 1.40546510811
by - 1.40546510811
wordcount - 1.40546510811
script - 1.40546510811
this - 1.40546510811
is - 1.40546510811
book - 1.0
to - 1.40546510811
be - 1.40546510811
read - 1.40546510811
by - 1.40546510811
wordcount - 1.40546510811
script - 1.40546510811
second - 0.0
it - 0.0
has - 0.0
some - 0.0
additionl - 0.0
information - 0.0
book - 1.0
just - 0.0
third - 0.0

Consider the result for "hello" in the output above (2.09861228867). Now, manually:

In [3]: import math

In [4]: tf = 1

In [5]: idf = math.log(3/1) + 1

In [6]: tf*idf
Out[6]: 2.09861228866811
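The same check can be extended to every term in the first document. The following is only a sketch in plain Python: the `tokenize` and `raw_tfidf` helpers are made up for illustration, and the regular expression is meant to approximate the vectorizer's default tokenization (lowercased tokens of two or more word characters), not to reproduce its implementation.

```python
import math
import re
from collections import Counter

books = [
    "Hello there this is first book to be read by wordcount script.",
    "This is second book to be read by wordcount script. It has some additionl information.",
    "just third book.",
]

def tokenize(text):
    # Hypothetical helper: lowercase, keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(b) for b in books]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

def raw_tfidf(doc):
    # Raw term count times the unsmoothed idf, ln(n_docs / df) + 1.
    counts = Counter(doc)
    return {t: c * (math.log(n_docs / df[t]) + 1) for t, c in counts.items()}

weights = raw_tfidf(docs[0])
for term in ("hello", "this", "book"):
    print(term, "-", weights[term])
```

Under these assumptions, "hello" (one document), "this" (two documents) and "book" (all three) should line up with the 2.0986..., 1.4054... and 1.0 values printed above.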

The problem with your manual calculation is that you used log base 10, but you need to use the natural logarithm.
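To make the difference concrete, here is a small side-by-side comparison for "hello" (3 documents, 1 of them containing the term). The variable names are only illustrative.

```python
import math

n_docs, df_hello = 3, 1

# Base-10 logarithm, as in the manual calculation from the question.
idf_log10 = math.log10(n_docs / df_hello) + 1

# Natural logarithm, which matches the unsmoothed, unnormalized output.
idf_ln = math.log(n_docs / df_hello) + 1

print("log10:", idf_log10)
print("ln:   ", idf_ln)
```

The first value corresponds to the 1.47712 from the question, the second to the 2.0986... reported for "hello".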

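If you still have a strong desire to work through the smoothing and normalization steps by hand, this should show you how to do it properly.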
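For anyone who wants to see those steps spelled out, below is a rough sketch for the first document only. It assumes the smoothed idf is ln((1 + n) / (1 + df)) + 1 and that each row is then divided by its L2 norm, which is how the scikit-learn documentation describes the defaults. The document frequencies are typed in by hand, and `l2_tfidf` is a made-up helper name.

```python
import math
from collections import Counter

doc1 = "hello there this is first book to be read by wordcount script".split()
n_docs = 3

# Document frequencies for the terms of document 1, counted by hand:
# most terms occur in documents 1 and 2, a few only in document 1,
# and "book" occurs in all three.
df = {term: 2 for term in doc1}
df.update({"hello": 1, "there": 1, "first": 1, "book": 3})

def l2_tfidf(smooth):
    def idf(term):
        if smooth:
            # Smoothed idf: acts as if one extra document contained every term.
            return math.log((1 + n_docs) / (1 + df[term])) + 1
        return math.log(n_docs / df[term]) + 1

    # Raw term counts times idf, then divide by the Euclidean norm of the row.
    raw = {t: c * idf(t) for t, c in Counter(doc1).items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

default_like = l2_tfidf(smooth=True)
update1_like = l2_tfidf(smooth=False)
print("hello -", default_like["hello"], "/", update1_like["hello"])
print("book  -", default_like["book"], "/", update1_like["book"])
```

With smoothing on, the values should agree with the first output in the question (hello - 0.354..., book - 0.209...). With smoothing off they should agree with the output after update 1 (hello - 0.383..., book - 0.182...).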

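You can see exactly what is used. Note that you did not do IDF smoothing, which TfidfVectorizer does by default. Also, the documentation seems to imply that the term frequency is the raw term frequency, not the term frequency normalized by document length.

@juanpa.arrivillaga, could you turn your comment into an answer? It might help people searching for or asking the same question...

@MaxU I don't have time right now, maybe later today...

@MaxU I put up a basic example with no smoothing or normalization, just as a sanity check.

Thanks a lot for your help. I understand now. Could you suggest how to get rid of the warning "VisibleDeprecationWarning: rank is deprecated; use the ndim attribute or function instead" for the same code?

@Manvi I don't know where that warning comes from. It seems like you should ask another question, but it looks self-explanatory: don't use the rank argument with whatever function is giving you the warning; use the ndim attribute or function instead...

@Manvi It seems to come from the underlying sklearn implementation, which uses a deprecated scipy function. Possibly this has already been fixed; I don't get the warning. Try updating sklearn.

@Manvi I am on version 18, and version 19 was released a few days ago!

One more thing: the code prints tfidf = 0.0 for a few terms. Is that correct? Also, why are zeros printed when using response.nonzero()[1]?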
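On the last question, one likely explanation is the loop itself. nonzero() returns a pair of index arrays, (rows, cols), covering every non-zero entry in the whole matrix. The loop takes only the column indices and always reads row 0, so columns that are non-zero only in document 2 or 3 print 0.0. The sketch below imitates this with a small hand-made dense matrix. The numbers, the term names and the `nonzero` helper are invented for illustration and are not sklearn output.

```python
# Stand-in for the tf-idf matrix: 3 documents (rows) x 4 terms (columns).
matrix = [
    [0.5, 0.0, 0.7, 0.0],
    [0.0, 0.3, 0.4, 0.0],
    [0.0, 0.0, 0.2, 0.9],
]
names = ["hello", "second", "book", "just"]

def nonzero(m):
    # Hypothetical helper mimicking the (rows, cols) pair returned by nonzero().
    pairs = [(r, c) for r, row in enumerate(m) for c, v in enumerate(row) if v != 0]
    return [r for r, _ in pairs], [c for _, c in pairs]

rows, cols = nonzero(matrix)

# Pattern from the question: columns from every row, but always reading row 0.
row0_only = [(names[c], matrix[0][c]) for c in cols]

# Pairing each column index with its own row index never yields a zero.
paired = [(names[c], matrix[r][c]) for r, c in zip(rows, cols)]

print(row0_only)
print(paired)
```

With the real sparse matrix, the equivalent change would be to loop over `zip(*response.nonzero())` and print `response[row, col]`, or to restrict the loop to a single document's row.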