Implementing a TF-IDF Vectorizer from Scratch in Python


I am trying to implement a tf-idf vectorizer from scratch in Python. I computed the IDF values, but they do not match the IDF values computed by sklearn's TfidfVectorizer.

What am I doing wrong?

corpus = [
 'this is the first document',
 'this document is the second document',
 'and this is the third one',
 'is this the first document',
]

from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy

sentence = []
for i in range(len(corpus)):
    sentence.append(corpus[i].split())  # tokenize each document on whitespace

word_freq = {}   # map each word to the set of documents (indices) containing it
for i in range(len(sentence)):
    tokens = sentence[i]
    for w in tokens:
        try:
            word_freq[w].add(i)  # word seen before: record this document's index
        except KeyError:
            word_freq[w] = {i}   # first occurrence: start a new set for this word

for i in word_freq:
    word_freq[i] = len(word_freq[i])  # number of documents containing the word, i.e. its document frequency

def idf():
    # textbook idf: log(N / df), with N = total number of documents
    idfDict = {}
    for word in word_freq:
        idfDict[word] = math.log(len(sentence) / word_freq[word])
    return idfDict
idfDict = idf()
Expected output: the output obtained with vectorizer.idf_

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073 1.22314355 1.91629073 1.        ]
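
For reference, a minimal snippet showing how these values come out of sklearn (standard TfidfVectorizer usage on the same corpus as above):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)     # learn vocabulary and idf from the corpus
print(vectorizer.idf_)     # one idf value per term, in alphabetical vocabulary order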
Actual output: these are the idf values for the corresponding keys

{'and': 1.3862943611198906,
 'document': 0.28768207245178085,
 'first': 0.6931471805599453,
 'is': 0.0,
 'one': 1.3862943611198906,
 'second': 1.3862943611198906,
 'the': 0.0,
 'third': 1.3862943611198906,
 'this': 0.0}

Several default parameters can affect sklearn's computation, but the most important one here is:

smooth_idf: boolean, default=True. Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

If you subtract one from each element and raise e to that power, you get values very close to 5/n for small n:

1.91629073 => 5/2
1.22314355 => 5/4
1.51082562 => 5/3
1 => 5/5
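
A quick numerical check of that observation (the inputs are the expected idf_ values quoted above):

import math

for v in [1.91629073, 1.22314355, 1.51082562, 1.0]:
    print(math.exp(v - 1))   # ~2.5, 1.25, 1.6667, 1.0, i.e. 5/2, 5/4, 5/3, 5/5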
In any case, there is no single tf-idf implementation; the metric you define is just a heuristic that tries to satisfy certain properties (for example, higher idf should correlate with rarity in the corpus), so I would not worry too much about reproducing the exact same implementation.

sklearn appears to use: log((number of documents + 1) / (word frequency + 1)) + 1, which is much like pretending there is one extra document that contains every word in the corpus.
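
Applying that smoothed formula to the question's word_freq dict reproduces the expected values; a minimal sketch (idf_smooth is a name chosen here for illustration, not sklearn API):

def idf_smooth():
    # sklearn-style smoothed idf: log((1 + N) / (1 + df)) + 1
    n_docs = len(sentence)
    return {word: math.log((1 + n_docs) / (1 + word_freq[word])) + 1
            for word in word_freq}

# e.g. 'document' appears in 3 of the 4 documents: log(5/4) + 1 = 1.22314355
# and 'and' appears in 1 of 4: log(5/2) + 1 = 1.91629073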


Edit: the last paragraph is confirmed by the TfidfVectorizer docstring.

I fixed the code. Apparently sklearn uses an alphabetically sorted vocabulary; I tried running the program with both sorted and unsorted vocabularies, and the output was the same. Also, the formula sklearn uses to compute the IDF is slightly different from the textbook formula; it is the one given in the link.

Yes, that is the link I provided at the bottom of my answer. It is only a question of which metric they chose, not of whether your implementation is wrong.
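
To compare a Python dict against vectorizer.idf_ elementwise, the keys must follow sklearn's alphabetical vocabulary order; a short sketch reusing the idf_smooth sketch from above:

smoothed = idf_smooth()
vocab = sorted(smoothed)             # sklearn sorts its vocabulary alphabetically
print([smoothed[w] for w in vocab])  # order now matches vectorizer.idf_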