Why does the TF-IDF calculation take so much time?

python dictionary nlp tf-idf

I am using TF-IDF code from another source on my corpus, which consists of 3 PDF documents, each about 270 pages long.

# Calculating the Term Frequency, Inverse Document Frequency score
import os
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return tb(blob).words.count(word) / len(tb(blob).words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in tb(blob).words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)




# Stemming the articles
from nltk.stem import PorterStemmer
port = PorterStemmer()

bloblist = []
doclist = [pdf1, pdf2, pdf3]   # Defined earlier, not showing here as it is not relevant to the question
for doc in doclist:
    bloblist.append(port.stem(str(doc)))




# TF-IDF calculation on the stemmed articles
for index, blob in enumerate(bloblist):
    print("Top words in document {}".format(index + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in tb(blob).words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    i=1
    for word, score in sorted_words[:5]:
        print("\tWord "+str(i)+": {}, TF-IDF: {}".format(word, round(score, 5)))
        i+=1

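The problem is that it just keeps running, without printing anything beyond "Top words in document 1". Why is it taking so long to calculate the scores? I have let it run for an hour and the code still has not terminated. Earlier, I tried the same code on 50-odd txt files that were much shorter (say, 2-3 paragraphs on average), and there it displayed the TF-IDF scores instantly. What is wrong with 3 documents of 270 pages each?

A few things jump out from a cursory glance: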
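1) I have not seen how tb is implemented, but you appear to be calling tb(blob) for every word. Building the object from tb(blob) once and reusing it for every word would probably speed things up.
2) nltk has its own TF-IDF implementation, which is likely to be more optimized and may speed things up.
3) You could implement this with numpy instead of plain Python, which should speed things up. Even then, it is better to cache results and reuse them than to call a potentially heavy function many times.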
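
The caching idea from points 1 and 3 can be sketched roughly as follows. This is only an illustration under stated assumptions: `tokenize` and `top_tfidf` are hypothetical names, and a simple regex tokenizer stands in for TextBlob so the snippet runs on its own. It keeps the same tf and idf formulas as the question, but tokenizes each document once and stores the counts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for TextBlob's tokenizer: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def top_tfidf(documents, k=5):
    # Tokenize each document exactly once and cache its word counts.
    counts = [Counter(tokenize(doc)) for doc in documents]

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    n_docs = len(documents)
    results = []
    for c in counts:
        total = sum(c.values())
        # Same formulas as in the question: tf = count / length,
        # idf = log(n_docs / (1 + document frequency)).
        scores = {
            word: (n / total) * math.log(n_docs / (1 + doc_freq[word]))
            for word, n in c.items()
        }
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        results.append(ranked[:k])
    return results


docs = ["apple apple banana", "banana cherry", "cherry date", "date egg"]
for index, top in enumerate(top_tfidf(docs)):
    print("Top words in document {}: {}".format(index + 1, top))
```

With this layout, each lookup is a dictionary access, so the work grows roughly with the total number of words rather than with its square.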
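As the other answer says, you are calling tb(blob) far too often; for a document with N words, it looks like you call it more than N^2 times. That is always going to be slow. You need to make a change like this: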
for index, blob in enumerate(bloblist):
    print("Top words in document {}".format(index + 1))
    # XXX use textblob here just once
    tblob = tb(blob)
    scores = {word: tfidf(word, tblob, bloblist) for word in tblob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    i=1
    for word, score in sorted_words[:5]:
        print("\tWord "+str(i)+": {}, TF-IDF: {}".format(word, round(score, 5)))
        i+=1

You will also need to change the tfidf functions so that they use the tblob instead of calling tb(blob) each time.
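
One way the changed functions might look is sketched below. It is an assumption about the intended change, not the answerer's exact code: the functions keep the names from the question but take already tokenized input. Plain lists and sets (hypothetical `word_lists` and `word_sets`) replace TextBlob objects here so the snippet runs without extra dependencies; with TextBlob, `words` would be `tb(blob).words` computed once per document.

```python
import math


def tf(word, words):
    # `words` is an already tokenized document (a list of strings).
    return words.count(word) / len(words)


def n_containing(word, word_sets):
    # `word_sets` holds one set of distinct words per document,
    # so each membership test is a constant-time lookup.
    return sum(1 for s in word_sets if word in s)


def idf(word, word_sets):
    return math.log(len(word_sets) / (1 + n_containing(word, word_sets)))


def tfidf(word, words, word_sets):
    return tf(word, words) * idf(word, word_sets)


# Tokenize every document once, up front.
docs = ["the cat sat", "the dog sat", "a bird flew", "a fish swam"]
word_lists = [d.split() for d in docs]
word_sets = [set(w) for w in word_lists]

# Score only the distinct words of the first document.
scores = {w: tfidf(w, word_lists[0], word_sets) for w in word_sets[0]}
print(sorted(scores.items(), key=lambda x: x[1], reverse=True))
```

Note that `words.count(word)` still scans the whole list on each call; replacing it with a precomputed `collections.Counter` would remove that remaining per-word scan.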
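I did that, but it does not seem to make any difference. The code still runs for hours with no output.

In that case, it sounds like you should run a profiler; nothing you are doing looks like it should take that long.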
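
Following the profiler suggestion, here is a minimal sketch using `cProfile` and `pstats` from the Python standard library. The function `slow_step` is a hypothetical stand-in for the TF-IDF loop, and `profile_call` is an illustrative helper, not part of any library. On the real corpus it may help to profile a small slice of one document first, so the run finishes quickly.

```python
import cProfile
import io
import pstats


def slow_step(n):
    # Hypothetical stand-in for the TF-IDF loop being investigated.
    return sum(i * i for i in range(n))


def profile_call(func, *args):
    # Run func under the profiler and return its result plus a text report.
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the 10 entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


result, report = profile_call(slow_step, 100_000)
print(report)
```

The report lists call counts and cumulative time per function, which should show whether tokenization, counting, or something else dominates the run time.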