Python 3.x: An efficient way to find uncommon words in a corpus

python-3.x function performance filter


I have a corpus in the following format:

corpus = [['tokenized_text_1'], ['tokenized_text_2'], .... ,['tokenized_text_n']]

I want to remove the uncommon words from it:

def remove_uncommon_words (corpus, threshold):
    uncommon_words = []
    word_count = Counter(corpus)
    for word in word_count:
        if word_count[word] < threshold:
            uncommon_words.append(word)
        else:
            continue
    clean_corpus = []
    for doc in corpus:
        clean_corpus.append([word for word in doc if word not in uncommon_words])
    return clean_corpus


However, this code takes a very long time to run. What can I do to perform the same task faster?

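Here is a more concise and probably faster version. It relies mainly on dict comprehensions and set operations, which are usually faster than list operations for lookups because they are unordered and can use hashing instead of a linear scan: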

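from itertools import chain
from collections import Counter

def remove_uncommon_words(corpus, threshold):
    # Flatten the list of token lists and count every word in one pass
    word_count = Counter(chain.from_iterable(corpus))
    # Words seen fewer than `threshold` times
    uncommon_words = {w: c for w, c in word_count.items() if c < threshold}
    # Set difference on the key views leaves only the common words
    clean_corpus = word_count.keys() - uncommon_words.keys()
    return list(uncommon_words), list(clean_corpus)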
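
The function above returns two vocabulary lists rather than the filtered documents. If the goal is a cleaned corpus with the same shape as the input, a minimal sketch might look like the following. The name filter_uncommon_words and the sample corpus are hypothetical, and the sketch assumes each document is a list of string tokens.

```python
from collections import Counter
from itertools import chain


def filter_uncommon_words(corpus, threshold):
    """Return a copy of corpus without words seen fewer than threshold times."""
    # Count every token across all documents in a single pass.
    word_count = Counter(chain.from_iterable(corpus))
    # A set gives average constant-time membership tests during filtering.
    common_words = {w for w, c in word_count.items() if c >= threshold}
    # Second pass: keep document boundaries and word order intact.
    return [[w for w in doc if w in common_words] for doc in corpus]


# Hypothetical sample data, only for illustration.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
cleaned = filter_uncommon_words(corpus, threshold=2)
print(cleaned)
```

This still needs two passes over the corpus (one to count, one to filter), but each membership test hits a set instead of scanning a list.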



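I think both versions are O(n) overall, where n is the size of the corpus, but that is unavoidable because at some point you have to look at every word. Your original code iterates over the whole corpus twice while mine iterates only once, so I believe mine will be a bit faster, but the overall time complexity is the same.

Note: your code would not actually have run as written, because lists are unhashable, and since corpus is a list of lists you cannot call Counter(corpus) directly. I solved this with chain; see above.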
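
To make the note about unhashable lists concrete, here is a small sketch with a hypothetical two-document corpus. It shows that passing a list of lists to Counter raises a TypeError, and that flattening with chain.from_iterable first avoids it.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample data, only for illustration.
corpus = [["a", "b", "a"], ["b", "c"]]

error_message = None
try:
    # Each element of corpus is a list, and a list cannot be a dict key.
    Counter(corpus)
except TypeError as exc:
    error_message = str(exc)
    print("Counter(corpus) failed:", error_message)

# Flattening first yields individual string tokens, which are hashable.
word_count = Counter(chain.from_iterable(corpus))
print(word_count)
```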
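What counts as "a long time"? Have you profiled this to find out what takes the most time?

I think there are better options than using a list to store uncommon_words. It also looks like Counter reads through the documents to build the word-count dictionary. I would use a set for uncommon_words, because it is faster than a list.

Thanks Victor, this is much faster. Apparently my data format is different from what I wrote in the question. The approach you suggested using chain works very well for the format I described above.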
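
The comments about profiling and about using a set instead of a list can be checked with a rough timing sketch like the one below. The sizes and the names uncommon_list, uncommon_set, and doc are hypothetical, chosen only to make the difference visible; actual timings depend on the machine and the data.

```python
import timeit

# Hypothetical sizes, only for illustration.
uncommon_list = [f"word{i}" for i in range(2000)]
uncommon_set = set(uncommon_list)
doc = [f"word{i}" for i in range(0, 4000, 2)]


def filter_with_list():
    # "not in" on a list scans the list element by element.
    return [w for w in doc if w not in uncommon_list]


def filter_with_set():
    # "not in" on a set is a hash lookup.
    return [w for w in doc if w not in uncommon_set]


list_time = timeit.timeit(filter_with_list, number=5)
set_time = timeit.timeit(filter_with_set, number=5)
print(f"list: {list_time:.4f}s, set: {set_time:.4f}s")
```

Both functions return the same filtered document; only the cost of each membership test differs. For a whole program, the standard-library cProfile module can show which function dominates the run time.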