gensim (Python): is there a simple way to get the number of times a given token occurs across all documents?


My gensim model looks like this:

from gensim import corpora
from gensim.models import TfidfModel


class MyCorpus(object):
    parametersList = []

    def __init__(self, dictionary):
        self.dictionary = dictionary

    def __iter__(self):
        # for line in open('mycorpus.txt'):
        for line in texts:
            # assume there's one document per line, tokens separated by whitespace
            yield self.dictionary.doc2bow(line[0].lower().split())


if __name__ == "__main__":
    texts = [['human human interface computer'],
             ['survey user user computer system system system response time'],
             ['eps user interface system'],
             ['system human system eps'],
             ['user response time'],
             ['trees'],
             ['graph trees'],
             ['graph minors trees'],
             ['graph minors minors survey survey survey']]

    # build the token <-> id mapping from the whitespace-split documents
    dictionary = corpora.Dictionary(line[0].lower().split() for line in texts)

    # stream the documents as bag-of-words vectors, one at a time
    corpus = MyCorpus(dictionary)

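The corpus automatically computes the frequency of each token in each document.

I can also define a tf-idf model and access the tf-idf statistic of each token in each document: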

model = TfidfModel(corpus)

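However, I don't know how to count, in a memory-friendly way, the number of documents in which a given word appears. How can I do that? [Of course, I could derive it from the tf-idf values and the document frequencies, but I would like to obtain it directly from some counting process.]

For example, for the first document I would like to get something like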

[('human',2), ('interface',2), ('computer',2)]

because each of the tokens above appears in two documents.

For the second document:

[('survey',2), ('user',3), ('computer',2),('system',3), ('response',2),('time',2)]

How about this?

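from collections import Counter


def word_count(document):
    # count how often each whitespace-separated token occurs in one document
    return Counter(document.lower().split())


documents = [...]  # one string per document
count_dict = [word_count(document) for document in documents]

# add up the per-document counters: total occurrences of each token
total = sum(count_dict, Counter())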

I am assuming that each string is a separate document/file; you can adapt this as needed. I have also edited the code.
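If the goal is the number of documents containing a token rather than its total number of occurrences, one possible variation is to count each token at most once per document. The sketch below uses only the standard library; `document_frequency` and `with_document_frequency` are hypothetical helper names, and the documents are assumed to be plain whitespace-separated strings as in the question. Documents are consumed one at a time, so a generator over a file should work as well. (gensim's `Dictionary` also keeps a `dfs` mapping from token id to document frequency, which may make a separate pass unnecessary.)

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc in docs:  # docs can be any iterable, e.g. a generator over a file
        # set() collapses repeats, so each document adds at most 1 per token
        df.update(set(doc.lower().split()))
    return df


def with_document_frequency(doc, df):
    """Pair each distinct token of doc (first-seen order) with its document frequency."""
    distinct = dict.fromkeys(doc.lower().split())
    return [(token, df[token]) for token in distinct]


# Sample documents taken from the question
texts = [
    'human human interface computer',
    'survey user user computer system system system response time',
    'eps user interface system',
    'system human system eps',
    'user response time',
    'trees',
    'graph trees',
    'graph minors trees',
    'graph minors minors survey survey survey',
]

df = document_frequency(texts)
print(with_document_frequency(texts[0], df))
# [('human', 2), ('interface', 2), ('computer', 2)]
```

Only the running `Counter` is held in memory, so the cost grows with the vocabulary size, not with the number of documents.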

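Thanks. But sorry, I don't understand. What is "documents"? Is documents = texts? In count_dict = [word_count(document) for filename in documents], what is "document", and what is word_count (the list comprehension iterates over filename)? Finally, what are Counter and count_dict? It seems to me that you are counting the total number of times a word occurs across all documents, but maybe I'm wrong (?). What I am asking for is something different: the number of documents in which a word appears.

I have edited the answer. It was a typo. The code is fixed.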