Topic distribution in a gensim LdaModel trained with CountVectorizer


I have an assignment that starts like this:

import gensim
from sklearn.feature_extraction.text import CountVectorizer

newsgroup_data = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

vect = CountVectorizer(stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')
X = vect.fit_transform(newsgroup_data)
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())

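My task is to estimate the LDA model parameters on the corpus and to find a list of 10 topics with the 10 most significant words in each topic. This is how I did it: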

# Train the model first, then ask it for the topics.
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus, 
              id2word=id_map, num_topics=10, minimum_probability=0)
top10 = ldamodel.print_topics(num_topics=10, num_words=10)

This passed the autograder. The next task is to find the topic distribution for a new document, which I tried as follows:

new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]
newX = vect.transform(new_doc)
newC = gensim.matutils.Sparse2Corpus(newX, documents_columns=False)
print(ldamodel.get_document_topics(newC))

However, this just returns

gensim.interfaces.TransformedCorpus

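I also found this sentence in the documentation: "You can then infer topic distributions on new, unseen documents, with >>> doc_lda = lda[doc_bow]", but that did not work here either. Any help is appreciated.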

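I kept digging, specifically into gensim.interfaces.TransformedCorpus. As far as I can tell, that object does hold the topic distribution I asked for, but I need to iterate over it to see the values: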

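topic_dist = ldamodel.get_document_topics(newC)
td = []
# Iterating the TransformedCorpus yields one list of (topic_id, probability) pairs per document
for topic in topic_dist:
   td.append(topic)
td = td[0]  # newC contains a single document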

That did the trick. Alternatively, you can use

topic_dist = ldamodel[newC]