Python Gensim topic printing errors/issues

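All,

This follows on from a thread I replied to last month. I am trying to print LSI topics in gensim, and the results are pretty bad. Here is my code: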

try:
    from gensim import corpora, models
except ImportError as err:
    print(err)

class LSI:
    def topics(self, corpus):
        tfidf = models.TfidfModel(corpus)
        corpus_tfidf = tfidf[corpus]
        dictionary = corpora.Dictionary(corpus)
        lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
        print(lsi.show_topics())

if __name__ == '__main__':
    data = '../data/data.txt'
    corpus = corpora.textcorpus.TextCorpus(data)
    LSI().topics(corpus)
This prints the following to the console:

-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)" + ......
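I would like to be able to print the topics the way @2er0 did, but this is the result I get. Note that the second item printed in each term is a tuple, and I have no idea where it comes from. data.txt is a text file containing several paragraphs, nothing more.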


Any thoughts on this would be great! Adam

It looks ugly, but this does the job (just a purely string-based approach):

Output of the above:

('-0.804', ('5', '1'))
('-0.246', ('856', '1'))
('-0.227', ('145', '1'))
If not, you could try lsi.print_topic(i) instead of lsi.show_topics().


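To answer why the LSI topics are tuples rather than words, check your input corpus.

Was it created from a list of tokenized documents converted into a corpus like this?
corpus = [dictionary.doc2bow(text) for text in texts]

Because if it was not, and you just read it from a serialized corpus without also reading the dictionary, then you will not get words in the topic output.

Below is my code, which prints out the topics with weighted words: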

import gensim as gs

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = gs.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = gs.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

lsi = gs.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
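# print_topics() returns one entry per topic; recent gensim releases
# return (topic_id, topic_string) pairs rather than bare strings
for topic in lsi.print_topics():
    print(topic)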
Output of the above:

-0.331*"system" + -0.329*"a" + -0.329*"survey" + -0.241*"user" + -0.234*"minors" + -0.217*"opinion" + -0.215*"eps" + -0.212*"graph" + -0.205*"response" + -0.205*"time"
-0.330*"minors" + 0.313*"eps" + 0.301*"system" + -0.288*"graph" + -0.274*"a" + -0.274*"survey" + 0.268*"management" + 0.262*"interface" + 0.208*"human" + 0.189*"engineering"
0.282*"trees" + 0.267*"the" + 0.236*"in" + 0.236*"paths" + 0.236*"intersection" + -0.233*"time" + -0.233*"response" + 0.202*"generation" + 0.202*"unordered" + 0.202*"binary"
-0.247*"generation" + -0.247*"unordered" + -0.247*"random" + -0.247*"binary" + 0.219*"minors" + -0.214*"the" + -0.214*"to" + -0.214*"error" + -0.214*"perceived" + -0.214*"relation"
0.333*"machine" + 0.333*"for" + 0.333*"lab" + 0.333*"abc" + 0.333*"applications" + 0.258*"computer" + -0.214*"system" + -0.194*"eps" + -0.191*"and" + -0.188*"testing"
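
To make the dictionary point concrete without depending on gensim internals, here is a plain-Python imitation of the two steps involved. build_vocab and to_bow are hypothetical helpers written for this illustration, not gensim functions. The sketch shows that a vocabulary built from token lists maps ids to words, while a vocabulary built from an already-converted bag-of-words corpus ends up treating each (id, count) pair as a "word". That appears to be what happens in the question, where corpora.Dictionary is handed the corpus itself.

```python
def build_vocab(documents):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = {}
    for token in doc:
        idx = token2id[token]
        counts[idx] = counts.get(idx, 0) + 1
    return sorted(counts.items())


texts = [["the", "cat", "ate"], ["the", "dog", "ate", "the", "cat"]]

# Correct: vocabulary from word tokens, then bag-of-words vectors
vocab = build_vocab(texts)
bow_corpus = [to_bow(doc, vocab) for doc in texts]
good_id2word = {idx: token for token, idx in vocab.items()}

# Mistake: vocabulary from the bag-of-words vectors themselves,
# so every "token" is an (id, count) tuple
bad_vocab = build_vocab(bow_corpus)
bad_id2word = {idx: token for token, idx in bad_vocab.items()}

print(bow_corpus[1])    # [(0, 2), (1, 1), (2, 1), (3, 1)]
print(good_id2word[0])  # the
print(bad_id2word[0])   # (0, 1)
```

With the second mapping, any topic printout that looks words up by id would show entries such as "(0, 1)", which matches the shape of the output in the question.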

Hey @2er0. Thank you very much for your answer. With your answer above, I get numbers like "(5, 1)" where I should be getting the actual topic words. Do you know why that is? Can you print your full code and tell me which documents you are loading as the corpus? My intuition is that your corpus looks like this:
[(0,1),(2,2),(3,1),(4,1)]
and has no dictionary like [(0,'dog'),(2,'the'),(3,'ate'),(4,'cat')].