Which gensim corpus class should I use to load an LDA-transformed corpus?
How do I load an LDA-transformed corpus from Python's gensim? What I've tried:
from gensim import corpora, models
import numpy.random
numpy.random.seed(10)
doc0 = [(0, 1), (1, 1)]
doc1 = [(0,1)]
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]
corpus = [doc0,doc1,doc2,doc3]
dictionary = corpora.Dictionary(corpus)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
corpus_tfidf.save('x.corpus_tfidf')
# To access the tf-idf corpus I saved, I used corpora.MmCorpus.load()
corpus_tfidf = corpora.MmCorpus.load('x.corpus_tfidf')
lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lda = lda[corpus]
corpus_lda.save('x.corpus_lda')
for i, j in enumerate(corpus_lda):
    print j, corpus[i]
The code above outputs:
[(0, 0.54259038344543631), (1, 0.45740961655456358)] [(0, 1), (1, 1)]
[(0, 0.56718063124157458), (1, 0.43281936875842542)] [(0, 1)]
[(0, 0.54255407573666647), (1, 0.45744592426333358)] [(0, 1), (1, 1)]
[(0, 0.75229707773868093), (1, 0.2477029222613191)] [(0, 3), (1, 1)]
# [(<topic_number_from x.corpus_lda model>,
# <probability of this topic for this document>),
# (<topic# from lda model>, <prob of this top for this doc>)] [<document[i] from corpus>]
After trying every possible corpora.XCorpus class, I tried loading with BleiCorpus. It seems to produce the same output as the saved model, only with fewer decimal places:
>>> from gensim import corpora, models
>>> import numpy.random
>>> numpy.random.seed(10)
>>>
>>> doc0 = [(0, 1), (1, 1)]
>>> doc1 = [(0,1)]
>>> doc2 = [(0, 1), (1, 1)]
>>> doc3 = [(0, 3), (1, 1)]
>>> corpus = [doc0,doc1,doc2,doc3]
>>> dictionary = corpora.Dictionary(corpus)
>>>
>>> tfidf = models.TfidfModel(corpus)
>>> corpus_tfidf = tfidf[corpus]
>>>
>>> lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=3)
>>> corpus_lda = lda[corpus]
>>> corpus_lda.save('x.corpus_lda')
>>>
>>> for i,j in enumerate(corpus_lda):
...     print j, corpus[i]
...
[(0, 0.15441373560695118), (1, 0.56498524668290762), (2, 0.28060101771014123)] [(0, 1), (1, 1)]
[(0, 0.59512220481946487), (1, 0.22817873367464175), (2, 0.17669906150589348)] [(0, 1)]
[(0, 0.52219543266162705), (1, 0.15449347037173339), (2, 0.32331109696663957)] [(0, 1), (1, 1)]
[(0, 0.83364632205849853), (1, 0.086514534997754619), (2, 0.079839142943746944)] [(0, 3), (1, 1)]
>>>
>>> lda_corpus = corpora.BleiCorpus.load('x.corpus_lda')
>>> for i,j in enumerate(lda_corpus):
...     print j, corpus[i]
...
[(0, 0.154413735607), (1, 0.564985246683), (2, 0.280601017710)] [(0, 1), (1, 1)]
[(0, 0.595122204819), (1, 0.228178733675), (2, 0.176699061506)] [(0, 1)]
[(0, 0.522195432662), (1, 0.154493470372), (2, 0.323311096967)] [(0, 1), (1, 1)]
[(0, 0.833646322058), (1, 0.086514534998), (2, 0.079839142944)] [(0, 3), (1, 1)]
There are more problems in your code. To save a corpus in MatrixMarket format, you need:
corpora.MmCorpus.serialize('x.corpus_lda', corpus_lda)
This is described in the gensim documentation.
You are training on corpus_tfidf, but then transforming only lda[corpus] (no tfidf). Use either tfidf or plain bag-of-words, but use it consistently.