Python: Why does Mallet LDA give poor results when the Gensim version doesn't?

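I am doing text analysis with LDA models, and I have heard that the Mallet implementation is the best. However, when I compare it with the Gensim version it seems to produce very poor results, so I think I may be doing something wrong. Can anyone explain the difference?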

import numpy as np  # required for np.repeat below
import gensim
from gensim.corpora.dictionary import Dictionary
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import pyLDAvis
import pyLDAvis.gensim


## Generate a toy corpus:

dog = list(np.repeat('dog', 500)) + list(np.repeat('cat', 20)) + list(np.repeat('bird', 20))
cat = list(np.repeat('dog', 20)) + list(np.repeat('cat', 500)) + list(np.repeat('bird', 20))
bird = list(np.repeat('dog', 20)) + list(np.repeat('cat', 20)) + list(np.repeat('bird', 500))

texts = [dog, cat, bird]

id2word = corpora.Dictionary(texts)

corpus = [id2word.doc2bow(i) for i in texts]
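To make explicit what both models are being fed, the toy corpus can be tallied without gensim at all. This is only an illustrative sketch: `make_doc` is a hypothetical helper, not part of the original post, and it rebuilds the same three documents with plain lists instead of `np.repeat`.

```python
from collections import Counter

def make_doc(dominant, vocab=("dog", "cat", "bird"), major=500, minor=20):
    """Hypothetical helper: one document dominated by a single word."""
    tokens = []
    for word in vocab:
        tokens.extend([word] * (major if word == dominant else minor))
    return tokens

toy_texts = [make_doc(w) for w in ("dog", "cat", "bird")]

# Share of each word per document: the dominant word should take most of the mass.
toy_shares = []
for doc in toy_texts:
    counts = Counter(doc)
    total = sum(counts.values())
    toy_shares.append({word: n / total for word, n in counts.items()})

for shares in toy_shares:
    print({word: round(share, 3) for word, share in shares.items()})
```

Each document has 540 tokens, 500 of which are its dominant word, so a well-fitted three-topic model would be expected to end up with roughly one word per topic.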

### Gensim model

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=3,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)


vis = pyLDAvis.prepared_data_to_html(vis)

with open("LDA_output.html", "w") as file:
    file.write(vis)

This gives the following plausible inference about the topics:

With the Mallet implementation, however, the picture is very different:

mallet_path = '/mallet-2.0.8/bin/mallet'

ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=3, iterations=1000, workers = 4, id2word=id2word)

model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model, corpus, id2word)

vis = pyLDAvis.prepared_data_to_html(vis)

with open("LDA_output.html", "w") as file:
    file.write(vis)

Here the topics inferred by the model are barely distinguishable from one another.

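It seems to me that I am making a basic mistake, perhaps not specifying the relevant model parameters correctly. However, I am at a loss as to what it might be. Any suggestions would be much appreciated.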
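One way to put a number on "barely distinguishable" is to compare the topic-word distributions pairwise. The sketch below is not from the original post: it assumes each topic is available as a row of word probabilities (in gensim, something like `lda_model.get_topics()` should provide such rows), and the two matrices are made-up values over the vocabulary (dog, cat, bird), not output from either model. It uses the Jensen-Shannon divergence in base 2, which is 0 for identical distributions and at most 1.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pairwise_js(topics):
    """Return {(i, j): divergence} for every pair of topic rows."""
    return {
        (i, j): js_divergence(topics[i], topics[j])
        for i in range(len(topics))
        for j in range(i + 1, len(topics))
    }

# Hypothetical topic-word matrices, for illustration only.
distinct = [[0.92, 0.04, 0.04], [0.04, 0.92, 0.04], [0.04, 0.04, 0.92]]
blurred = [[0.34, 0.33, 0.33], [0.33, 0.34, 0.33], [0.33, 0.33, 0.34]]

distinct_scores = pairwise_js(distinct)
blurred_scores = pairwise_js(blurred)
print("distinct topics, smallest pairwise JS:", min(distinct_scores.values()))
print("blurred topics, largest pairwise JS:", max(blurred_scores.values()))
```

Running the same comparison on the native Gensim model and on the converted Mallet model would show whether the converted topics really sit near zero divergence, independently of how pyLDAvis draws them.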

Comments:

This definitely works in Mallet outside of Gensim: after four iterations the model has converged almost completely to one word per topic. Using four threads for three documents is not a good idea, but it does not seem to change the result. What you are seeing looks like a random initialization.

Great, thanks for the insight. Any idea how to deal with it?

This may be related. It seems the malletmodel2ldamodel method causes the topics to become essentially indistinguishable. For some people the problem persists even after the fix.

That is a useful find, thanks. It looks like the problem is still present in Gensim's Mallet wrapper, since updating did nothing. Annoying!

For the record, I solved this by uninstalling and reinstalling gensim. Mallet now works, although its results do not seem to differ much from the gensim implementation, though that may well be down to my data.