Python: Gensim topic modeling with Mallet perplexity


I am topic modeling Harvard Library book titles and subjects.

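I use the Gensim Mallet wrapper to model with Mallet's LDA. When I try to get coherence and perplexity values to see how good the model is, the perplexity calculation fails with the warnings shown further down. I do not get the same error if I use Gensim's built-in LDA model instead of Mallet. My corpus holds more than 700 documents, each up to 50 words long and averaging 20, so the documents are very short.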

Here is the relevant part of my code:

# TOPIC MODELING

import gensim
from gensim.models import CoherenceModel
num_topics = 50

# Build Gensim's LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics,
                                       random_state=100,
                                       update_every=1,
                                       chunksize=100,
                                       passes=10,
                                       alpha='auto',
                                       per_word_topics=True)

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  
# a measure of how good the model is. lower the better.

Perplexity: -47.91929228302663

Coherence Score: 0.28852857563541856

Gensim's LDA gives both scores without any problem. Now I model the same bag of words with Mallet:

# Build the LDA Mallet model
mallet_path = '~/mallet-2.0.8/bin/mallet'  # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path,
                                             corpus=corpus,
                                             num_topics=num_topics,
                                             id2word=id2word)

# Convert the Mallet model to a Gensim LdaModel
mallet_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=mallet_model,
                                           texts=data_words_trigrams,
                                           dictionary=id2word,
                                           coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

Coherence Score: 0.5994123896865993

Then I ask for the perplexity value and get the warnings and the NaN value below:

# Compute Perplexity
print('\nPerplexity: ', mallet_model.log_perplexity(corpus))

/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1108: RuntimeWarning: invalid value encountered in multiply
  score += np.sum((self.eta - _lambda) * Elogbeta)

Perplexity: nan

/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1109: RuntimeWarning: invalid value encountered in subtract
  score += np.sum(gammaln(_lambda) - gammaln(self.eta))

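I realize this is a very Gensim-specific question that requires deeper knowledge of this function: gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

So I would appreciate any comments on the warnings and on the Gensim side of this.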

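I do not think the perplexity function is implemented for the Mallet wrapper. As noted in the reply quoted here, Mallet prints the perplexity to standard output:

AFAIR, Mallet displays the perplexity on stdout. Would that be enough for you? Capturing these values programmatically should be possible too, but I haven't looked into that. Hopefully Mallet also has some API call for perplexity evaluation, but it is certainly not included in the wrapper.

I just ran it on a sample corpus, and the LL/token value was indeed printed every so many iterations:

LL/token: -9.45493

perplexity = 2^(-LL/token) = 701.81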
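The conversion above can be sketched in a few lines of Python. This is only an illustration: it assumes Mallet's console output has already been captured as text, and both the `sample_log` string and the helper name `perplexity_from_mallet_log` are made up for this example. Only the `LL/token: <value>` fragment is taken from the run shown above.

```python
import re

# Matches the "LL/token: <number>" fragment that Mallet prints during training.
LL_TOKEN_RE = re.compile(r"LL/token:\s*(-?\d+(?:\.\d+)?)")


def perplexity_from_mallet_log(log_text):
    """Return 2^(-LL/token) for the last LL/token value found in log_text."""
    matches = LL_TOKEN_RE.findall(log_text)
    if not matches:
        raise ValueError("no LL/token line found in the log text")
    ll_per_token = float(matches[-1])  # the last value reflects the latest iteration
    return 2.0 ** (-ll_per_token)


# Hypothetical captured output; only the LL/token value comes from the run above.
sample_log = "LL/token: -9.45493\n"
print(round(perplexity_from_mallet_log(sample_log), 2))
```

Taking the last match rather than the first is a design choice: earlier LL/token lines describe a model that has not finished training.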

My two cents:

It seems that in lda_model.log_perplexity(corpus) you are using the same corpus that you used for training. You might have better luck with a held-out set of the corpus.
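A minimal sketch of holding out part of a corpus, assuming the corpus is a list of bag-of-words documents (lists of `(token_id, count)` pairs) as Gensim uses. The helper `split_corpus`, its 20% default, and the toy corpus are illustrative assumptions, not something from the question.

```python
import random


def split_corpus(corpus, held_out_fraction=0.2, seed=100):
    """Shuffle a copy of the corpus and split it into (train, held_out) lists."""
    docs = list(corpus)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(docs)
    n_held_out = max(1, int(len(docs) * held_out_fraction))
    return docs[n_held_out:], docs[:n_held_out]


# Toy bag-of-words corpus, made up for this example.
toy_corpus = [[(0, 1), (1, 2)], [(1, 1)], [(2, 3)], [(0, 2), (2, 1)], [(3, 1)]]
train_corpus, held_out_corpus = split_corpus(toy_corpus)
print(len(train_corpus), len(held_out_corpus))

# The model would then be trained on train_corpus only and evaluated with
# lda_model.log_perplexity(held_out_corpus).
```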

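lda_model.log_perplexity(corpus) does not return the perplexity. It returns the "bound". If you want to turn it into a perplexity, use np.exp2(-bound). I struggled with this for a while :)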
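As a small illustration of that conversion, the sketch below uses plain Python instead of NumPy so that it runs without extra packages; `2.0 ** (-bound)` computes the same quantity as `np.exp2(-bound)`. The helper name `bound_to_perplexity` is made up for this example, and the input value is the one printed for the Gensim LDA model in the question.

```python
def bound_to_perplexity(per_word_bound):
    """Convert the per-word bound returned by log_perplexity into a perplexity."""
    return 2.0 ** (-per_word_bound)


# Value reported by lda_model.log_perplexity(corpus) in the question.
bound = -47.91929228302663
print(bound_to_perplexity(bound))
```

A bound this far below zero yields a very large perplexity, which is worth keeping in mind when comparing it against values reported by other tools.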

There is no way to report perplexity through the Mallet wrapper:

# Compute Perplexity
print('\nPerplexity: ', mallet_model.log_perplexity(corpus))