Python: tagging an arbitrary article with an LDA model trained on the Wikipedia corpus?
I followed the steps in the gensim Python tutorial to train an LDA model on Wikipedia, and now I want to compare an arbitrary article from cnn.com against the trained data. What do I need to do next? Assume the article is a txt file.

Update: to follow the tutorial and use the text file more precisely, do the following:
# Imports (assumes gensim is installed)
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.test.utils import common_texts

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Train the model on the corpus.
lda = LdaModel(common_corpus, id2word=common_dictionary, num_topics=10)

# Optional: print the topics of your model
for topic in lda.print_topics(10):
    print(topic)

# Load your CNN article from file
with open("cnn.txt", "r") as file:
    cnn = file.read()

# Split the article into a list of words, and make that list an element of a list
cnn = [cnn.split(" ")]
cnn_corpus = [common_dictionary.doc2bow(text) for text in cnn]
unseen_doc = cnn_corpus[0]
vector = lda[unseen_doc]  # get the topic probability distribution for the document

# Print the "similarity" of the CNN article to each of the topics
# (a bigger number means more similar to that topic)
print(vector)
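The printed `vector` is a sparse list of `(topic_id, probability)` pairs. To compare the CNN article against another document run through the same model, a common approach is the cosine similarity of the two topic distributions; gensim provides this as `gensim.matutils.cossim`, but a minimal pure-Python sketch of the same computation (the example distributions are made up) looks like this:

```python
import math

def cossim(vec1, vec2):
    """Cosine similarity between two sparse (id, weight) vectors,
    as returned by lda[bow]: 1.0 = same direction, 0.0 = no overlap."""
    d1 = dict(vec1)
    d2 = dict(vec2)
    dot = sum(w * d2.get(i, 0.0) for i, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)

# Two hypothetical topic distributions sharing topic 0:
doc_a = [(0, 0.8), (3, 0.2)]
doc_b = [(0, 0.6), (2, 0.4)]
print(cossim(doc_a, doc_b))
```

A higher value means the two documents lean on the same topics of the trained model.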
OK, but I can't connect this example to the results I got from training LDA on Wikipedia. How exactly do you link them together? — This is more or less exactly the example given at the end of the tutorial: "As usual, the trained model can be used to transform new, unseen documents (plain bag-of-words count vectors) into LDA topic distributions:

doc_lda = lda[doc_bow]

" — Sorry, I'm not a great programmer: in the script above, where do I plug in the trained Wikipedia results? And can I just make cnn_article = the loaded text file, or should I turn it into a vector/matrix first?
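To answer the follow-up: the Wikipedia-trained results replace `common_dictionary` and `lda` in the script above. If you saved the model and the word-id mapping produced by the Wikipedia tutorial, loading them could look like the sketch below (the file names are assumptions — substitute whatever paths your Wikipedia run actually produced):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical paths — use the files your Wikipedia preprocessing/training saved
wiki_dictionary = Dictionary.load_from_text("wiki_en_wordids.txt.bz2")
wiki_lda = LdaModel.load("wiki_en_lda.model")

# The CNN article stays a plain text file: read it, tokenize it the same way
# the Wikipedia corpus was tokenized, and convert it to a bag-of-words vector
with open("cnn.txt", "r") as file:
    tokens = file.read().lower().split()

doc_bow = wiki_dictionary.doc2bow(tokens)
doc_lda = wiki_lda[doc_bow]  # topic distribution of the unseen article
print(doc_lda)
```

So yes, the article is loaded as text, but it must be tokenized and passed through `doc2bow` with the *same* dictionary used in training before the model can score it.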