Python: tagging an arbitrary article with an LDA model trained on the Wikipedia corpus?
I followed the steps in the gensim Python tutorial to train an LDA model on Wikipedia, and now I want to compare an arbitrary article from cnn.com against the trained data. What do I need to do next? Assume the article is a txt file.

Update: to follow the tutorial and use the text file more precisely, do the following:
# Imports (assumes gensim is installed)
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.test.utils import common_texts

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Train the model on the corpus.
lda = LdaModel(common_corpus, id2word=common_dictionary, num_topics=10)

# Optional: print the topics of your model
for topic in lda.print_topics(10):
    print(topic)

# Load your CNN article from file
with open("cnn.txt", "r") as file:
    cnn = file.read()

# Split the article into a list of words, and make that list an element of a list
cnn = [cnn.split(" ")]
cnn_corpus = [common_dictionary.doc2bow(text) for text in cnn]
unseen_doc = cnn_corpus[0]
vector = lda[unseen_doc]  # get the topic probability distribution for the document

# Print the "similarity" of the CNN article to each of the topics
# (a bigger number means more similar to that topic)
print(vector)
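The printed `vector` is a sparse list of `(topic_id, probability)` pairs. To compare the CNN article against another document run through the same model, a common approach is the cosine similarity of the two topic distributions; gensim provides this as `gensim.matutils.cossim`, but a minimal pure-Python sketch of the same computation (the example distributions are made up) looks like this:

```python
import math

def cossim(vec1, vec2):
    """Cosine similarity between two sparse (id, weight) vectors,
    as returned by lda[bow]: 1.0 = same direction, 0.0 = no overlap."""
    d1 = dict(vec1)
    d2 = dict(vec2)
    dot = sum(w * d2.get(i, 0.0) for i, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)

# Two hypothetical topic distributions sharing topic 0:
doc_a = [(0, 0.8), (3, 0.2)]
doc_b = [(0, 0.6), (2, 0.4)]
print(cossim(doc_a, doc_b))
```

A higher value means the two documents lean on the same topics of the trained model.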
OK, but I can't connect this example to the results I got from training LDA on Wikipedia. How exactly do you link them together? — This is more or less exactly the example given at the end of the tutorial: "As usual, the trained model can be used to transform new, unseen documents (plain bag-of-words count vectors) into LDA topic distributions:

doc_lda = lda[doc_bow]

" — Sorry, I'm not a great programmer: in the script above, where do I plug in the trained Wikipedia results? And can I just make cnn_article = the loaded text file, or should I turn it into a vector/matrix first?
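To answer the follow-up: the Wikipedia-trained results replace `common_dictionary` and `lda` in the script above. If you saved the model and the word-id mapping produced by the Wikipedia tutorial, loading them could look like the sketch below (the file names are assumptions — substitute whatever paths your Wikipedia run actually produced):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical paths — use the files your Wikipedia preprocessing/training saved
wiki_dictionary = Dictionary.load_from_text("wiki_en_wordids.txt.bz2")
wiki_lda = LdaModel.load("wiki_en_lda.model")

# The CNN article stays a plain text file: read it, tokenize it the same way
# the Wikipedia corpus was tokenized, and convert it to a bag-of-words vector
with open("cnn.txt", "r") as file:
    tokens = file.read().lower().split()

doc_bow = wiki_dictionary.doc2bow(tokens)
doc_lda = wiki_lda[doc_bow]  # topic distribution of the unseen article
print(doc_lda)
```

So yes, the article is loaded as text, but it must be tokenized and passed through `doc2bow` with the *same* dictionary used in training before the model can score it.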