
Python: training an LDA model on the Wikipedia corpus to tag arbitrary articles?

Tags: python, nltk, gensim

I followed the steps in the gensim tutorial to train an LDA model on Wikipedia in Python, and now I want to compare an arbitrary article from cnn.com against the trained model. What do I need to do next? Assume the article is a txt file.

Take:

Then obtain the similarity by using

Update:

To follow the tutorial more closely and work from a text file, do the following:

# imports from gensim (common_texts is gensim's tiny built-in example corpus)
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Train the model on the corpus.
lda = LdaModel(common_corpus, id2word=common_dictionary, num_topics=10)

# optional: print topics of your model
for topic in lda.print_topics(10):
    print(topic)

# load your CNN article from file
with open("cnn.txt", "r") as file:
    cnn = file.read()

# split article into list of words and make this list an element of a list
cnn = [cnn.split(" ")]

cnn_corpus = [common_dictionary.doc2bow(text) for text in cnn]

unseen_doc = cnn_corpus[0]
vector = lda[unseen_doc] # get topic probability distribution for a document

# print out «similarity» of cnn article to each of the topics
# bigger number = more similar to topic 
print(vector)
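
The answer also mentions obtaining a similarity score, but the links from the original post were lost. A minimal sketch of one way to do it (an addition here, not part of the original answer) is to index the LDA topic vectors of the training documents with gensim's MatrixSimilarity and query the index with the article's topic vector:

from gensim.similarities import MatrixSimilarity

# build a cosine-similarity index over the topic vectors of the training corpus
index = MatrixSimilarity(lda[common_corpus], num_features=lda.num_topics)

# similarity of the unseen CNN article to each training document
sims = index[lda[unseen_doc]]
print(sorted(enumerate(sims), key=lambda item: -item[1]))  # most similar first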

OK, but I can't connect this example to the results I got from training LDA on Wikipedia. How exactly do you tie the two together? This is more or less exactly the example given at the end of the tutorial: «As usual, the trained model can be used to transform new, unseen documents (plain bag-of-words count vectors) into LDA topic distributions:

doc_lda = lda[doc_bow]

» Sorry, I'm not a great programmer. In the script above, where do I add the trained Wikipedia results? Can I just set the loaded text file as the cnn article, or should I plug it into a vector/matrix first?
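
Regarding that comment: the pieces built from common_texts (common_dictionary, lda) are only the tutorial's toy example; to use your Wikipedia results you would load the dictionary and LDA model produced by your Wikipedia training run and score the article with those instead. A minimal sketch, assuming both were saved to disk (the file names wiki_wordids.txt.bz2 and wiki_lda.model are placeholders for whatever your run actually wrote):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# load the artifacts from the Wikipedia training run
# (placeholder file names: use the paths you actually saved)
wiki_dictionary = Dictionary.load_from_text("wiki_wordids.txt.bz2")
wiki_lda = LdaModel.load("wiki_lda.model")

# load and tokenize the CNN article the same way as before
with open("cnn.txt", "r") as file:
    cnn_tokens = file.read().split(" ")

# convert the article with the *Wikipedia* dictionary, then infer its topics
cnn_bow = wiki_dictionary.doc2bow(cnn_tokens)
print(wiki_lda[cnn_bow])  # topic probability distribution for the article

The important point is that the article must be converted to bag-of-words with the same Dictionary the Wikipedia model was trained on; otherwise the word ids will not line up with the model.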