Python 3.x: How exactly does Latent Dirichlet Allocation work?

Tags: python-3.x, scikit-learn, nlp, latent-semantic-analysis

I have some text, and I am using sklearn to extract topics from it.

I have already converted the text to sequences using Keras, and I am doing this:

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation()
X_topics = lda.fit_transform(X)
X:

print(X)
#  array([[0, 988, 233, 21, 42, 5436, ...],
#         [0, 43, 6526, 21, 566, 762, 12, ...]])
X_topics:

print(X_topics)
#  array([[1.24143852e-05, 1.23983890e-05, 1.24238815e-05, 2.08399432e-01,
#          7.91563331e-01],
#         [5.64976371e-01, 1.33304549e-05, 5.60003133e-03, 1.06638803e-01,
#          3.22771464e-01]])

My question is: what exactly does fit_transform return? I understand it should be the dominant topics detected in the text, but I cannot map these numbers back to an index, so I cannot see what these sequences mean. I have not been able to find an explanation of what is actually going on, so any advice would be very welcome.

First, a general explanation: think of LDiA as a clustering algorithm that, by default, will determine 10 centroids based on the frequencies of the words in the texts, and that gives some of those words greater weight than others because of their proximity to a centroid. Each centroid represents a "topic"; the topics are not named, but can be characterised by the words that are most dominant in each cluster.

So the things you typically do with LDA are:

  • Have it tell you the 10 (or however many) topics of a given body of text.
  • Have it tell you which centroid/topic a new text is closest to.
For the second case, the expectation is that LDiA will output a "score" for the new text against each of the 10 clusters/topics. The index with the highest score is the index of the cluster/topic the new text belongs to, as sketched below.
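
For instance, once you have the scores for a new text, picking its topic is just a matter of taking the index of the highest score. A minimal sketch, using a made-up score row rather than real output:

import numpy as np

#one row per document, one column per topic; each row sums to 1
scores = np.array([[0.55, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]])

best_topic = scores.argmax(axis=1)   #index of the highest-scoring topic per document
print(best_topic)                    # -> [0]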

I prefer gensim.models.LdaMulticore, but since you have used sklearn.decomposition.LatentDirichletAllocation I'll stick with that.
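
For completeness, a rough gensim equivalent might look like the sketch below; the toy tokenised documents are invented for illustration, and this is not code from the original answer.

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

#toy tokenised documents, purely for illustration
texts = [["graphics", "image", "file", "format"],
         ["car", "engine", "oil", "speed"],
         ["graphics", "card", "driver", "file"]]

dictionary = Dictionary(texts)                   #map each word to an integer id
corpus = [dictionary.doc2bow(t) for t in texts]  #bag-of-words counts per document

lda = LdaMulticore(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
print(lda.print_topics())                        #top words and weights per topic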

Here is some sample code (adapted from an existing example) that walks through this process:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20

def print_top_words(model, feature_names, n_top_words):
    #for each topic, print the words with the largest weights in model.components_
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
    
data, _ = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             return_X_y=True)
X = data[:n_samples]
#create a count vectorizer using the sklearn CountVectorizer which has some useful features
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
vectorizedX = tf_vectorizer.fit_transform(X)
#fit an LDA model with n_components topics on the document-term counts
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(vectorizedX)

Now let's try a new text:

testX = tf_vectorizer.transform(["I am educated about learned stuff"])
#get lda to score this text against each of the 10 topics
lda.transform(testX)

Out:
array([[0.54995409, 0.05001176, 0.05000163, 0.05000579, 0.05      ,
        0.05001033, 0.05000001, 0.05001449, 0.05000123, 0.05000066]])

#looks like the first topic has the high score - now what are the words that are most associated with each topic?
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out()  #use get_feature_names() on older sklearn versions
print_top_words(lda, tf_feature_names, n_top_words)

Out:
Topics in LDA model:
Topic #0: edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
Topic #1: don like just know think ve way use right good going make sure ll point got need really time doesn
Topic #2: christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
Topic #3: drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
Topic #4: hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
Topic #5: god people does just good don jesus say israel way life know true fact time law want believe make think
Topic #6: 55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
Topic #7: car year just cars new engine like bike good oil insurance better tires 000 thing speed model brake driving performance
Topic #8: people said did just didn know time like went think children came come don took years say dead told started
Topic #9: key space law government public use encryption earth section security moon probe enforcement keys states lunar military crime surface technology


That seems reasonable: the sample text is about education, and the word cloud for the first topic is about education.

The accompanying images are from another dataset (ham vs spam text messages, so only two possible topics), reduced to 3 dimensions using PCA; in case a picture helps, the two views (the same data from different angles) give a general feel for what LDiA is doing. (The charts are actually from latent discriminant analysis rather than LDiA, but the representation is still relevant.)
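
A rough sketch of that kind of projection, reusing the doc-topic scores from the fitted model above (this is an assumption about how such a plot could be made, not the original plotting code):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

doc_topic = lda.transform(vectorizedX)                  #(n_docs, n_topics) topic scores
coords = PCA(n_components=3).fit_transform(doc_topic)   #project down to 3 dimensions

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2],
           c=doc_topic.argmax(axis=1), s=5)             #colour points by dominant topic
plt.show()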

Although LDiA is an unsupervised approach, to actually use it in a business context you will probably want to intervene manually, at least to give the topics names that are meaningful in your context, e.g. assigning a subject area to stories on a news aggregation site, chosen from ['Business', 'Sports', 'Entertainment', ...].
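
A minimal sketch of what that manual step might look like; the topic names here are invented for illustration, chosen after looking at the top words printed above:

#hand-picked, hypothetical labels for some of the topics found above
topic_names = {
    0: "academia / email",
    3: "pc hardware",
    6: "sports",
    7: "cars",
}

scores = lda.transform(testX)           #topic scores for the new document
best = int(scores.argmax(axis=1)[0])    #index of the most likely topic
print(topic_names.get(best, "topic #%d" % best))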

For further study, it may be worth working through something along these lines.

Comment: It would also be helpful to add something about how to name the topics (analyse the top topic words?). I couldn't reproduce the last few lines of code that print each topic; it would be great if you could edit the answer to make that reproducible.

Reply: My apologies, I must have deleted some of the example code when writing up this answer. It should be fine now.