Python: Latent Dirichlet Allocation with prior topic words

python scikit-learn nlp

Context

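I am trying to extract topics from a set of texts using LatentDirichletAllocation from scikit-learn. This works really well, except for the quality of the topic words that are found/selected.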

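In one article, the authors describe using prior topic words as input for the LDA. They manually choose 4 topics and the main words that are associated with/belong to these topics. For these words, they set the default value to a high number for the associated topic and to 0 for the other topics. All other words (those not manually selected for a topic) are given equal values for all topics (1). This matrix of values is used as input for the LDA.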
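The matrix described above can be sketched with NumPy. This is only an illustration under stated assumptions: the toy vocabulary, the seed words, and the value 1000.0 standing in for "high" are made up for this example and are not taken from the article.

```python
import numpy as np

# Hypothetical toy vocabulary and seed words (not from the article).
vocabulary = ["goal", "match", "election", "vote", "weather"]
seed_words = {0: ["goal", "match"], 1: ["election", "vote"]}  # topic index -> words
n_topics = 2
high_value = 1000.0  # assumed stand-in for "a high value"

word_index = {word: i for i, word in enumerate(vocabulary)}

# Every word starts with the same value (1) for every topic.
prior = np.ones((n_topics, len(vocabulary)))

for topic, words in seed_words.items():
    for word in words:
        col = word_index[word]
        prior[:, col] = 0.0             # 0 for all other topics
        prior[topic, col] = high_value  # high value for the associated topic

print(prior)
```

Each row is a topic and each column a word, which is the same orientation as the components_ matrix of LatentDirichletAllocation.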

My question

How can I create a similar analysis with the LatentDirichletAllocation module from scikit-learn, using a customized default values matrix (prior topic words) as input?

(I know there is a topic_word_prior parameter, but it only takes a single float, not a matrix with different "default values".)

Edit

Solution

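With the help of @Anis, I created a subclass of the original module and edited the function that sets the initial values matrix. For every prior topic word you provide as input, it transforms the components_ matrix by multiplying the values in that word's column by the topic values of that (prior) word.

The code is as follows: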

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.utils import check_random_state

# Private helper; the module name differs between scikit-learn versions
try:
    from sklearn.decomposition._online_lda_fast import _dirichlet_expectation_2d
except ImportError:
    from sklearn.decomposition._online_lda import _dirichlet_expectation_2d

# List with prior topic words as tuples
# (word index, [topic values])
prior_topic_words = []

# Example (word at index 3000 belongs to topic with index 0);
# the list of topic values must have n_components entries
prior_topic_words.append(
    (3000, [(np.finfo(np.float64).max / 4), 0., 0., 0., 0.])
)


# Custom subclass for PTW-guided LDA
class PTWGuidedLatentDirichletAllocation(LatentDirichletAllocation):

    # All parameters are listed explicitly (no *args/**kwargs), as scikit-learn
    # requires. The deprecated n_topics alias of n_components is omitted.
    def __init__(self, n_components=10, doc_topic_prior=None,
                 topic_word_prior=None, learning_method='batch',
                 learning_decay=0.7, learning_offset=10.0, max_iter=10,
                 batch_size=128, evaluate_every=-1, total_samples=1000000.0,
                 perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100,
                 n_jobs=None, verbose=0, random_state=None, ptws=None):
        # Keyword arguments stay valid even if the parent signature is keyword-only
        super().__init__(
            n_components=n_components, doc_topic_prior=doc_topic_prior,
            topic_word_prior=topic_word_prior, learning_method=learning_method,
            learning_decay=learning_decay, learning_offset=learning_offset,
            max_iter=max_iter, batch_size=batch_size,
            evaluate_every=evaluate_every, total_samples=total_samples,
            perp_tol=perp_tol, mean_change_tol=mean_change_tol,
            max_doc_update_iter=max_doc_update_iter, n_jobs=n_jobs,
            verbose=verbose, random_state=random_state)
        self.ptws = ptws

    # dtype has a default so the override works whether or not fit passes it
    def _init_latent_vars(self, n_features, dtype=np.float64):
        """Initialize latent variables."""

        self.random_state_ = check_random_state(self.random_state)
        self.n_batch_iter_ = 1
        self.n_iter_ = 0

        if self.doc_topic_prior is None:
            self.doc_topic_prior_ = 1. / self.n_components
        else:
            self.doc_topic_prior_ = self.doc_topic_prior

        if self.topic_word_prior is None:
            self.topic_word_prior_ = 1. / self.n_components
        else:
            self.topic_word_prior_ = self.topic_word_prior

        init_gamma = 100.
        init_var = 1. / init_gamma
        # In the literature, this is called `lambda`
        self.components_ = self.random_state_.gamma(
            init_gamma, init_var, (self.n_components, n_features)
        ).astype(dtype, copy=False)

        # Transform topic values in matrix for prior topic words:
        # scale each prior word's column by its per-topic values
        if self.ptws is not None:
            for word_index, word_topic_values in self.ptws:
                self.components_[:, word_index] *= word_topic_values

        # In the literature, this is `exp(E[log(beta)])`
        self.exp_dirichlet_component_ = np.exp(
            _dirichlet_expectation_2d(self.components_))

Initialization is the same as for the original LatentDirichletAllocation class, but you can now provide prior topic words using the ptws parameter.
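The word indices in prior_topic_words have to match the columns of the document-term matrix. Below is a minimal sketch of building the list from seed words, assuming a term-to-column mapping such as the vocabulary_ attribute of a fitted CountVectorizer. The mapping, the seed words, the boost value, and the helper name build_prior_topic_words are all hypothetical.

```python
# Hypothetical term -> column index mapping
# (for example CountVectorizer().fit(docs).vocabulary_).
vocabulary = {"goal": 0, "match": 1, "election": 2, "vote": 3, "weather": 4}

# Hypothetical seed words per topic index.
seed_words = {0: ["goal", "match"], 1: ["election", "vote"]}
n_topics = 3
boost = 1e6  # assumed "high" multiplier for the seeded topic


def build_prior_topic_words(vocabulary, seed_words, n_topics, boost):
    """Return a list of (word index, [multiplier per topic]) tuples."""
    ptws = []
    for topic, words in seed_words.items():
        for word in words:
            if word not in vocabulary:
                continue  # seed word does not occur in the corpus
            multipliers = [0.0] * n_topics
            multipliers[topic] = boost
            ptws.append((vocabulary[word], multipliers))
    return ptws


prior_topic_words = build_prior_topic_words(vocabulary, seed_words, n_topics, boost)
print(prior_topic_words)
```

The resulting list would then be passed as ptws=prior_topic_words, with n_components equal to n_topics so that each list of multipliers has one entry per topic.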

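After looking at the source code and the documentation, it seems to me that the easiest approach is to subclass LatentDirichletAllocation and override only the _init_latent_vars method. This is the method called in fit to create the components_ attribute, which is the matrix used for the decomposition. By re-implementing this method you can set the matrix the way you want, and in particular boost the prior weights for the related topics/features. That is where you would re-implement the initialization logic of the paper.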
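To see why scaling entries of the components matrix acts like a prior weight, it can help to look at the quantity derived from it, exp(E[log(beta)]), where each row of the matrix is treated as the parameter vector of a Dirichlet distribution. The sketch below computes it with scipy.special.psi rather than scikit-learn's private helper. The matrix sizes, the boosted entry, and the factor 1000 are assumptions chosen for illustration.

```python
import numpy as np
from scipy.special import psi  # digamma function

rng = np.random.RandomState(0)
n_topics, n_features = 3, 6

# Same kind of random start as in the initialization code: gamma(100, 1/100).
components = rng.gamma(100.0, 1.0 / 100.0, (n_topics, n_features))


def exp_dirichlet_expectation(lam):
    """exp(E[log(beta)]) for each row of lam, treated as Dirichlet parameters."""
    return np.exp(psi(lam) - psi(lam.sum(axis=1, keepdims=True)))


before = exp_dirichlet_expectation(components)

# Boost word 2 for topic 0 only (hypothetical factor).
boosted = components.copy()
boosted[0, 2] *= 1000.0
after = exp_dirichlet_expectation(boosted)

print(before[0].round(3))
print(after[0].round(3))
```

In topic 0 the boosted word's value rises while the values of the other words in that row fall, and the other topics are unchanged. This is also why the derived matrix has to be recomputed after the components matrix is modified.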

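Have you tried manually editing the coefficients of the model's components matrix? It seems to me that this is what you want to achieve.

Thanks for the quick reply, that is what I am trying to figure out. (I am not sure which (internal) attributes I must/can adjust, and what range of values I can put in them.)

It seems to me that it is the model's components matrix, since it is used directly for training. You can use model.components_[i, j] = aij to set the value aij for topic i and feature j.

I assume this should happen before fitting the model? Does the range of the values matter? (For example, can I use 0, 1 and a large positive float?) Thanks! I am working along these lines and will post the code if I find a solution :)

Yes, initializing this matrix is indeed not obvious. I cannot take this any further, so it is up to you now. One last thing: if you look at the original implementation of _init_latent_vars, you will see another matrix called exp_dirichlet_component_, which you need to take care of after you have worked on components_.

I am transforming the components_ matrix with the ptw matrix before exp_dirichlet_component_ is computed, so that should be taken into account. I am testing my implementation now and will keep you posted.

When I tried this solution I got the following error: RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). Constructor (self, ptws=None, *args, **kwargs) doesn't follow this convention.

I updated the code in the solution block. It should now pass the test you mentioned. The downside is that the code becomes less readable, and if the default values of the superclass change, the subclass will not pick them up automatically. In any case, it should now conform to the recommended standard.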
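The RuntimeError quoted above arises because scikit-learn discovers an estimator's parameter names from the constructor signature. The following is a simplified stand-in for that kind of check, written with the standard inspect module. The helper uses_varargs and both classes are hypothetical, and this is not scikit-learn's actual code.

```python
import inspect


def uses_varargs(cls):
    """Return True if cls.__init__ declares *args (hypothetical helper)."""
    params = inspect.signature(cls.__init__).parameters.values()
    return any(p.kind == inspect.Parameter.VAR_POSITIONAL for p in params)


class WithVarargs:
    # Mirrors the constructor shape reported in the error message.
    def __init__(self, ptws=None, *args, **kwargs):
        self.ptws = ptws


class ExplicitParams:
    # Every parameter is named, so its name can be read from the signature.
    def __init__(self, n_components=10, ptws=None):
        self.n_components = n_components
        self.ptws = ptws


print(uses_varargs(WithVarargs))
print(uses_varargs(ExplicitParams))
```

Listing every parameter explicitly is more verbose, but it lets tools that read the signature see all parameter names.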