Strange output from topic modeling in Python using PAM/LDA


I am trying to do topic modeling on a column of my DataFrame, which contains only English words. You can substitute any text for it:

dfi['clean_text']
Out[154]: 
0        thank you for calling my name is gabrielle and...
1        your available my first name is was there you ...
2                                                    good 
3                                           go head sorry 
4        no go head i mean how do you want to pull my r...
                       
14676                              just the email is fine 
14677    okay great so then everything is process here ...
14678                         no thats it i appreciate it 
14679    yes and thank you very much we appreciated hav...
14680                                   thank you bye bye

My model:

# Topic model (LDA via tomotopy)
import tomotopy as tp
from pprint import pprint

model = tp.LDAModel(k=2, seed=1)  #k is the number of topics

for texts in dfi['clean_text']:
    model.add_doc(texts)

model.train(iter=100)

#Extracting the word distribution of a topic
for k in range(model.k):
    print(f"Topic {k}")
    pprint(model.get_topic_words(k, top_n=5))

Output:

Topic 0
[(' ', 0.2129271924495697),
 ('e', 0.08137548714876175),
 ('o', 0.0749373733997345),
 ('a', 0.07390690594911575),
 ('t', 0.06929121911525726)]
Topic 1
[(' ', 0.19975200295448303),
 ('e', 0.09751541167497635),
 ('t', 0.06939278542995453),
 ('i', 0.06373799592256546),
 ('o', 0.06239694356918335)]

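But as you can see, the output does not show words or strings for each topic; for some strange reason it only shows individual letters. I am new to Python, so I may be missing something here.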

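You need to tokenize your text in some way, that is, decide on the rules for splitting a longer text string into a list of tokens (essentially words). Without that step, iterating over a single string (which is most likely what tomotopy does behind the scenes) returns a list of characters. That is why the example output shows topics made up of single-letter tokens.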
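
To see this effect in isolation, here is a minimal sketch in plain Python with no tomotopy involved. The sample string is one of the rows shown in the question; the variable names are only illustrative.

```python
# Iterating over a str yields single characters;
# splitting on whitespace first yields words.
text = "go head sorry "

chars = list(text)             # what you get by iterating the raw string
words = text.strip().split()   # simple whitespace tokenization

print(chars[:4])   # ['g', 'o', ' ', 'h']
print(words)       # ['go', 'head', 'sorry']
```

The space character shows up as a token in the first list, which matches the ' ' entry at the top of both topics in the question's output.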

Tokenization is a huge topic in its own right, but as a minimal starting point you can use text.strip().split(), like this:

for texts in dfi['clean_text']:
    model.add_doc(texts.strip().split()) # edited this line

Output:

Topic 0
[('go', 0.21682846546173096),
 ('name', 0.21682846546173096),
 ('to', 0.10895361006259918),
 ('no', 0.10895361006259918),
 ('r', 0.10895361006259918)]
Topic 1
[('you', 0.11457936465740204),
 ('my', 0.11457936465740204),
 ('head', 0.0765131339430809),
 ('is', 0.0765131339430809),
 ('first', 0.03844689577817917)]
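
Whitespace splitting still lets very common words such as "you", "my" and "is" dominate the topics. As one possible next step, the sketch below lowercases the text, keeps runs of letters and apostrophes, and drops stop words and one-letter tokens. The tokenize function, the STOP_WORDS set and the min_len parameter are hypothetical helpers written for this illustration, not part of tomotopy or pandas, and the stop-word list is deliberately tiny.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"you", "my", "is", "to", "the", "i", "it", "and", "for"}

def tokenize(text, stop_words=STOP_WORDS, min_len=2):
    """Lowercase, keep runs of letters/apostrophes, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in stop_words]

# Sample rows taken from the data shown in the question.
docs = ["go head sorry ", "just the email is fine ", "thank you bye bye"]
tokenized = [tokenize(d) for d in docs]

print(tokenized)
# [['go', 'head', 'sorry'], ['just', 'email', 'fine'], ['thank', 'bye', 'bye']]
```

Each resulting list could then be passed to model.add_doc in place of texts.strip().split(). Note that a row consisting only of stop words tokenizes to an empty list, so you may want to skip such rows before adding them to the model.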