Strange output from topic modeling in Python using pamlda
I am trying to do topic modeling on my data frame, which consists only of English words (you can substitute any text for it):
dfi['clean_text']
Out[154]:
0 thank you for calling my name is gabrielle and...
1 your available my first name is was there you ...
2 good
3 go head sorry
4 no go head i mean how do you want to pull my r...
14676 just the email is fine
14677 okay great so then everything is process here ...
14678 no thats it i appreciate it
14679 yes and thank you very much we appreciated hav...
14680 thank you bye bye
My model:
#Pachinko Allocation Model
import tomotopy as tp
from pprint import pprint
model = tp.LDAModel(k=2, seed=1) #k is the number of topics
for texts in dfi['clean_text']:
    model.add_doc(texts)
model.train(iter=100)
#Extracting the word distribution of a topic
for k in range(model.k):
    print(f"Topic {k}")
    pprint(model.get_topic_words(k, top_n=5))
Topic 0
[(' ', 0.2129271924495697),
('e', 0.08137548714876175),
('o', 0.0749373733997345),
('a', 0.07390690594911575),
('t', 0.06929121911525726)]
Topic 1
[(' ', 0.19975200295448303),
('e', 0.09751541167497635),
('t', 0.06939278542995453),
('i', 0.06373799592256546),
('o', 0.06239694356918335)]
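But as you can see here, the output does not show strings or words per topic; for some odd reason it just shows single letters. I am new to Python and may be missing something here.

You need to tokenize your text in some way, i.e. decide on the rules for splitting a longer text string into a list of tokens (essentially words). Without this, iterating over a single string (which is most likely what tomotopy does under the hood) returns a list of characters. That is why you see topics made of single-letter tokens in the example output.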
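To see the effect in plain Python, without tomotopy, compare iterating over a raw string with iterating over a list of words. This is only a small illustration; the sample sentence is one of the rows shown in the question, and the variable names are made up for the example.

```python
# Iterating over a str yields one character at a time, spaces included.
doc = "go head sorry"

chars = list(doc)            # what you get when a raw string is iterated
words = doc.strip().split()  # what you get after a simple whitespace split

print(chars[:4])
print(words)
```

The space character shows up as a "token" in the first list, which matches the `' '` entry at the top of both topics in the question.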
Tokenization is a huge topic in itself, but as a minimal starting point you can use text.strip().split(), as follows:
for texts in dfi['clean_text']:
    model.add_doc(texts.strip().split()) # edited this line
which returns:
Topic 0
[('go', 0.21682846546173096),
('name', 0.21682846546173096),
('to', 0.10895361006259918),
('no', 0.10895361006259918),
('r', 0.10895361006259918)]
Topic 1
[('you', 0.11457936465740204),
('my', 0.11457936465740204),
('head', 0.0765131339430809),
('is', 0.0765131339430809),
('first', 0.03844689577817917)]
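If whitespace splitting turns out to be too crude (note the stray 'r' token above), a slightly stricter tokenizer could look like the sketch below. The `tokenize` helper, its regular expression, and the `min_len` cutoff are hypothetical choices made for illustration, not part of tomotopy; the check for non-string values assumes the column may contain missing entries.

```python
import re

# Hypothetical rule: a token is a run of lowercase letters or apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")

def tokenize(text, min_len=2):
    """Lowercase text and return tokens of at least min_len characters."""
    if not isinstance(text, str):  # e.g. a missing value in a pandas column
        return []
    return [tok for tok in TOKEN_RE.findall(text.lower()) if len(tok) >= min_len]

docs = ["Thank you, bye bye!", "no go head i mean", None]
tokenized = [tokenize(d) for d in docs]
print(tokenized)

# The token lists would then replace the raw strings in the loop above, e.g.:
# for words in tokenized:
#     if words:  # skipping empty documents is a cautious assumption
#         model.add_doc(words)
```

Dropping one-letter tokens is a judgment call; whether it helps depends on the data, so it is worth comparing the resulting topics with and without the filter.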