Mapping topics back to the original data frame in R
I have read data from Excel into R. The data consists of 459 rows and 3 columns:
library(openxlsx)
datamg <- read.xlsx("GC1.xlsx", sheet = 1, startRow = 1, colNames = TRUE, skipEmptyRows = TRUE)
head(datamg,3)
                                           Q             Themes1 Themes2
1 yes I believe it . Because the risk limits      Nature of risk    <NA>
2                    Yes but a very low risk               Other    <NA>
3        worried about potential regulations Regulatory concerns    <NA>
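To map the topics back to the original dataset, you have to add a unique identifier to every document in the corpus and in the document-term matrix. Since you don't have a row id (or some other unique key), I create one from the row number and add it to the original dataset: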
library(dplyr)
library(tm)
library(topicmodels)
library(tidytext)
datamg$doc_id <- 1:nrow(datamg)
datamg <- datamg %>%
select(doc_id, Q) %>%
rename('text' = Q)
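I keep only these two columns and name them "doc_id" and "text", because the tm package (the DataframeSource function) requires those names in order to attach the ids to the corpus.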
myCorpus1 <- Corpus(DataframeSource(datamg))
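To see that the ids really travel with the documents, it can help to build a document-term matrix from a small corpus and inspect its document names. The following is a minimal sketch, assuming the tm package is installed; the data frame `toy` and the objects `toy_corpus` and `toy_dtm` are illustrative names that are not part of the original answer, and the three rows are taken from the sample shown in the question.

```r
library(tm)

# Toy stand-in for datamg (hypothetical name); DataframeSource expects
# columns named doc_id and text
toy <- data.frame(
  doc_id = 1:3,
  text = c("yes I believe it . Because the risk limits",
           "Yes but a very low risk",
           "worried about potential regulations"),
  stringsAsFactors = FALSE
)

# Build the corpus the same way as above, then a document-term matrix
toy_corpus <- Corpus(DataframeSource(toy))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# The document names of the matrix are the doc_id values, stored as character
Docs(toy_dtm)
```

Because these names are character strings, they need to be converted back to integers before joining on an integer doc_id column.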
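What should be in datamg$TopicMapped? There are 3 rows in the data frame and 2 rows in the output. Did you leave out the third row of output?
@KenS. Thanks for your reply. datamg$TopicMapped means assigning Topic 1, Topic 2, etc. to the corresponding rows. Please ignore Themes1 and Themes2; I was trying to assign themes manually based on my understanding of the content in column 1 (Q).
@CPak Thanks for your reply. As per my previous comment, please ignore Themes1 and Themes2. datamg$TopicMapped should hold the best-fitting topic identified by the topic model.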
                                           Q             Themes1 Themes2 Topic Mapped
1 yes I believe it . Because the risk limits      Nature of risk    <NA>
2                    Yes but a very low risk               Other    <NA>
3        worried about potential regulations Regulatory concerns    <NA>
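# `lda` is assumed to be the fitted topicmodels LDA object; its fitting code is not shown here
# Per-document topic probabilities (gamma), one row per document/topic pair
document_topic <- as.data.frame(tidy(lda, matrix = "gamma"))
# tidy() returns the document ids as character; convert them to match the integer doc_id
document_topic$document <- as.integer(document_topic$document)
# Keep the highest-gamma topic for each document (tied topics are all kept)
document_topic <- document_topic %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup()
# Join on the row id; datamg has no Q column any more after the rename above
df_join <- inner_join(datamg, document_topic, by = c("doc_id" = "document"))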
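The final step, picking the most probable topic for each document and joining it back on the row id, can be tried without a fitted model. Below is a minimal sketch in base R (swapped in for the dplyr pipeline so that it runs without extra packages). The gamma values are made up, and `toy`, `toy_gamma`, `best` and `toy_mapped` are illustrative names, not objects from the original answer.

```r
# Toy stand-in for the original data (rows taken from the question's sample)
toy <- data.frame(
  doc_id = 1:3,
  text = c("yes I believe it . Because the risk limits",
           "Yes but a very low risk",
           "worried about potential regulations"),
  stringsAsFactors = FALSE
)

# Made-up gamma values, shaped like the output of tidy(lda, matrix = "gamma"):
# one row per document/topic pair, with document ids stored as character
toy_gamma <- data.frame(
  document = rep(c("1", "2", "3"), times = 2),
  topic = rep(1:2, each = 3),
  gamma = c(0.7, 0.4, 0.2, 0.3, 0.6, 0.8),
  stringsAsFactors = FALSE
)
toy_gamma$document <- as.integer(toy_gamma$document)

# For each document, keep the row with the largest gamma
ordered <- toy_gamma[order(toy_gamma$document, -toy_gamma$gamma), ]
best <- ordered[!duplicated(ordered$document), ]

# Join on the id column, not on the text column
toy_mapped <- merge(toy, best, by.x = "doc_id", by.y = "document")
toy_mapped[, c("doc_id", "text", "topic", "gamma")]
```

Unlike top_n(), this variant keeps exactly one row per document even when two topics tie on gamma, so the result always has the same number of rows as the input.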