Text analysis in R using LDA and tm
Hey guys, I'm having a bit of trouble running LDA: for some reason, as soon as I'm ready to run the analysis, I get an error. I'll do my best to walk through what I'm doing. Unfortunately, I can't provide the data, because the dataset I'm using is proprietary.
Creating a minimal reproducible example shouldn't be that hard. For example:
library(tm)
library(topicmodels)
raw <- c("hello","","goodbye")
tm <- Corpus(VectorSource(raw))
dtm <- DocumentTermMatrix(tm)
LDA(dtm,4)
# Error in LDA(dtm, 4) :
# Each row of the input matrix needs to contain at least one non-zero entry
Please take the time to create a reproducible example. Often, in the process of doing so, you will find your own mistake and can fix it easily. At the very least, it will help others see the problem more clearly and strip away unnecessary information.
Everyone knows I'm a fan of @MrFlick, and I'd +1 this answer, but I have to defend the OP: sometimes it's hard to produce a reproducible error simply because you're not sure what is causing it. I got the last error message the OP shows, and I don't know how to reproduce it. For me, it came from a different command, namely summary(tdm). But in any case, yes, reproducible examples are essential for us to find a solution, so I'm not disagreeing with Mr. Flick.
# Build a cleaned corpus from a character vector of documents
createdtm <- function(x) {
  myCorpus <- Corpus(VectorSource(x))
  myCorpus <- tm_map(myCorpus, PlainTextDocument)
  docs <- tm_map(myCorpus, tolower)
  # Drop SMART stopwords, then a few extra words
  docs <- tm_map(docs, removeWords, stopwords(kind = "SMART"))
  docs <- tm_map(docs, removeWords, c("the", " the", "will", "can", "regards", "need", "thanks", "please", "http"))
  docs <- tm_map(docs, stripWhitespace)
  docs <- tm_map(docs, PlainTextDocument)
  return(docs)
}
# post holds the raw documents (proprietary, not shown)
predtm <- createdtm(post)
[[1]]
<<PlainTextDocument (metadata: 7)>>
Here text string
[[2]]
<<PlainTextDocument (metadata: 7)>>
Here another string
dtm <- DocumentTermMatrix(predtm)
inspect(dtm)
<<DocumentTermMatrix (documents: 14640, terms: 39972)>>
Non-/sparse entries: 381476/584808604
Sparsity : 100%
Maximal term length: 86
Weighting : term frequency (tf)
Docs truclientrre truddy trudi trudy true truebegin truecontrol
Terms
Docs truecrypt truecryptas trueimage truely truethis trulibraryref
Terms
Docs trumored truncate truncated truncatememory truncates
Terms
Docs truncatetableinautonomoustrx truncating trunk trunkhyper
Terms
Docs trunking trunkread trunks trunkswitch truss trust trustashtml
Terms
Docs trusted trustedbat trustedclient trustedclients
Terms
Docs trustedclientsjks trustedclientspwd trustedpublisher
Terms
Docs trustedreviews trustedsignon trusting trustiv trustlearn
Terms
Docs trustmanager trustpoint trusts truststorefile truststorepass
Terms
Docs trusty truth truthfully truths tryd tryed tryig tryin tryng
run.lda <- LDA(dtm,4)
Error in LDA(dtm, 4) :
Each row of the input matrix needs to contain at least one non-zero entry
rowTotals <- apply(dtm , 1, sum)
dtm.new <- dtm[rowTotals> 0]
Error in `[.simple_triplet_matrix`(dtm, rowTotals > 0) :
Logical vector subscripting disabled for this object.
The second error comes from the missing comma in the subset. dtm[rowTotals > 0] indexes the matrix as if it were a flat vector, and the sparse matrix class does not allow a logical subscript there. With the comma, dtm[rowTotals > 0, ] selects rows, which is what was intended:
library(tm)
library(topicmodels)
raw <- c("hello","","goodbye")
tm <- Corpus(VectorSource(raw))
dtm <- DocumentTermMatrix(tm)
LDA(dtm,4)
# Error in LDA(dtm, 4) :
# Each row of the input matrix needs to contain at least one non-zero entry
rowTotals <- apply(dtm , 1, sum)
dtm <- dtm[rowTotals>0,]
LDA(dtm, 4)
#A LDA_VEM topic model with 4 topics.
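One more thought, as a sketch rather than a definitive fix: on a matrix the size of the one in the question (14640 documents by 39972 terms), apply() will, as far as I know, convert the sparse matrix to a dense one before summing, which can take a lot of memory. The slam package (an assumption on my part, it is not used in the original post, but it is normally installed alongside tm) provides row_sums(), which works on the sparse representation directly. The variable names below are illustrative.

```r
library(tm)
library(topicmodels)
library(slam)  # assumption: not in the original post; tm imports it

raw <- c("hello", "", "goodbye")
corpus <- Corpus(VectorSource(raw))
dtm <- DocumentTermMatrix(corpus)

# Row totals computed on the sparse matrix, without a dense copy
row_totals <- slam::row_sums(dtm)

# Documents that ended up with no terms (here, the empty string)
empty_docs <- which(row_totals == 0)

# Note the trailing comma: this subsets rows, not individual cells
dtm_nonempty <- dtm[row_totals > 0, ]

fit <- LDA(dtm_nonempty, k = 4)
```

Keeping empty_docs around is handy if you later need to match topic assignments back to the original documents, since the filtered matrix has fewer rows than the corpus.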