Text analysis in R using LDA and tm

Tags: r, tm, lda

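Hey guys, I'm having a bit of trouble running LDA: for some reason, as soon as I'm ready to run the analysis, I get errors. I'll do my best to walk through what I'm doing. Unfortunately, I can't provide the data, because the data I'm using is proprietary.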


Creating a minimal reproducible example shouldn't be that hard. For example:

library(tm)
library(topicmodels)
raw <- c("hello","","goodbye")
tm <- Corpus(VectorSource(raw))

dtm <- DocumentTermMatrix(tm)

LDA(dtm,4)

# Error in LDA(dtm, 4) : 
#   Each row of the input matrix needs to contain at least one non-zero entry

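Please take the time to create reproducible examples. Often in doing so you'll discover your own error and can easily fix it. At the very least, it will help others see the problem more clearly and eliminate unnecessary information.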


Everyone knows I'm a @MrFlick fan and I'll +1 this answer, but I have to defend the OP: sometimes it's hard to produce a reproducible error simply because you aren't sure what is causing it. I've gotten the last error message the OP shows, and I don't know how to reproduce it. For me, it came from a different command, namely summary(tdm). But anyway, yes, reproducible examples are essential for us to find solutions, so I'm not disagreeing with Mr. Flick.

# Preprocessing helper from the question. Despite its name, it returns a
# cleaned corpus, not a DocumentTermMatrix.
createdtm <- function(x) {
  myCorpus <- Corpus(VectorSource(x))
  myCorpus <- tm_map(myCorpus, PlainTextDocument)
  docs <- tm_map(myCorpus, tolower)
  docs <- tm_map(docs, removeWords, stopwords(kind = "SMART"))
  docs <- tm_map(docs, removeWords, c("the", " the", "will", "can", "regards", "need", "thanks", "please", "http"))
  docs <- tm_map(docs, stripWhitespace)
  docs <- tm_map(docs, PlainTextDocument)
  return(docs)
}

predtm <- createdtm(post)

# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# Here text string
#
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
# Here another string
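
A side note on the preprocessing step, offered as a sketch rather than as the poster's method: in more recent versions of tm, plain functions such as tolower are usually wrapped in content_transformer() so that the corpus keeps holding text documents, which makes the extra PlainTextDocument passes unnecessary. The function name clean_corpus and the sample strings below are hypothetical and not from the original post, and the custom word list from the question is left out for brevity.

```r
library(tm)

# Hypothetical variant of the question's helper; the name is illustrative only.
clean_corpus <- function(x) {
  docs <- VCorpus(VectorSource(x))
  # content_transformer() lets tm_map apply a plain character function
  # while each element stays a text document
  docs <- tm_map(docs, content_transformer(tolower))
  docs <- tm_map(docs, removeWords, stopwords(kind = "SMART"))
  docs <- tm_map(docs, stripWhitespace)
  docs
}

# Made-up sample input, standing in for the proprietary data
cleaned <- clean_corpus(c("Topic Models", "Sparse Matrix"))
first_text <- content(cleaned[[1]])
```

Whether this matches the tm version used in the question is an assumption; the question's own code is kept unchanged above.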

dtm <- DocumentTermMatrix(predtm)
inspect(dtm)

# <<DocumentTermMatrix (documents: 14640, terms: 39972)>>
# Non-/sparse entries: 381476/584808604
# Sparsity           : 100%
# Maximal term length: 86
# Weighting          : term frequency (tf)
#
# Docs           truclientrre truddy trudi trudy true truebegin truecontrol
#               Terms
# Docs           truecrypt truecryptas trueimage truely truethis trulibraryref
#               Terms
# Docs           trumored truncate truncated truncatememory truncates
#               Terms
# Docs           truncatetableinautonomoustrx truncating trunk trunkhyper
#               Terms
# Docs           trunking trunkread trunks trunkswitch truss trust trustashtml
#               Terms
# Docs           trusted trustedbat trustedclient trustedclients
#               Terms
# Docs           trustedclientsjks trustedclientspwd trustedpublisher
#               Terms
# Docs           trustedreviews trustedsignon trusting trustiv trustlearn
#               Terms
# Docs           trustmanager trustpoint trusts truststorefile truststorepass
#               Terms
# Docs           trusty truth truthfully truths tryd tryed tryig tryin tryng

run.lda <- LDA(dtm,4)
# Error in LDA(dtm, 4) : 
#   Each row of the input matrix needs to contain at least one non-zero entry

rowTotals <- apply(dtm , 1, sum)
dtm.new   <- dtm[rowTotals> 0]
# Error in `[.simple_triplet_matrix`(dtm, rowTotals > 0) : 
#   Logical vector subscripting disabled for this object.

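# Reproducible example: the empty second document yields an all-zero row
library(tm)
library(topicmodels)
raw <- c("hello","","goodbye")
tm <- Corpus(VectorSource(raw))

dtm <- DocumentTermMatrix(tm)

LDA(dtm,4)

# Error in LDA(dtm, 4) : 
#   Each row of the input matrix needs to contain at least one non-zero entry

# Fix: drop the all-zero rows. Note the comma in dtm[rowTotals>0,], which
# selects rows; dtm[rowTotals> 0] without it triggers the
# "Logical vector subscripting disabled" error.
rowTotals <- apply(dtm , 1, sum)
dtm <- dtm[rowTotals>0,]
LDA(dtm, 4)

#A LDA_VEM topic model with 4 topics.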
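
One caveat worth sketching: apply() converts its input to a dense matrix first, which may be expensive for a matrix the size of the one in the question (14640 documents by 39972 terms). A possible alternative, assuming the slam package (which tm builds on) is installed, is to compute the row totals on the sparse representation with slam::row_sums. The variable names keep, dtm_nonempty and dropped are illustrative and not from the original posts.

```r
library(tm)
library(slam)

raw <- c("hello", "", "goodbye")
dtm <- DocumentTermMatrix(Corpus(VectorSource(raw)))

# Row totals computed on the sparse matrix, without building a dense copy
keep <- row_sums(dtm) > 0

# The trailing comma selects rows (documents); all columns (terms) are kept
dtm_nonempty <- dtm[keep, ]

# Positions of the documents that were dropped, so that topic assignments
# can later be matched back to the original input
dropped <- which(!keep)
```

The filtered dtm_nonempty can then be passed to LDA() in the same way as in the answer above. Keeping dropped around is a design choice: once empty documents are removed, row positions in the model output no longer line up with the original vector.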