Getting text back into an R object in the tm package


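I'm new to the tm package and would really appreciate your help. I have a collection of articles from which I stripped unnecessary symbols and stop words, using various functions from the tm package (see below). In the end I'm left with 201 documents containing the clean strings I need; however, the result is not an ordinary R object but a VCorpus object. How can I stitch all of these processed documents together into a single text file, as one long string?

In other words, how do I convert a VCorpus object into a data frame, a list, or some other R object?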

corpus <-iconv(posts$message, "latin1", "ASCII", sub="")

corpus <- Corpus(VectorSource(docs))
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)

corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, tolower)

#remove special characters for emails

for(j in seq(corpus))   
{   
  corpus[[j]] <- gsub("/", " ", corpus[[j]])   
  corpus[[j]] <- gsub("@", " ", corpus[[j]])   
  corpus[[j]] <- gsub("\\|", " ", corpus[[j]])   
}   


library(SnowballC)

corpus <- tm_map(corpus, stemDocument)  

#remove common English stopwords 
docs <- tm_map(docs, removeWords, stopwords("english"))

#remove words that will be common in our given context
docs <- tm_map(docs, removeWords, c("department", "email", "job", "fresher", "internship"))

#removeUrls
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

corpus <- tm_map(corpus, removeURL)

> corpus
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 201

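A corpus is a list of plain text documents. If you want to extract all of the content as a character array, you can use sapply together with content to loop over the list and pull out each document's content. Using a test data set: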
library(tm)  # provides the crude corpus, tm_map and content
data("crude")
x <- tm_map(crude, stemDocument, lazy = TRUE)
x <- tm_map(x, content_transformer(tolower))

xx <- sapply(x, content)
str(xx)
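
To get from that character vector to the single long string, text file, or data frame the question asks about, something along these lines may work. It is only a sketch: the three-document `docs` vector is a hypothetical stand-in for the 201 cleaned posts, and the names `texts`, `df`, `all_text` and `out_file` are illustrative. `vapply` with `paste` is used instead of plain `sapply` so that a document whose content spans several lines still yields exactly one string.

```r
library(tm)

# Hypothetical stand-in for the cleaned corpus (assumption: one string per post)
docs <- c("first cleaned document", "second cleaned document", "third one")
corpus <- VCorpus(VectorSource(docs))

# One character string per document; multi-line documents are collapsed
texts <- vapply(corpus,
                function(d) paste(content(d), collapse = " "),
                character(1))

# A data frame with one row per document
df <- data.frame(doc_id = seq_along(texts),
                 text = unname(texts),
                 stringsAsFactors = FALSE)

# Everything stitched together as one long string
all_text <- paste(texts, collapse = " ")

# Write the long string to a text file (a temporary file here)
out_file <- tempfile(fileext = ".txt")
writeLines(all_text, out_file)
```

If a plain list is preferred, `lapply(corpus, content)` should return one list element per document.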

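When I replace your data with my own, I get the following error: Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character".

Then you should provide a reproducible example in your question, so that I can see how your corpus differs from the sample I created. It looks like your removeURL should be wrapped in content_transformer. I suggest you read the documentation for that function. Don't edit your question into a completely different one; open a new question instead.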
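
The content_transformer suggestion can be sketched as follows. The likely cause of the error is that passing a plain character function such as tolower or a gsub wrapper straight to tm_map (or assigning gsub output to corpus[[j]]) replaces each document with a bare character vector, so later steps that look up metadata fail. This is a minimal sketch under assumptions: the two sample strings are made up, `toSpace` is a hypothetical helper name, and the first sample has its punctuation already stripped, as it would be after removePunctuation in the question's pipeline.

```r
library(tm)

# Made-up sample posts (assumption: punctuation already removed from the URL)
docs <- c("see httpexamplecom now", "mail me@example.org | thanks/bye")
corpus <- VCorpus(VectorSource(docs))

# The question's URL remover, unchanged
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# Hypothetical helper: replace anything matching `pattern` with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Wrapping each function keeps every element a PlainTextDocument
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removeURL))
corpus <- tm_map(corpus, toSpace, "/|@|\\|")  # replaces the for loop over gsub

result <- sapply(corpus, content)
```

Functions that tm itself provides as transformations, such as removePunctuation, removeNumbers, removeWords and stemDocument, should not need the wrapper.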