Cleaning text data using R



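I have a data frame with more than 100 columns and 1 million rows. One column holds text data, and its values are very long sentences. I have written code to clean the data, but it is not cleaning it. I want to remove all the stop words: "the", "you", "like", "for", and so on.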

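scorel = function(sentences, pos.words, .progress='none')
{
  require(plyr)
  require(stringr)
  scores = laply(sentences, function(sentence, pos.words)
  {
    # clean up sentences with R's regex-driven global substitute, gsub():
    # (POSIX classes need double brackets: '[[:punct:]]', not '[:punct:]')
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    sentence = gsub("@\\w+ *", "", sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    words = unlist(word.list)
    # compare our words to the dictionary of positive terms
    pos.matches = match(words, pos.words)
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    # pos.matches = !is.na(pos.matches)
    pos.matches = !is.na(pos.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    # score = sum(pos.matches)
    score = sum(pos.matches)
    return(score)
  }, #pos.words, neg.words, .progress=.progress )
  pos.words, .progress=.progress )
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}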
You can clean the data using the tm package, as shown below:
library(tm)
corpus <- Corpus(VectorSource(sentence)) # Convert input data to corpus
corpus <- tm_map(corpus, removeWords, stopwords('english')) # Remove stop words using the tm package
dataframe <- data.frame(text=unlist(sapply(corpus, `[`, "content")),
                  stringsAsFactors=FALSE) # Convert data back to data frame from corpus
sentence <- as.character(dataframe$text) # Take the text column; as.character(dataframe) would deparse the whole frame into one string

> sentence=c('this is an best example','A person is nice')
> sentence
[1] "this is an best example" "A person is nice"       
> corpus <- Corpus(VectorSource(sentence))
> corpus <- tm_map(corpus, removeWords, stopwords('english'))
> dataframe<-data.frame(text=unlist(sapply(corpus, `[`, "content")), 
+                       stringsAsFactors=FALSE)
> sentence<-as.character(dataframe$text)
> sentence
[1] "   best example" "A person  nice" 
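
For a single text column in a large data frame, the corpus round trip may not be needed at all. What follows is only a minimal base R sketch, not the method from the answer: the data frame `df`, its `text` column, and the short `stop.words` vector are assumptions made up for illustration (the vector stands in for `stopwords('english')`). It removes whole-word matches with one vectorised `gsub()` call and then collapses the whitespace that removal leaves behind.

```r
# Hypothetical example data; in practice this would be the large data frame.
df <- data.frame(id = 1:2,
                 text = c("this is an best example", "A person is nice"),
                 stringsAsFactors = FALSE)

# Assumed stop-word list for illustration (stands in for stopwords('english')).
stop.words <- c("this", "is", "an", "a", "the", "you", "like", "for")

# Build one pattern that matches any stop word as a whole word.
pattern <- paste0("\\b(", paste(stop.words, collapse = "|"), ")\\b")

# Remove stop words case-insensitively, then tidy the leftover whitespace.
cleaned <- gsub(pattern, "", df$text, ignore.case = TRUE, perl = TRUE)
cleaned <- gsub("\\s+", " ", cleaned)
df$text <- trimws(cleaned)

print(df$text)
```

Because the column is overwritten in place, the data frame can be passed on to a function such as `scorel` without any conversion step. Unlike the transcript above, this sketch matches case-insensitively, so the leading "A" is removed as well.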

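Are you aware of the tm package?
In that link, we have to convert to a corpus and then remove the stop words... Is there any way to convert back to a data frame and pass it to the function above?
@玉兰油 Please check the answer I have provided. You can copy the suggested code directly into the code mentioned in the question.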
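
As a rough illustration of folding stop-word removal into the question's per-sentence cleaning step, here is a minimal sketch. It is not the original function: the helper `clean.words`, the `stop.words` and `pos.words` vectors, and the sample sentences are all hypothetical, and base R (`strsplit`, `vapply`) is used in place of plyr and stringr so the example runs on its own.

```r
# Hypothetical helper: clean one sentence and return its words without stop words.
clean.words <- function(sentence, stop.words) {
  sentence <- gsub("[[:punct:]]", "", sentence)  # drop punctuation
  sentence <- gsub("[[:cntrl:]]", "", sentence)  # drop control characters
  sentence <- gsub("\\d+", "", sentence)         # drop digits
  sentence <- tolower(sentence)                  # lower-case before matching
  words <- unlist(strsplit(sentence, "\\s+"))    # split on whitespace
  words <- words[nzchar(words)]                  # drop empty tokens
  words[!words %in% stop.words]                  # keep only non-stop words
}

# Assumed inputs, for illustration only.
stop.words <- c("the", "you", "like", "for", "is", "an", "this", "a")
pos.words  <- c("best", "nice")
sentences  <- c("This is an best example!", "A person is nice, like you.")

# Score each sentence by counting the remaining words found in pos.words.
scores <- vapply(sentences, function(s) {
  sum(clean.words(s, stop.words) %in% pos.words)
}, integer(1), USE.NAMES = FALSE)

print(scores)
```

Filtering the token vector with `%in%` avoids building a regular expression from the stop-word list, which may matter when the list is long.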