How to extract bigrams in RStudio and combine them into one CSV file
Tags: r, csv, nlp, rstudio

OK, so I am trying to get R to read in sentences, pull out the bigrams, and combine all of the bigrams into one CSV. Right now I have code that can pull the bigrams out of a single sentence:
sentence <- gsub('[[:punct:]]', '', sentence)  # strip punctuation
sentence <- gsub('[[:cntrl:]]', '', sentence)  # strip control characters
sentence <- gsub('\\d+', '', sentence)         # strip digits
sentence <- tolower(sentence)
words <- strsplit(sentence, "\\s+")[[1]]
New <- NULL
# seq_len(length(words) - 1) avoids the precedence trap in 1:length(words)-1,
# which evaluates as (1:length(words)) - 1 and starts the loop at 0
for (i in seq_len(length(words) - 1)) {
  New[i] <- paste(words[i], words[i + 1])
}
New <- as.matrix(New)
colnames(New) <- "Bigrams"
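Since the stated goal is one combined CSV, here is a minimal sketch of that final step (not code from the original post). It assumes the input lives in a character vector, here called `sentences`, a name made up for this example:

```r
# Sketch: collect bigrams from several sentences and write one CSV.
# `sentences` is a made-up example vector; substitute your own input.
sentences <- c("This is some text.", "This is some other text.")

all.bigrams <- character(0)
for (sentence in sentences) {
  sentence <- gsub('[[:punct:]]', '', sentence)  # same cleaning as above
  sentence <- gsub('[[:cntrl:]]', '', sentence)
  sentence <- gsub('\\d+', '', sentence)
  sentence <- tolower(sentence)
  words <- strsplit(sentence, "\\s+")[[1]]
  if (length(words) > 1) {
    # vectorised pairing: each word with its successor
    all.bigrams <- c(all.bigrams, paste(words[-length(words)], words[-1]))
  }
}

New <- as.matrix(all.bigrams)
colnames(New) <- "Bigrams"
write.csv(New, "bigrams.csv", row.names = FALSE)
```

The `paste(words[-length(words)], words[-1])` call builds all of a sentence's bigrams in one vectorised step, which replaces the per-sentence loop above.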
Not a direct answer, but you may find it simpler to use the built-in functionality of tm and RWeka:
library(RWeka)   # for NGramTokenizer(...)
library(tm)

# sample data
data <- data.frame(text = c("This is some text.",
                            "This is some other text.",
                            "This is some punctuation; and some more, and more...",
                            "These are some numbers: 1,2,3,4, five."))

doc <- PlainTextDocument(data$text)
doc <- removeNumbers(doc)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(Corpus(VectorSource(doc)),
                          control = list(tokenize = BigramTokenizer))
result <- rownames(tdm)
result
# [1] "and more" "and some" "are some" "is some"
# [5] "more and" "numbers five" "other text" "punctuation and"
# [9] "some more" "some numbers" "some other" "some punctuation"
# [13] "some text" "these are" "this is"
Thank you for your reply. I have seen people do this online, but I cannot get RWeka to work on my computer. Do you know why that might be? This is the error message I receive: "Error in loadNamespace() for 'rJava': .onLoad failed, details: call: fun(libname, pkgname), error: Unable to determine JAVA_HOME from the registry. Error: package or namespace load failed for 'RWeka'."

See my edit for an approach that does not use RWeka. As for installing the package, my advice is to make sure you first have the latest version of R and the latest version of the Java runtime installed. It looks like the installer simply cannot find a Java runtime on your system.
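As an aside (not from the original thread): when rJava cannot locate Java, one common workaround is to point R at the Java installation explicitly before loading the package. The path below is a placeholder, not a real location on your machine:

```r
# Hypothetical workaround: set JAVA_HOME from within R before loading
# rJava/RWeka. Replace the placeholder path with your actual Java directory.
Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jre8")  # placeholder path
Sys.getenv("JAVA_HOME")  # check that the variable is now set
# library(rJava)         # should load once JAVA_HOME points at a real JRE
```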
bigrams <- function(text) {
  word.vec <- strsplit(text, "\\s+")[[1]]
  sapply(1:(length(word.vec) - 1), function(x) paste(word.vec[x], word.vec[x + 1]))
}

doc <- PlainTextDocument(data$text)
doc <- removeNumbers(doc)
doc <- removePunctuation(doc)
tdm <- TermDocumentMatrix(Corpus(VectorSource(doc)),
                          control = list(tokenize = bigrams))
result.2 <- rownames(tdm)
identical(result, result.2)
# [1] TRUE
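To get from the term-document matrix back to the CSV the question asks for, one option is to convert it with `as.matrix` and write out the bigrams with their total counts. This is a sketch: a small stand-in matrix mimics the shape of `as.matrix(tdm)` (rows are bigrams, columns are documents) so the snippet runs on its own:

```r
# Sketch: write bigrams and their total counts to a CSV.
# Stand-in for as.matrix(tdm); with a real tdm use:  m <- as.matrix(tdm)
m <- matrix(c(1, 0,
              1, 1,
              0, 2),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("is some", "some text", "this is"),
                            c("doc1", "doc2")))

counts <- data.frame(Bigram = rownames(m),
                     Count  = rowSums(m),   # total occurrences across documents
                     row.names = NULL)
write.csv(counts, "bigram_counts.csv", row.names = FALSE)
```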