How to find the frequency of n-grams and visualize them in a wordcloud using R?


I have a column containing text strings that I want to analyze. I would like to know which words are used most often and visualize this in a wordcloud. For single words (unigrams) I have managed to do so, but I cannot get my code to work for n-grams (e.g. bigrams, trigrams). I have included my unigram code here. I am happy to adjust this code so that it works, or to use a completely new piece of code. How should I best approach this?

library(wordcloud)
library(RColorBrewer)
library(wordcloud2)
library(tm)
library(stringr)

#Remove special characters and lowercase the text
df$text <- str_replace_all(df$text, "[^[:alnum:]]", " ")
df$text <- tolower(df$text)

#From df to Corpus
corpus <- Corpus(VectorSource(df))

#Remove English stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))

#Make term document matrix
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

#Make list of most frequent words
tdm_freq <- as.matrix(tdm) 
words <- sort(rowSums(tdm_freq),decreasing=TRUE) 
tdm_freq <- data.frame(word = names(words),freq=words)
rm(words)

#Make a wordcloud
wordcloud2(tdm_freq, size = 0.4, minSize = 10, gridSize =  0,
           fontFamily = 'Segoe UI', fontWeight = 'normal',
           color = 'red', backgroundColor = "white",
           minRotation = -pi/4, maxRotation = pi/4, shuffle = TRUE,
           rotateRatio = 0.4, shape = 'circle', ellipticity = 0.8,
           widgetsize = NULL, figPath = NULL, hoverFunction = NULL)

Change Corpus to VCorpus so that the tokenization works:

library(tm)  # also attaches NLP, which provides words() and ngrams()

# Data
df <- data.frame(text = c("I have dataframe with a column I have dataframe with a column", 
                          "I would like to know what are the most I would like to know what are the most", 
                          "For single words (unigrams) I've managed to do so For single words (unigrams) I've managed to do so",
                          "Here I've included my code for the unigrams Here I've included my code for the unigrams"))

# Build a VCorpus and fold all cleaning steps into a single tm_map call via tm_reduce
corpus <- VCorpus(VectorSource(df))
funs <- list(stripWhitespace,
             removePunctuation,
             function(x) removeWords(x, stopwords("english")),
             content_transformer(tolower))
corpus <- tm_map(corpus, FUN = tm_reduce, tmFuns = funs)

# Tokenizer for bigrams, using words() and ngrams() from NLP (no extra tokenizer package needed)
ngram_token <-  function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse=" "), use.names=FALSE)

# Pass into TDM control argument
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = ngram_token))
freq <- rowSums(as.matrix(tdm))
tdm_freq <- data.frame(term = names(freq), occurrences = freq)
tdm_freq


                               term occurrences
code unigrams         code unigrams           2
column dataframe   column dataframe           1
column like             column like           1
dataframe column   dataframe column           2
included code         included code           2
...
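
To get from the frequency table to the wordcloud the question asks for, the bigram frequencies can be passed to wordcloud2 in the same way as the unigram table. A minimal sketch, assuming the tdm_freq data frame built above; wordcloud2 expects a data frame whose first column holds the words and whose second column holds the frequencies, which tdm_freq already satisfies:

# Visualize the bigram frequencies, reusing the styling from the unigram call
library(wordcloud2)
wordcloud2(tdm_freq, size = 0.4, fontFamily = 'Segoe UI',
           color = 'red', backgroundColor = "white", shape = 'circle')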

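The same approach extends to trigrams (or any n) by changing the window size passed to ngrams() from 2 to 3. A sketch of a parameterized version of the tokenizer; make_ngram_token is an illustrative helper name, not part of tm:

# Build a tokenizer for any n; make_ngram_token(3) tokenizes into trigrams
make_ngram_token <- function(n) {
  function(x) unlist(lapply(ngrams(words(x), n), paste, collapse = " "),
                     use.names = FALSE)
}

tdm_tri <- TermDocumentMatrix(corpus, control = list(tokenize = make_ngram_token(3)))
freq_tri <- sort(rowSums(as.matrix(tdm_tri)), decreasing = TRUE)
head(freq_tri)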