R: Extract and count common word-pairs from a character vector

Tags: r, regex-lookarounds, tm, qdap

How can I find frequent pairs of adjacent words in a character vector? For example, using the crude data set, some common pairs are "crude oil", "oil market", and "million barrels".

The code for the small example below tries to identify frequent terms and then, using a positive lookahead assertion, count how often those frequent terms are immediately followed by another frequent term. But the attempt fails.

Any guidance on how to create a data frame that shows the common pairs in the first column ("pairs") and the number of times they appear in the text in the second column ("count") would be appreciated.

library(qdap)
library(tm)
library(stringr)   # str_replace_all() and str_c() below need stringr

# from the crude data set, create a text file from the first three documents, then clean it
text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", " ")  # replace double spaces with a single space
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))

# pick the top 10 individual words by frequency, since they will likely form the most common pairs
freq.terms <- head(freq_terms(text.var = text), 10)

# create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")

# match frequent terms that are followed by a frequent term
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")
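
A likely reason the last line fails: inside the quotes, freq.terms.pat is treated as literal regex text rather than being replaced by the variable's value. A minimal sketch of splicing the value into the pattern (this only illustrates that one fix, not the full answer below):

pat <- sprintf("(%s)(?= (%s))", freq.terms.pat, freq.terms.pat)  # build the pattern from the variable's value
pairs <- str_extract_all(string = text, pattern = pat)           # lookaheads are supported by stringr's ICU regex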

One idea here is to create a new corpus of bigrams:

A bigram (or digram) is every sequence of two adjacent elements in a string of tokens.

A recursive function to extract the bigrams:

bigram <-
  function(xs){
    # join the first two tokens with "_", then recurse on the remainder
    if (length(xs) >= 2)
       c(paste(xs[seq(2)], collapse = '_'), bigram(tail(xs, -1)))
  }
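
For example, a quick check of the helper:

bigram(c("crude", "oil", "prices"))
# [1] "crude_oil"  "oil_prices"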
The bigrams "abdul_aziz" and "a_futures" come out as the most frequent; the first version of the code that produced them is reproduced further below for reference. You should re-clean the data to remove filler words ("of", "the", ...), but this should be a good start.

Edit, after a comment from the OP: if you want the bigram frequencies over the whole corpus, the basic idea is to compute the bigrams inside the loop and then tabulate the frequencies of the combined result. I have also added some better text-cleaning steps.

res <- unlist(lapply(crude, function(x){
  x <- removeNumbers(tolower(x))
  x <- removeWords(x, words = c("the", "of"))
  x <- removePunctuation(x)
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub('  +', ' ', x)   # collapse runs of spaces into a single space
  ## after cleaning, extract the bigrams; frequencies are tabulated below
  words <- strsplit(x, " ")[[1]]
  bigrams <- bigram(words[nchar(words) > 2])
}))

library(data.table)   # for setDT()
xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]


#                 res Freq
#    1: abdulaziz_bin    1
#    2:  ability_hold    1
#    3:  ability_keep    1
#    4:  ability_sell    1
#    5:    able_hedge    1
# ---                   
# 2177:    last_month    6
# 2178:     crude_oil    7
# 2179:  oil_minister    7
# 2180:     world_oil    7
# 2181:    oil_prices   14
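
To get exactly the two-column data frame the question asks for, a small extra step on top of xx (a sketch; the column names "pair" and "count" are my choice):

top.pairs <- xx[order(-Freq)]   # xx is already a data.table after setDT()
setnames(top.pairs, c("res", "Freq"), c("pair", "count"))
head(top.pairs, 5)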

This answer shows how to generate a term-document matrix of ngrams; rowSums then gives the number of occurrences, from which you can select the frequent pairs.

Comments exchanged on the two answers:

lawyeR (on the first answer): A good answer, @agstudy. My results show some bigrams more than once (e.g. "case_team"), perhaps because your solution finds and counts the bigrams within each document? Is there a tweak to show the bigram frequencies across the whole corpus, which is what I really want, or should I post a follow-up question? tail(sort(res), 5) gave case_team, privilege_review, additional_search_terms, case_team with counts 3 3 4 4 4.

lawyeR (on this answer): A good answer, but @agstudy's results in a data frame are easier to work with. Beyond that, I want the bigram counts for the whole corpus, not just per document; see my comment on the other answer.

Answer author: @lawyeR my results are also in a data frame, and the "total" column already holds the bigram counts for the whole corpus.

Answer author: @lawyeR I would appreciate your feedback, as I still do not understand your comment or what is wrong with the solution I provided.

lawyeR: I was not implying your answer was wrong; I said it was good. I ran both and preferred the other answer's results. Nothing deep, nothing malicious; the check mark cannot be shared between two answers, so I had to choose. For what it is worth, I upvoted your answer.

The code:
# tokenizer that turns each document into its space-separated bigrams
BigramTokenizer <- function(x) {
  unlist(
    lapply(ngrams(words(x), 2), paste, collapse = " "),
    use.names = FALSE
  )
}
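
A quick sanity check of what NLP's ngrams produces on a token vector (the tokenizer above relies on it):

library(NLP)
unlist(lapply(ngrams(c("crude", "oil", "prices"), 2), paste, collapse = " "))
# [1] "crude oil"  "oil prices"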
# v is not defined in the original post; assuming it is a corpus built from the
# three cleaned documents, e.g. v <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(v, control = list(tokenize = BigramTokenizer))

library(dplyr)
as.data.frame(as.matrix(tdm)) %>%   # inspect(tdm) only prints in current tm; convert explicitly
  add_rownames() %>%
  mutate(total = rowSums(.[,-1])) %>%
  arrange(desc(total))
#Source: local data frame [272 x 5]
#
#             rowname X1 X2 X3 total
#1          crude oil  2  0  1     3
#2            mln bpd  0  3  0     3
#3         oil prices  0  3  0     3
#4       cut contract  2  0  0     2
#5        demand opec  0  2  0     2
#6        dlrs barrel  2  0  0     2
#7    effective today  1  0  1     2
#8  emergency meeting  0  2  0     2
#9      oil companies  1  1  0     2
#10      oil industry  0  2  0     2
#..               ... .. .. ..   ...
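
As an aside (my addition, not from the original thread): if you do not need the per-document breakdown, tm's findFreqTerms pulls the frequent pairs straight from the matrix:

findFreqTerms(tdm, lowfreq = 3)
# [1] "crude oil"  "mln bpd"    "oil prices"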
For reference, the first version of the code from the first answer, before the extra cleaning; this is the version that produced the "abdul_aziz" and "a_futures" pairs mentioned earlier:
res <- unlist(lapply(crude, function(x){
  x <- tm::removeNumbers(tolower(x))
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub('  +', ' ', x)   # collapse runs of spaces into a single space
  ## after cleaning, compute per-document frequencies using table
  freqs <- table(bigram(strsplit(x, " ")[[1]]))
  freqs[freqs > 1]
}))


 as.data.frame(tail(sort(res),5))
                          tail(sort(res), 5)
reut-00022.xml.hold_a                      3
reut-00022.xml.in_the                      3
reut-00011.xml.of_the                      4
reut-00022.xml.a_futures                   4
reut-00010.xml.abdul_aziz                  5