R: Counting the common words in two strings

r string text-mining data-analysis

I have two strings and want to count the words they have in common:

a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"

Perhaps use intersect together with str_extract_all from the stringr package. For multiple strings, you can place them in a list or a vector:

 library(stringr)
 vec1 <- c(a,b)
 # Extract the words of each string, then intersect the word sets
 Reduce(`intersect`,str_extract_all(vec1, "\\w+"))
 #[1] "Roy"     "travels" "Africa" 

To count them (here with stri_extract_all_regex from the stringi package):
 library(stringi)
 length(Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+")))
 #[1] 3

Or using base R:

  Reduce(`intersect`,regmatches(vec1,gregexpr("\\w+", vec1)))
  #[1] "Roy"     "travels" "Africa" 
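
The base R approach above can be wrapped in a small function. This is only a sketch: the name count_common_words and the ignore_case argument are hypothetical and do not come from the original answer.

```r
# Hypothetical helper (an illustration, not part of the original answer):
# find the words shared by every string in a character vector, base R only.
count_common_words <- function(strings, ignore_case = FALSE) {
  if (ignore_case) strings <- tolower(strings)
  # Extract runs of word characters from each string (one list element each)
  words <- regmatches(strings, gregexpr("\\w+", strings))
  # Intersect the word sets across all strings
  common <- Reduce(intersect, words)
  list(words = common, n = length(common))
}

a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"
res <- count_common_words(c(a, b))
res$words
#[1] "Roy"     "travels" "Africa"
res$n
#[1] 3
```

Returning both the words and their count keeps the two uses shown above (listing and counting) in one call.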

You can use strsplit and intersect from base R:
> a <- "Roy lives in Japan and travels to Africa"
> b <- "Roy travels Africa with this wife"
> a_split <- unlist(strsplit(a, split=" "))
> b_split <- unlist(strsplit(b, split=" "))
> length(intersect(a_split, b_split))
[1] 3
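
One caveat with splitting on spaces: punctuation stays attached to the word. The sketch below compares the two tokenizations on string b and on the third example string that appears elsewhere on this page. The variable names c_str, by_space and by_word are only illustrative.

```r
b <- "Roy travels Africa with this wife"
# Third example string from this page; named c_str here to avoid masking c()
c_str <- "Bob also travels Africa for trips but lives in the US unlike Roy."

# Splitting on a single space keeps the trailing period, so "Roy." != "Roy"
by_space <- intersect(unlist(strsplit(b, split = " ")),
                      unlist(strsplit(c_str, split = " ")))

# Extracting runs of word characters drops the punctuation
by_word <- intersect(unlist(regmatches(b, gregexpr("\\w+", b))),
                     unlist(regmatches(c_str, gregexpr("\\w+", c_str))))

by_space
#[1] "travels" "Africa"
by_word
#[1] "Roy"     "travels" "Africa"
```

If the strings may contain punctuation, the regex-based extraction is probably the safer choice.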

This approach generalizes to n vectors:
a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"
c <- "Bob also travels Africa for trips but lives in the US unlike Roy."

library(stringi);library(qdapTools)
X <- stri_extract_all_words(list(a, b, c))
X <- mtabulate(X) > 0
Y <- colSums(X) == nrow(X); names(Y)[Y]

[1] "Africa"  "Roy"     "travels"
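
If qdapTools is not available, the same idea (keep the words that occur in every string) can be sketched in base R. This is an assumed equivalent, not the original answer's code. The names c_str, word_sets, counts and common are illustrative.

```r
a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"
c_str <- "Bob also travels Africa for trips but lives in the US unlike Roy."
strings <- c(a, b, c_str)

# One set of distinct words per string
word_sets <- lapply(regmatches(strings, gregexpr("\\w+", strings)), unique)

# For each word, count how many strings it appears in
counts <- table(unlist(word_sets))

# Keep the words that appear in every string
common <- names(counts)[as.vector(counts) == length(word_sets)]
common
```

Taking unique words per string first means a word repeated inside one string is not counted twice.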

I wouldn't actually recommend this, but with "stra" and "strb" you could probably just do merge(stra, strb) …
The argument "sep" needs to be changed to "split" -> a_split