R: Count the common words in two strings
I have two strings and want to count the words they have in common:
a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"
Perhaps use intersect together with str_extract_all from the stringr package. For multiple strings, you can pass them as a list or a vector:
library(stringr)
vec1 <- c(a,b)
Reduce(`intersect`,str_extract_all(vec1, "\\w+"))
#[1] "Roy" "travels" "Africa"
To get the count (here with stri_extract_all_regex from the stringi package):
library(stringi)
length(Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+")))
#[1] 3
Or using base R:
Reduce(`intersect`,regmatches(vec1,gregexpr("\\w+", vec1)))
#[1] "Roy" "travels" "Africa"
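The base R one-liner above can be wrapped in a small function if the count is needed in several places. This is only a sketch: the name count_common_words and the ignore_case option are illustrative assumptions, not something from the answer.

```r
# Hypothetical helper (name and ignore_case argument are illustrative only):
# counts the words that occur in every element of a character vector.
count_common_words <- function(strings, ignore_case = FALSE) {
  if (ignore_case) strings <- tolower(strings)
  # Extract runs of word characters from each string (one vector per string)
  words <- regmatches(strings, gregexpr("\\w+", strings))
  # Keep only the words present in every string
  common <- Reduce(intersect, words)
  length(common)
}

a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"
count_common_words(c(a, b))
#[1] 3
```

Matching is case-sensitive by default, as in the original answer; setting ignore_case = TRUE would treat "Roy" and "roy" as the same word.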
You can use strsplit and intersect from the base library:
> a <- "Roy lives in Japan and travels to Africa"
> b <- "Roy travels Africa with this wife"
> a_split <- unlist(strsplit(a, split=" "))
> b_split <- unlist(strsplit(b, split=" "))
> length(intersect(a_split, b_split))
[1] 3
This approach generalizes to n vectors:
a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"
c <- "Bob also travels Africa for trips but lives in the US unlike Roy."
library(stringi);library(qdapTools)
X <- stri_extract_all_words(list(a, b, c))
X <- mtabulate(X) > 0
Y <- colSums(X) == nrow(X); names(Y)[Y]
[1] "Africa" "Roy" "travels"
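If qdapTools is not available, roughly the same idea (count in how many strings each word appears, then keep the words that appear in all of them) can be sketched in base R. This is an assumption-laden alternative rather than the answer's method: it splits on non-alphanumeric characters instead of using stri_extract_all_words, and the variable names are illustrative.

```r
a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"
c <- "Bob also travels Africa for trips but lives in the US unlike Roy."
strings <- c(a, b, c)

# Split on anything that is not a letter or digit, so "Roy." becomes "Roy"
words <- strsplit(strings, "[^[:alnum:]]+")
# Count each word at most once per string, then tally across all strings
counts <- table(unlist(lapply(words, unique)))
# A word is common if it appears in every string
common <- names(counts)[counts == length(strings)]
common
# "Africa" "Roy" "travels"
```

Calling length(common) then gives the number of shared words.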
I wouldn't actually recommend this, but with "stra" and "strb" you could probably just do merge(stra,strb)
…the argument "sep" needs to be changed to "split" -> a_split