Exact string matching across two columns in R
I have a data frame of the following form:
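Column1 = c('Elephant,Starship Enterprise,Cat','Random word','Word','Some more words, Even more words')
Column2 = c('Rat,Starship Enterprise,Elephant','Ocean','No','more')
# stringsAsFactors = FALSE keeps the columns as character on R < 4.0, which strsplit() requires
d1 = data.frame(Column1, Column2, stringsAsFactors = FALSE)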
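What I want to do is find and count the exact matches between the words in Column1 and Column2. Each column can contain multiple comma-separated words.
For example, in the first row there are two common entries: a) Starship Enterprise and b) Elephant. In row 4, however, even though the word "more" appears in both columns, the exact strings (Some more words and Even more words) do not. The expected output is the number of exact matches per row: 2 for the first row and 0 for row 4.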
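Any help would be greatly appreciated.

Split each column on commas and count the size of the intersection of the resulting entries: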
mapply(function(x, y) length(intersect(x, y)),
       strsplit(d1$Column1, ","), strsplit(d1$Column2, ","))
#[1] 2 0 0 0
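Note that splitting on "," alone leaves any space after a comma attached to the entry (row 4 of the sample data yields " Even more words" with a leading space). If such spaces should be ignored, the entries can be trimmed before intersecting. The following is only a sketch under that assumption; `count_common` is a hypothetical helper name, not part of the original answer.

```r
# Hypothetical helper: count exact matches between two comma-separated
# strings, ignoring whitespace around each entry (an assumption about the data).
count_common <- function(x, y, sep = ",") {
  split_trim <- function(s) {
    trimws(strsplit(as.character(s), sep, fixed = TRUE)[[1]])
  }
  length(intersect(split_trim(x), split_trim(y)))
}

Column1 <- c("Elephant,Starship Enterprise,Cat", "Some more words, Even more words")
Column2 <- c("Rat, Starship Enterprise ,Elephant", "more")

# mapply() names the result after Column1, so drop the names for readability
common <- unname(mapply(count_common, Column1, Column2))
print(common)
# [1] 2 0
```

Because `intersect()` compares whole strings, "more" still does not match "Some more words" or "Even more words".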
Or a tidyverse approach:
library(tidyverse)
d1 %>%
  mutate(Common = map2_dbl(Column1, Column2, ~
    length(intersect(str_split(.x, ",")[[1]], str_split(.y, ",")[[1]]))))
#                           Column1                          Column2 Common
#1 Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant      2
#2                      Random word                            Ocean      0
#3                             Word                               No      0
#4 Some more words, Even more words                             more      0
We can also use cSplit from splitstackshape:
library(splitstackshape)
library(data.table)
v1 <- cSplit(setDT(d1, keep.rownames = TRUE), 2:3, ",", "long")[,
        length(intersect(na.omit(Column1), na.omit(Column2))), rn]$V1
d1[, Common := v1][, rn := NULL][]
#                            Column1                          Column2 Common
#1: Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant      2
#2:                      Random word                            Ocean      0
#3:                             Word                               No      0
#4: Some more words, Even more words                             more      0
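The approaches above return only the count. Since the question also asks to find the matches, here is a minimal base R sketch that keeps the matched entries alongside the count. The `Matches` column name is an assumption made for illustration.

```r
d1 <- data.frame(
  Column1 = c("Elephant,Starship Enterprise,Cat", "Random word", "Word",
              "Some more words, Even more words"),
  Column2 = c("Rat,Starship Enterprise,Elephant", "Ocean", "No", "more"),
  stringsAsFactors = FALSE
)

# One character vector of shared entries per row;
# intersect() compares whole strings, so partial overlaps do not count.
matches <- Map(intersect,
               strsplit(d1$Column1, ",", fixed = TRUE),
               strsplit(d1$Column2, ",", fixed = TRUE))

d1$Common  <- lengths(matches)                                      # matches per row
d1$Matches <- vapply(matches, paste, character(1), collapse = ",")  # "" when none
print(d1[, c("Common", "Matches")])
```

For the sample data, `Common` is 2, 0, 0, 0 and `Matches` is "Elephant,Starship Enterprise" for the first row and an empty string for the others.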