Finding matches between two string columns in R
To solve a tag-migration problem, I have to compare two character columns and assess whether they agree with each other. In short, given a data frame like this:
old_tags          new_tags
burger            burger, american
italian, pizza    italian
latin, peruvian   peruvian, latin
french            pizza
I would like to add a third column, like this:
old_tags          new_tags           match
burger            burger, american   TRUE
italian, pizza    italian            TRUE
latin, peruvian   peruvian, latin    TRUE
french            pizza              FALSE
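So far I have tried several functions, such as str_match and str_detect, without success. When comparing pairs of strings that should be TRUE, they often return FALSE, for example the pair in row [3,] of the example.

Thanks very much in advance.

A base R approach is to split each string on the comma, use Map to find the words the two sides have in common (intersect), and create a logical value that is TRUE when at least one word is shared: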
df$match <- lengths(Map(intersect, strsplit(df$old_tags, ", "),
strsplit(df$new_tags, ", "))) > 0
df
#         old_tags         new_tags match
#1          burger burger, american  TRUE
#2  italian, pizza          italian  TRUE
#3 latin, peruvian  peruvian, latin  TRUE
#4          french            pizza FALSE
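As a side note on why the str_detect attempts may have returned FALSE for row 3: when a whole cell is used as the pattern, it is searched for as one piece of text, so "latin, peruvian" has to occur verbatim inside "peruvian, latin". The question does not show the exact calls that were tried, so the sketch below is only an assumption about them, and it uses base R's grepl(fixed = TRUE) as a stand-in for str_detect. It then contrasts that with the tag-by-tag comparison used above.

```r
old <- "latin, peruvian"
new <- "peruvian, latin"

# Whole-string search: is the text "latin, peruvian" contained in "peruvian, latin"?
whole_string <- grepl(old, new, fixed = TRUE)

# Tag-by-tag comparison: split both cells on ", " and look for shared tags
old_parts <- strsplit(old, ", ", fixed = TRUE)[[1]]
new_parts <- strsplit(new, ", ", fixed = TRUE)[[1]]
shared    <- intersect(old_parts, new_parts)
tag_wise  <- length(shared) > 0

print(whole_string)  # the order of the tags differs, so no literal match
print(shared)
print(tag_wise)
```

Splitting first makes the comparison independent of the order in which the tags appear.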
Data
df <- structure(list(old_tags = c("burger", "italian, pizza", "latin, peruvian",
"french"), new_tags = c("burger, american", "italian", "peruvian, latin",
"pizza")), row.names = c(NA, -4L), class = "data.frame")
A tidyverse-based possibility:
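library(dplyr)
library(purrr)    # map_chr() comes from purrr, not from dplyr or stringr
library(stringr)

df %>%
  # turn "italian, pizza" into the regex alternation "italian|pizza"
  mutate(patterns = map_chr(strsplit(old_tags, ", "), paste, collapse = "|"),
         Match = str_detect(new_tags, patterns)) %>%
  select(-patterns)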
         old_tags         new_tags Match
1          burger burger, american  TRUE
2  italian, pizza          italian  TRUE
3 latin, peruvian  peruvian, latin  TRUE
4          french            pizza FALSE
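One caveat with the alternation pattern is that a regex such as "latin" also matches inside a longer word. The sketch below is not part of the original answer: it uses made-up tags ("platinum" is hypothetical) and base R's grepl in place of str_detect to show the effect, and how wrapping the alternation in word boundaries avoids it. It assumes the tags contain no regex metacharacters.

```r
# Hypothetical data chosen to trigger a substring match
old_tags <- c("latin", "italian, pizza")
new_tags <- c("platinum", "italian")

# "italian, pizza" -> "italian|pizza", as in the approach above
patterns <- vapply(strsplit(old_tags, ", "), paste, character(1), collapse = "|")

# Plain alternation: "latin" is found inside "platinum"
naive <- mapply(grepl, patterns, new_tags,
                MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)

# Word-bounded alternation: only whole words count as matches
bounded_patterns <- paste0("\\b(", patterns, ")\\b")
bounded <- mapply(grepl, bounded_patterns, new_tags,
                  MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)

print(naive)
print(bounded)
```

Whether the stricter pattern is needed depends on the real tag vocabulary. With the four example rows in the question both variants give the same answer.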
Alternatively, we can use any together with map2_lgl:
library(tidyverse)
df %>%
mutate(match = map2_lgl(str_extract_all(old_tags, "\\w+"),
str_extract_all(new_tags, "\\w+"), ~ any(.x %in% .y)))
#         old_tags         new_tags match
#1          burger burger, american  TRUE
#2  italian, pizza          italian  TRUE
#3 latin, peruvian  peruvian, latin  TRUE
#4          french            pizza FALSE
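Extracting "\\w+" treats every word as a separate tag, so a multi-word tag such as "fast food" would be compared word by word. If tags may contain spaces, one option is to split on commas only and trim the surrounding whitespace. The helper name tags_overlap and the extra rows below are hypothetical, added purely for illustration; this is a minimal base R sketch, not the method from the answers above.

```r
# Hypothetical helper: TRUE when two comma-separated tag strings share a whole tag
tags_overlap <- function(old, new) {
  split_tags <- function(x) trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  any(split_tags(old) %in% split_tags(new))
}

# The question's rows with irregular spacing, plus one made-up multi-word pair
old_tags <- c("burger", "italian, pizza", "latin,peruvian", "fast food", "french")
new_tags <- c("burger, american", "italian", "peruvian ,  latin", "food truck", "pizza")

match <- mapply(tags_overlap, old_tags, new_tags, USE.NAMES = FALSE)
print(match)
```

Here "fast food" and "food truck" are not reported as a match, because the comparison is made on whole tags rather than on single words.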