Finding matches between two string columns in R


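To solve a tag migration problem, I have to compare two character columns and assess whether the two columns agree with each other.

In short, given a data frame like this: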

old_tags            new_tags
burger              burger, american
italian, pizza      italian
latin, peruvian     peruvian, latin
french              pizza
I want to add a third column, like this:

old_tags            new_tags            match
burger              burger, american    TRUE
italian, pizza      italian             TRUE
latin, peruvian     peruvian, latin     TRUE
french              pizza               FALSE
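So far I have tried functions such as str_match, str_detect, and so on, but none of them worked. When comparing pairs of strings that should actually be TRUE, they often return FALSE, for example the pair I gave in [3,].

Thanks a lot in advance.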
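
One plausible reason for those FALSE results (an assumption, since the original attempts are not shown) is that a detect-style call treats the whole old_tags string as a single pattern, so the tags have to appear in new_tags in the same order and with the same separators. The base R sketch below uses grepl with fixed = TRUE as a stand-in for that kind of whole-string comparison; the name whole_string is illustrative only.

```r
old_tags <- c("burger", "italian, pizza", "latin, peruvian", "french")
new_tags <- c("burger, american", "italian", "peruvian, latin", "pizza")

# Look for each complete old_tags string, taken literally, inside new_tags.
whole_string <- mapply(function(p, s) grepl(p, s, fixed = TRUE),
                       old_tags, new_tags, USE.NAMES = FALSE)

whole_string
# [1]  TRUE FALSE FALSE FALSE
# Rows 2 and 3 share tags, but the full string "latin, peruvian"
# never occurs inside "peruvian, latin".
```

Comparing individual tags rather than whole strings avoids this.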

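One base R approach is to split the strings on the comma, use Map to find the words that intersect, and create a logical value that is TRUE if at least one value intersects.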

df$match <- lengths(Map(intersect, strsplit(df$old_tags, ", "), 
                    strsplit(df$new_tags, ", "))) > 0

df
#         old_tags         new_tags match
#1          burger burger, american  TRUE
#2  italian, pizza          italian  TRUE
#3 latin, peruvian  peruvian, latin  TRUE
#4          french            pizza FALSE
Data

df <- structure(list(old_tags = c("burger", "italian, pizza", "latin, peruvian", 
"french"), new_tags = c("burger, american", "italian", "peruvian, latin", 
"pizza")), row.names = c(NA, -4L), class = "data.frame")
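
If the real tags are less tidy than the sample data (mixed case, or inconsistent spacing around commas), the same split-and-intersect idea can be wrapped in a small helper that normalises the tags first. This is a minimal sketch under those assumptions; tags_overlap and split_tags are hypothetical names, not part of any package.

```r
# Hypothetical helper: TRUE when old and new share at least one tag,
# ignoring case and whitespace around the commas.
tags_overlap <- function(old, new) {
  split_tags <- function(x) strsplit(tolower(trimws(x)), "\\s*,\\s*")
  lengths(Map(intersect, split_tags(old), split_tags(new))) > 0
}

old <- c("Burger", "italian,pizza", "french")
new <- c("burger , american", "Italian", "pizza")

res <- unname(tags_overlap(old, new))
res
# [1]  TRUE  TRUE FALSE
```

Whether case should be ignored depends on how the tags were entered, so drop tolower if "Pizza" and "pizza" are meant to be different tags.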

A tidyverse-base possibility:

library(dplyr)
library(stringr)
library(purrr)  # provides map_chr()

df %>% 
   mutate(patterns = map_chr(strsplit(old_tags, ", "),paste,collapse="|"),
          Match = str_detect(new_tags, patterns)) %>% 
   select(-patterns)
         old_tags         new_tags Match
1          burger burger, american  TRUE
2  italian, pizza          italian  TRUE
3 latin, peruvian  peruvian, latin  TRUE
4          french            pizza FALSE
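
One caveat with an alternation pattern such as "latin|peruvian" is that it matches substrings, so "latin" is also found inside "latino". A possible refinement, sketched here in base R with grepl so it runs without extra packages, is to wrap the alternation in word boundaries. The sample tags below are made up for illustration, and tags containing regex metacharacters would additionally need escaping.

```r
old_tags <- c("latin, peruvian", "french")
new_tags <- c("latino", "french, bistro")

# Build "latin|peruvian"-style patterns from the comma-separated tags.
alt <- vapply(strsplit(old_tags, ", "), paste, character(1), collapse = "|")

# Plain alternation: matches "latin" inside "latino".
plain <- mapply(grepl, alt, new_tags, USE.NAMES = FALSE)

# Word-bounded alternation: only whole words count.
bounded <- mapply(grepl, paste0("\\b(?:", alt, ")\\b"), new_tags,
                  MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)

plain
# [1] TRUE TRUE
bounded
# [1] FALSE  TRUE
```

The same bounded pattern string should also work as the pattern argument of str_detect, since stringr's regex engine supports \\b and non-capturing groups.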

Or we can use any:

library(tidyverse)
df %>% 
   mutate(match = map2_lgl(str_extract_all(old_tags, "\\w+"), 
               str_extract_all(new_tags, "\\w+"),  ~ any(.x %in% .y)))
#         old_tags         new_tags match
#1          burger burger, american  TRUE
#2  italian, pizza          italian  TRUE
#3 latin, peruvian  peruvian, latin  TRUE
#4          french            pizza FALSE
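
For checking a migration it can also help to see which tags are shared, not only whether any are. The sketch below reuses the split-and-compare idea from the answers in base R; the shared column is an illustrative addition, not something the question asked for.

```r
df <- data.frame(
  old_tags = c("burger", "italian, pizza", "latin, peruvian", "french"),
  new_tags = c("burger, american", "italian", "peruvian, latin", "pizza"),
  stringsAsFactors = FALSE
)

# Tags present in both columns, row by row.
common <- Map(intersect, strsplit(df$old_tags, ", "), strsplit(df$new_tags, ", "))

# Collapse each row's shared tags into one string ("" when nothing is shared).
df$shared <- vapply(common, paste, character(1), collapse = ", ")
df$match  <- nzchar(df$shared)

df$shared
# [1] "burger"          "italian"         "latin, peruvian" ""
```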