R: regex matching of elements within a string


I want to know whether any element of one string occurs in another string.

My data has millions of rows and is structured as follows:

library(data.table)
dt <- data.table(product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
                 stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"), stringsAsFactors = FALSE)


dt[, product.2 := shift(product, type = "lead")]
dt[, product.3 := shift(product, n = 2, type = "lead")]

> dt
    product     stock product.2 product.3
 1:       A         A         B         C
 2:       B       A,B         C     A,C,E
 3:       C     A,B,C     A,C,E       A,B
 4:   A,C,E   A,B,C,E       A,B     A,B,C
 5:     A,B   A,B,C,E     A,B,C         D
 6:   A,B,C   A,B,C,E         D         A
 7:       D A,B,C,D,E         A         B
 8:       A         A         B         A
 9:       B       A,B         A         A
10:       A       A,B         A     A,B,C
11:       A         A     A,B,C         D
12:   A,B,C     A,B,C         D         D
13:       D   A,B,C,D         D      <NA>
14:       D   A,B,C,D      <NA>      <NA>
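This question is part of an earlier Stack Overflow question.

Edit 20 August 2019: added a third expected outcome.

Split product.3 at the commas, then use grepl to check whether each element is present in product.2 or stock.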
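
A minimal sketch of that idea in base R, on plain character vectors rather than on the data.table. The helper name `any_element_in` is hypothetical (it is not from the original answer), and `fixed = TRUE` is an assumption made here so that the elements are treated as literal text rather than as regular expressions.

```r
# Hypothetical helper: TRUE when any comma-separated element of `needles`
# occurs in the string `haystack`; a missing value on either side gives FALSE.
any_element_in <- function(needles, haystack) {
  if (is.na(needles) || is.na(haystack)) return(FALSE)
  parts <- strsplit(needles, ",", fixed = TRUE)[[1]]
  any(vapply(parts, grepl, logical(1), x = haystack, fixed = TRUE))
}

# Made-up values in the shape of the question's columns.
product.3 <- c("C", "A,C,E", NA)
product.2 <- c("B", "C", "D")

# Apply the helper pairwise: element i of product.3 against element i of product.2.
outcome1 <- unname(mapply(any_element_in, product.3, product.2))
print(outcome1)
```

Because the helper is called once per row, this is likely to be slow on millions of rows; it is meant to show the logic, not to be the fastest option.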


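According to ?str_detect:

Vectorised over string and pattern. Equivalent to grepl(pattern, x). See str_which() for an equivalent to grep(pattern, x).

So one option is to replace each , with | (regex OR) and compare the corresponding elements of 'product.2' and 'product.3' directly, and likewise 'stock' and 'product.3'. The NA elements are then replaced with FALSE using set.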
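
To see what the comma-to-| conversion does, here is a small sketch on two short vectors. The values are made up to mirror the table in the question, the variable names are for illustration only, and the stringr package is assumed to be installed.

```r
library(stringr)

x       <- c("C", "A,B", "D", NA)      # plays the role of product.2
targets <- c("A,C,E", "D", "A", "A")   # plays the role of product.3

# "A,C,E" becomes the regex "A|C|E", i.e. A or C or E.
pattern <- str_replace_all(targets, ",", "|")

# str_detect is vectorised over both arguments, so element i of x is
# tested against pattern i. A missing string yields NA.
hit <- str_detect(x, pattern)

# Treat missing values as "no match".
hit[is.na(hit)] <- FALSE
print(hit)
```

With base R's grepl the pattern must be a single string, so the pairwise comparison would need something like mapply; str_detect handles it in one call.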

Update, following the update to the OP's question:

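library(data.table)
library(stringr)

# outcome1 / outcome2: does product.3 share any element with product.2 / stock?
dt[, paste0("outcome", 1:2) := lapply(.SD, function(x) 
     str_detect(product.3, str_replace_all(x, ",", "|"))), 
          .SDcols = c('product.2', 'stock')]
# outcome3: are all elements of product.3 present in stock?
dt[, outcome3 := unlist(Map(function(x, y) {
       x1 <- sort(x[!is.na(x)])
       y1 <- sort(y[!is.na(y)])
       length(intersect(x1, y1)) == length(x1)},
       str_extract_all(product.3, "[A-Z]"), 
       str_extract_all(stock, "[A-Z]"))) & !is.na(product.3)]

# Replace NA in outcome1 and outcome2 (columns 5 and 6) with FALSE.
for(j in names(dt)[5:6]) set(dt, i = which(is.na(dt[[j]])), j = j, value = FALSE)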
dt
#    product     stock product.2 product.3 outcome1 outcome2 outcome3
# 1:       A         A         B         C    FALSE    FALSE    FALSE
# 2:       B       A,B         C     A,C,E     TRUE     TRUE    FALSE
# 3:       C     A,B,C     A,C,E       A,B     TRUE     TRUE     TRUE
# 4:   A,C,E   A,B,C,E       A,B     A,B,C     TRUE     TRUE     TRUE
# 5:     A,B   A,B,C,E     A,B,C         D    FALSE    FALSE    FALSE
# 6:   A,B,C   A,B,C,E         D         A    FALSE     TRUE     TRUE
# 7:       D A,B,C,D,E         A         B    FALSE     TRUE     TRUE
# 8:       A         A         B         A    FALSE     TRUE     TRUE
# 9:       B       A,B         A         A     TRUE     TRUE     TRUE
#10:       A       A,B         A     A,B,C     TRUE     TRUE    FALSE
#11:       A         A     A,B,C         D    FALSE    FALSE    FALSE
#12:   A,B,C     A,B,C         D         D     TRUE    FALSE    FALSE
#13:       D   A,B,C,D         D      <NA>    FALSE    FALSE    FALSE
#14:       D   A,B,C,D      <NA>      <NA>    FALSE    FALSE    FALSE
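
The outcome3 step above is a subset test: every element of product.3 must be present in stock. Below is a sketch of the same test in base R, assuming comma-separated elements. `all_elements_in` is a hypothetical helper name and is not part of the original answer.

```r
# Hypothetical helper: TRUE when every comma-separated element of `needles`
# is also an element of `haystack`; a missing value on either side gives FALSE.
all_elements_in <- function(needles, haystack) {
  if (is.na(needles) || is.na(haystack)) return(FALSE)
  n <- strsplit(needles, ",", fixed = TRUE)[[1]]
  h <- strsplit(haystack, ",", fixed = TRUE)[[1]]
  all(n %in% h)
}

# Made-up values taken from rows 2, 3 and 13 of the table above.
product.3 <- c("A,C,E", "A,B", NA)
stock     <- c("A,B", "A,B,C", "A,B,C,D")

outcome3 <- unname(mapply(all_elements_in, product.3, stock))
print(outcome3)
```

Comparing whole elements with %in% also keeps working if the elements are longer than one character, which the "[A-Z]" extraction does not.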

Great! Shouldn't x and pattern in str_detect be switched?
Do you have any thoughts on the third question, which was edited in after the original answer?
Sorry, I haven't had a chance to look at it.
Does this approach scale well, given the calls to grepl?
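
On the argument order raised in the comments, a short base-R sketch may help. It rests on an assumption: for an "any element in common" check with single-letter elements, as in the sample data, it should not matter which string is turned into the pattern. The helper name `overlap` is hypothetical.

```r
# Hypothetical helper: turn `elements` into an alternation pattern and
# search for it in `string` (both are single strings here).
overlap <- function(string, elements) {
  grepl(gsub(",", "|", elements, fixed = TRUE), string)
}

a <- "A,C,E"
b <- "A,B"

same_both_ways <- overlap(a, b) == overlap(b, a)   # both find the shared "A"
no_match       <- overlap("A,B", "D")              # nothing in common

# Caveat: regex matching is substring matching, so with longer elements
# "A" is found inside "AB" even though "A" is not an element of "AB".
substring_hit <- overlap("AB", "A")

print(c(same_both_ways, no_match, substring_hit))
```

If the real data has elements longer than one character, comparing split elements (for example with strsplit and %in%) avoids the substring caveat.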
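library(data.table)
library(stringr)
# Turn product.3 into an alternation pattern ("A,C,E" -> "A|C|E") and
# look for it in product.2 (outcome1) and in stock (outcome2).
dt[, outcome1 := str_detect(product.2, str_replace_all(product.3, ",", "|"))]
dt[, outcome2 := str_detect(stock, str_replace_all(product.3, ",", "|"))]
# Replace NA in outcome1 and outcome2 (columns 5 and 6) with FALSE.
for(j in names(dt)[5:6]) set(dt, i = which(is.na(dt[[j]])), j = j, value = FALSE)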
dt
#    product     stock product.2 product.3 outcome1 outcome2
# 1:       A         A         B         C    FALSE    FALSE
# 2:       B       A,B         C     A,C,E     TRUE     TRUE
# 3:       C     A,B,C     A,C,E       A,B     TRUE     TRUE
# 4:   A,C,E   A,B,C,E       A,B     A,B,C     TRUE     TRUE
# 5:     A,B   A,B,C,E     A,B,C         D    FALSE    FALSE
# 6:   A,B,C   A,B,C,E         D         A    FALSE     TRUE
# 7:       D A,B,C,D,E         A         B    FALSE     TRUE
# 8:       A         A         B         A    FALSE     TRUE
# 9:       B       A,B         A         A     TRUE     TRUE
#10:       A       A,B         A     A,B,C     TRUE     TRUE
#11:       A         A     A,B,C         D    FALSE    FALSE
#12:   A,B,C     A,B,C         D         D     TRUE    FALSE
#13:       D   A,B,C,D         D      <NA>    FALSE    FALSE
#14:       D   A,B,C,D      <NA>      <NA>    FALSE    FALSE
# The same two columns in one step: apply str_detect to each column in .SDcols.
dt[, paste0("outcome", 1:2) := lapply(.SD, str_detect, 
     pattern = str_replace_all(product.3, ",", "|")), .SDcols = c("product.2", "stock")]