R: regex matching of the elements in a string
I want to know whether any element of a string appears in other strings. My data contains millions of rows and is structured as follows:
library(data.table)

# Example data: each cell holds a comma-separated list of product codes
dt <- data.table(product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
                 stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"), stringsAsFactors = FALSE)
# product.2 and product.3 hold the product value one and two rows ahead
dt[, product.2 := shift(product, type = "lead")]
dt[, product.3 := shift(product, n = 2, type = "lead")]
> dt
    product     stock product.2 product.3
 1:       A         A         B         C
 2:       B       A,B         C     A,C,E
 3:       C     A,B,C     A,C,E       A,B
 4:   A,C,E   A,B,C,E       A,B     A,B,C
 5:     A,B   A,B,C,E     A,B,C         D
 6:   A,B,C   A,B,C,E         D         A
 7:       D A,B,C,D,E         A         B
 8:       A         A         B         A
 9:       B       A,B         A         A
10:       A       A,B         A     A,B,C
11:       A         A     A,B,C         D
12:   A,B,C     A,B,C         D         D
13:       D   A,B,C,D         D      <NA>
14:       D   A,B,C,D      <NA>      <NA>
This question is part of another Stack Overflow question.
Edit (20 August 2019): added a third expected outcome.
Split product.3 at the commas, then use grepl to check whether each element is present in product.2 or in stock.
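A minimal sketch of that idea in base R, assuming each cell is a plain comma-separated list of codes. The helper name any_element_in and the short vectors are hypothetical and only mirror a few rows of the example data; fixed = TRUE makes grepl treat each element as a literal string rather than a regular expression.

```r
# Hypothetical helper: TRUE if any comma-separated element of `needles`
# occurs somewhere in the string `haystack`; NA inputs give FALSE.
any_element_in <- function(needles, haystack) {
  if (is.na(needles) || is.na(haystack)) return(FALSE)
  parts <- strsplit(needles, ",", fixed = TRUE)[[1]]
  any(vapply(parts, function(p) grepl(p, haystack, fixed = TRUE), logical(1)))
}

# Illustrative values in the same shape as the columns of dt
product.2 <- c("B", "C", "D", NA)
product.3 <- c("C", "A,C,E", "A", NA)
stock     <- c("A", "A,B", "A,B,C,E", "A,B,C,D")

# Row-wise comparison of product.3 against product.2 and against stock
outcome1 <- mapply(any_element_in, product.3, product.2, USE.NAMES = FALSE)
outcome2 <- mapply(any_element_in, product.3, stock, USE.NAMES = FALSE)
```

Because mapply calls the helper once per row, this is easy to read but probably slow on millions of rows.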
According to ?str_detect:
Vectorised over string and pattern. Equivalent to grepl(pattern, x). See str_which() for an equivalent to grep(pattern, x).
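To make the "vectorised over string and pattern" part concrete, here is a small sketch with made-up vectors (string, pattern and the result names are illustrative only). Element i of the string is tested against element i of the pattern, and a missing string gives NA rather than FALSE.

```r
library(stringr)

string  <- c("A,B", "C", "D", NA)
pattern <- c("A", "A|C", "A|B", "A")

# Pairwise: string[i] is tested against pattern[i]
detected <- str_detect(string, pattern)

# str_which() returns the positions of the matching elements, like grep()
positions <- str_which(c("A,B", "C", "D"), "A|C")
```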
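So, one option is to replace the , with | (so that the pattern matches any one of the elements) and compare the corresponding elements of 'product.2' and 'product.3' directly, and likewise 'stock' and 'product.3'. The NA results are then replaced with FALSE using set.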
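One caveat, sketched below under an assumption that goes beyond the example data: a pattern such as A|B matches substrings, so if codes could be longer than one character (say "AB"), "A" would also match inside "AB". The helper whole_element_pattern is hypothetical; it anchors every alternative to a whole comma-separated element and assumes the codes contain no regex metacharacters.

```r
library(stringr)

# Hypothetical helper: turn "A,AB" into "(^|,)(A|AB)(,|$)" so that each
# alternative has to match a complete comma-separated element.
# NA stays NA instead of becoming the literal string "NA".
whole_element_pattern <- function(x) {
  ifelse(is.na(x), NA_character_,
         paste0("(^|,)(", str_replace_all(x, ",", "|"), ")(,|$)"))
}

# Made-up values with a multi-character code
stock   <- c("AB,C", "A,C")
product <- c("A", "A")

# Plain alternation: "A" is found inside "AB"
plain <- str_detect(stock, str_replace_all(product, ",", "|"))

# Anchored alternation: only the whole element "A" counts
anchored <- str_detect(stock, whole_element_pattern(product))
```

With the single-letter codes in the question both forms should agree, so the simpler pattern is enough there.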
Update
Regarding the update in the OP's question:
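library(data.table)
library(stringr)
# outcome1 / outcome2: does product.3 share an element with product.2 / stock?
dt[, paste0("outcome", 1:2) := lapply(.SD, function(x)
    str_detect(product.3, str_replace_all(x, ",", "|"))),
   .SDcols = c("product.2", "stock")]
# outcome3: is every element of product.3 also in stock?
# (assumes no repeated elements within product.3)
dt[, outcome3 := unlist(Map(function(x, y) {
    x1 <- sort(x[!is.na(x)])
    y1 <- sort(y[!is.na(y)])
    length(intersect(x1, y1)) == length(x1)
  },
  str_extract_all(product.3, "[A-Z]"),
  str_extract_all(stock, "[A-Z]"))) & !is.na(product.3)]
# Replace the NA results in outcome1 and outcome2 with FALSE, by reference
for (j in names(dt)[5:6]) set(dt, i = which(is.na(dt[[j]])), j = j, value = FALSE)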
dt
#    product     stock product.2 product.3 outcome1 outcome2 outcome3
# 1:       A         A         B         C    FALSE    FALSE    FALSE
# 2:       B       A,B         C     A,C,E     TRUE     TRUE    FALSE
# 3:       C     A,B,C     A,C,E       A,B     TRUE     TRUE     TRUE
# 4:   A,C,E   A,B,C,E       A,B     A,B,C     TRUE     TRUE     TRUE
# 5:     A,B   A,B,C,E     A,B,C         D    FALSE    FALSE    FALSE
# 6:   A,B,C   A,B,C,E         D         A    FALSE     TRUE     TRUE
# 7:       D A,B,C,D,E         A         B    FALSE     TRUE     TRUE
# 8:       A         A         B         A    FALSE     TRUE     TRUE
# 9:       B       A,B         A         A     TRUE     TRUE     TRUE
#10:       A       A,B         A     A,B,C     TRUE     TRUE    FALSE
#11:       A         A     A,B,C         D    FALSE    FALSE    FALSE
#12:   A,B,C     A,B,C         D         D     TRUE    FALSE    FALSE
#13:       D   A,B,C,D         D      <NA>    FALSE    FALSE    FALSE
#14:       D   A,B,C,D      <NA>      <NA>    FALSE    FALSE    FALSE
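The outcome3 step above reads as a subset test: it is TRUE only when every element of product.3 is also found in stock. A minimal base R sketch of the same check, assuming plain comma-separated codes; the helper all_elements_in and the short vectors are hypothetical and mirror a few rows of the example data.

```r
# Hypothetical helper: TRUE if every comma-separated element of `needles`
# is also an element of `haystack`; NA inputs give FALSE.
all_elements_in <- function(needles, haystack) {
  if (is.na(needles) || is.na(haystack)) return(FALSE)
  n <- strsplit(needles, ",", fixed = TRUE)[[1]]
  h <- strsplit(haystack, ",", fixed = TRUE)[[1]]
  all(n %in% h)
}

# Illustrative values in the same shape as the columns of dt
product.3 <- c("A,C,E", "A,B", "A,B,C", NA)
stock     <- c("A,B", "A,B,C", "A,B", "A,B,C,D")

outcome3 <- mapply(all_elements_in, product.3, stock, USE.NAMES = FALSE)
```

Unlike the regex version, this compares whole elements, so it does not depend on the codes being single capital letters.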
Great! Shouldn't x and pattern in str_detect be switched? Do you have any thoughts on the third question, which was added in an edit after the original answer?
Sorry, I haven't had a chance to look at it.
Does this approach scale well, given the calls to grepl?
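library(data.table)
library(stringr)
# First two outcomes only; here product.3 supplies the pattern
# outcome1: does any element of product.3 occur in product.2?
dt[, outcome1 := str_detect(product.2, str_replace_all(product.3, ",", "|"))]
# outcome2: does any element of product.3 occur in stock?
dt[, outcome2 := str_detect(stock, str_replace_all(product.3, ",", "|"))]
# Replace the NA results in outcome1 and outcome2 with FALSE, by reference
for (j in names(dt)[5:6]) set(dt, i = which(is.na(dt[[j]])), j = j, value = FALSE)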
dt
#    product     stock product.2 product.3 outcome1 outcome2
# 1:       A         A         B         C    FALSE    FALSE
# 2:       B       A,B         C     A,C,E     TRUE     TRUE
# 3:       C     A,B,C     A,C,E       A,B     TRUE     TRUE
# 4:   A,C,E   A,B,C,E       A,B     A,B,C     TRUE     TRUE
# 5:     A,B   A,B,C,E     A,B,C         D    FALSE    FALSE
# 6:   A,B,C   A,B,C,E         D         A    FALSE     TRUE
# 7:       D A,B,C,D,E         A         B    FALSE     TRUE
# 8:       A         A         B         A    FALSE     TRUE
# 9:       B       A,B         A         A     TRUE     TRUE
#10:       A       A,B         A     A,B,C     TRUE     TRUE
#11:       A         A     A,B,C         D    FALSE    FALSE
#12:   A,B,C     A,B,C         D         D     TRUE    FALSE
#13:       D   A,B,C,D         D      <NA>    FALSE    FALSE
#14:       D   A,B,C,D      <NA>      <NA>    FALSE    FALSE
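# Same two columns in one call: apply str_detect to product.2 and stock
# with the pattern built from product.3
dt[, paste0("outcome", 1:2) := lapply(.SD, str_detect,
    pattern = str_replace_all(product.3, ",", "|")),
   .SDcols = c("product.2", "stock")]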