R: flag rows where one column contains a word that matches another column
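Suppose A and B are columns in a dataset. I want to build a fuzzy-matching rule: if at least one word in column A matches a word in column B, ignoring the words "bank" and "of", assign 1 in a new column; if there are no matches, assign 0. I want to do this in R.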
A                 B
BANK OF AMERICA   CHASE BANK
BANK OF AMERICA   BANK OF AMERICA, N.A.
BANK OF HOPE      HOPE BANK
T.D BANK          CHASE BANK
Expected output

A                 B                       C
BANK OF AMERICA   CHASE BANK              0
BANK OF AMERICA   BANK OF AMERICA, N.A.   1
BANK OF HOPE      HOPE BANK               1
T.D BANK          CHASE BANK              0
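I believe a combination of regex and apply works well here. The function below splits both strings into words, drops "BANK" and "OF", and returns 1 if any remaining word appears in both columns.

df <- data.frame(A = c('BANK OF AMERICA', 'BANK OF AMERICA', 'BANK OF HOPE', 'T.D BANK'),
                 B = c('CHASE BANK', 'BANK OF AMERICA, N.A.', 'HOPE BANK', 'CHASE BANK'),
                 stringsAsFactors = FALSE)

f <- function(x) {
  # Words that never count as a match (plus empty pieces left by the split)
  stop_words <- c("BANK", "OF", "")
  # Split on runs of whitespace and commas, then drop the stop words
  left  <- setdiff(strsplit(x[1], "[[:space:],]+")[[1]], stop_words)
  right <- setdiff(strsplit(x[2], "[[:space:],]+")[[1]], stop_words)
  # 1 if at least one word of A also occurs in B, otherwise 0
  as.integer(any(left %in% right))
}

df$C <- apply(df, 1, f)
df
#                 A                     B C
# 1 BANK OF AMERICA            CHASE BANK 0
# 2 BANK OF AMERICA BANK OF AMERICA, N.A. 1
# 3    BANK OF HOPE             HOPE BANK 1
# 4        T.D BANK            CHASE BANK 0

The following base R option might help: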
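# Split columns A and B into words, intersect them row by row, and
# flag a row when the shared words include something other than the stop words.
# The leading + turns the logical result into 0/1.
df$C <- +do.call(
  function(...) mapply(function(...) any(!intersect(...) %in% c("BANK", "OF", "")), ...),
  Map(function(x) strsplit(x, "[[:punct:][:blank:]]", perl = TRUE), df[c("A", "B")], USE.NAMES = FALSE)
)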
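The one-liner above packs three steps together: split each string into words, intersect the two word sets, and ignore the stop words. As a rough sketch of the same idea in unrolled form (the helper names `tokenize` and `has_match` are made up for illustration and are not part of the answer), it might look like this:

```r
# Hypothetical helpers, written only to illustrate the steps.
stop_words <- c("BANK", "OF")

# Split a string on punctuation and blanks, then drop empty pieces and stop words.
tokenize <- function(s) {
  words <- strsplit(s, "[[:punct:][:blank:]]+")[[1]]
  setdiff(words, c(stop_words, ""))
}

# 1 if the two strings share at least one remaining word, otherwise 0.
has_match <- function(a, b) {
  as.integer(length(intersect(tokenize(a), tokenize(b))) > 0)
}

df <- data.frame(A = c("BANK OF AMERICA", "BANK OF AMERICA", "BANK OF HOPE", "T.D BANK"),
                 B = c("CHASE BANK", "BANK OF AMERICA, N.A.", "HOPE BANK", "CHASE BANK"),
                 stringsAsFactors = FALSE)

# Apply the check to each (A, B) pair.
df$C <- mapply(has_match, df$A, df$B, USE.NAMES = FALSE)
```

Note that splitting on punctuation turns "T.D" into "T" and "D" and "N.A." into "N" and "A"; whether that is desirable depends on your data.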
Here is another option, using dplyr and stringr:

library(dplyr)
library(stringr)

df <- data.frame(A = c(rep("BANK OF AMERICA", 2), "BANK OF HOPE", "T.D BANK"),
                 B = c("CHASE BANK", "BANK OF AMERICA, N.A.", "HOPE BANK", "CHASE BANK"),
                 stringsAsFactors = FALSE)

df <- df %>%
  mutate(C = str_remove_all(B, "BANK|OF|,"),       # remove stop words and commas
         C = str_squish(C),                         # trim and collapse repeated whitespace
         C = str_replace_all(C, " ", "|")) %>%      # replace whitespace with | to build a regex
  mutate(D = as.numeric(str_detect(A, C))) %>%      # 1 if any remaining word of B is found in A
  select(A, B, D)

df
#                 A                     B D
# 1 BANK OF AMERICA            CHASE BANK 0
# 2 BANK OF AMERICA BANK OF AMERICA, N.A. 1
# 3    BANK OF HOPE             HOPE BANK 1
# 4        T.D BANK            CHASE BANK 0
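One thing to keep in mind with a regex built from the words of B: the pattern is searched anywhere in A, so a word can match inside a longer word. A small sketch of the difference, using a made-up name that is not in the question's data and base R's `grepl()` so that it runs without extra packages (the same pattern string should also work with `str_detect()`):

```r
# Made-up example: "HOPE" is only part of "HOPEWELL"
a <- "BANK OF HOPEWELL"
words_b <- c("HOPE")   # words left from B after removing stop words

loose  <- paste(words_b, collapse = "|")   # "HOPE"
strict <- paste0("\\b(", loose, ")\\b")    # "\\b(HOPE)\\b"

grepl(loose,  a, perl = TRUE)   # TRUE: matches inside HOPEWELL
grepl(strict, a, perl = TRUE)   # FALSE: HOPE is not a whole word in a
```

Wrapping the alternation in `\\b( ... )\\b` is one way to restrict the match to whole words, if that is what you need.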
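Please always provide your data as R code in your question (it helps us get to a possible answer quickly) and show the expected result for the example. I guess it may be trivial in your case: only the second row gets a 1, if I also ignore the stop word "of".

Yes, in my case only the second row gets a 1. I will edit my post.

How many rows do you want to process (rough estimate)? If you want to process millions of rows, performance could be crucial given the answers so far (although it is hard to imagine you are dealing with that many different legal entities)...

Not that many, actually. About 10k rows.

OK, then please mark the best working answer with the green check mark (if there is one ;-)). Thanks!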
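For a table of roughly 10k rows, a simple row-by-row check is likely fast enough. A rough way to try this on your own machine is sketched below; the helper `share_word` is hypothetical, and the 10,000 rows are just the four example rows repeated, so real data may behave differently:

```r
# Hypothetical helper: TRUE if a and b share a word other than the stop words
share_word <- function(a, b, stop_words = c("BANK", "OF", "")) {
  wa <- setdiff(strsplit(a, "[[:punct:][:blank:]]+")[[1]], stop_words)
  wb <- setdiff(strsplit(b, "[[:punct:][:blank:]]+")[[1]], stop_words)
  length(intersect(wa, wb)) > 0
}

base_df <- data.frame(A = c("BANK OF AMERICA", "BANK OF AMERICA", "BANK OF HOPE", "T.D BANK"),
                      B = c("CHASE BANK", "BANK OF AMERICA, N.A.", "HOPE BANK", "CHASE BANK"),
                      stringsAsFactors = FALSE)

# Repeat the four example rows 2,500 times to get 10,000 rows
big <- base_df[rep(seq_len(nrow(base_df)), times = 2500), ]

# Time the row-wise check; the result depends on the machine
timing <- system.time(
  big$C <- as.integer(mapply(share_word, big$A, big$B, USE.NAMES = FALSE))
)
timing["elapsed"]
```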