Why does 'str_extract' only capture some of the values?
I have a table with a "membership type" column that includes the countless different membership levels we have used over the years.
example <- data.frame(membership = c(
  "Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N",
  "Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N",
  "Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G",
  "Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N",
  "Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N ",
  "Individual (2 yr)",
  "Individual Producer (Yearly)",
  "Student Membership (Yearly)"
))
But this captures only half of the values, and I cannot find a pattern in what it skips:
1 Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
2 Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
3 Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G NA
4 Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N NA
5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
6 Legacy Payment ID #5238, Payment Record #0, Period Paid: 1 Flag: N NA
7 Legacy Payment ID #5287, Payment Record #0, Period Paid: 1 Flag: N NA
8 Legacy Payment ID #5306, Payment Record #0, Period Paid: 1 Flag: N NA
9 Legacy Payment ID #5739, Payment Record #0, Period Paid: 2 Flag: G NA
10 Individual (2 yr) NA
11 Individual Producer (Yearly) Yearly
12 Student Membership (Yearly) NA
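The only difference between rows 4 and 5 is the payment ID. Why does it find the search value only in row 5?
How can I fix it? And, more importantly, what is the cause?
We can collapse the patterns into a single regular expression, joined with "|", so that every row is tested against all of them: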
library(stringr)
library(dplyr)
pattern_vec <- c("Period Paid: 1","Period Paid: 2","Yearly", "2 yr")
example %>%
  mutate(term = str_extract(membership,
                            str_c(pattern_vec, collapse = "|")))
# membership term
#1 Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#2 Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
#3 Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G Period Paid: 1
#4 Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#6 Individual (2 yr) 2 yr
#7 Individual Producer (Yearly) Yearly
#8 Student Membership (Yearly) Yearly
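Note from the OP:
A post that helped me (the OP) understand the explanation above:
When fed a single pattern, str_replace_all compares that pattern against every element. However, if you pass it a vector of patterns, it tries to respect the order: it compares the first pattern with the first object, then the second pattern with the second object, and so on.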
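A minimal sketch of that pairing behaviour, using a made-up vector `x` rather than the question's data (the names `x`, `single`, `pairwise` and `combined` are illustrative assumptions). The pattern vector is written out at the same length as `x`, so no recycling is involved:

```r
library(stringr)

x <- c("b1", "a2", "a3", "b4")

# A single pattern is compared with every element of x.
single <- str_extract(x, "a")

# A pattern vector as long as x is matched pairwise:
# pattern i is compared with x[i] and with nothing else.
pairwise <- str_extract(x, c("a", "b", "a", "b"))

# Collapsing the patterns with "|" gives a single pattern again,
# so every element is tested against both alternatives.
combined <- str_extract(x, str_c(c("a", "b"), collapse = "|"))

print(single)
print(pairwise)
print(combined)
```

Here `pairwise` misses the first two elements because "b1" is only ever compared with "a", and "a2" only with "b". That is the same effect seen in the question.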
You can use a more sophisticated regex, with lookbehind and lookahead:
example$term <- example$membership %>%
str_extract("Period Paid: \\d+|(?<=\\().*(?=\\))")
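To see why this single pattern covers every row, it may help to try each alternative on its own. The sketch below uses three strings copied from the question; the variable names are illustrative assumptions, not part of the original answer.

```r
library(stringr)

x <- c(
  "Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G",
  "Individual (2 yr)",
  "Student Membership (Yearly)"
)

# First alternative: the literal text "Period Paid: " followed by
# one or more digits.
period <- str_extract(x, "Period Paid: \\d+")

# Second alternative: any text preceded by "(" (lookbehind) and
# followed by ")" (lookahead). The parentheses themselves are not
# part of the match.
in_parens <- str_extract(x, "(?<=\\().*(?=\\))")

# Joined with "|", the one combined pattern is applied to every element.
term <- str_extract(x, "Period Paid: \\d+|(?<=\\().*(?=\\))")

print(data.frame(period, in_parens, term))
```

Because the pattern has length one, each string is tested against both alternatives, so the position of a row in the data no longer affects the result.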
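Could you explain in more detail why this works while the example in the question does not? It is definitely a more elegant solution, but... why does the first version match only some of the rows?
My hunch, and it really is only a hunch, is that the first version is a concatenation of up to four different literal matches, whereas my regex targets a single (albeit with alternation), compact, and abstract match. I suspect the singleness is what matters, which is supported by @akrun's solution: it also matches every row and also merges the four different patterns into one.
I added a note to @akrun's solution from an RStudio Community post. It explains how the matching works and fully confirms my suspicion that the concatenation is what caused the incomplete matches.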
out1 <- example %>%
mutate(term = str_extract(membership, rep(pattern_vec, length.out = n())))
out2 <- example %>%
mutate(term = str_extract(membership, pattern_vec))
identical(out1, out2)
#[1] TRUE
out1
# membership term
#1 Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#2 Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
#3 Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G <NA>
#4 Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N <NA>
#5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#6 Individual (2 yr) <NA>
#7 Individual Producer (Yearly) Yearly
#8 Student Membership (Yearly) <NA>
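The NA rows follow directly from which pattern each row ends up paired with. Below is a small sketch, assuming the eight-row data from the question, that makes the pairing visible (the names `paired_pattern`, `pairing` and `matched` are made up for illustration). As far as I know, stringr 1.5.0 and later only recycle patterns of length 1, so there the unpadded call `str_extract(membership, pattern_vec)` is expected to raise an error instead of recycling silently; the explicit `rep()` form should work either way.

```r
library(stringr)

membership <- c(
  "Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N",
  "Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N",
  "Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G",
  "Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N",
  "Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N ",
  "Individual (2 yr)",
  "Individual Producer (Yearly)",
  "Student Membership (Yearly)"
)
pattern_vec <- c("Period Paid: 1", "Period Paid: 2", "Yearly", "2 yr")

# Repeat the four patterns until there is one per row, as recycling would.
paired_pattern <- rep(pattern_vec, length.out = length(membership))

# Show, for each row, the only pattern it is compared with and
# whether that pattern occurs in the row.
pairing <- data.frame(
  row = seq_along(membership),
  paired_pattern = paired_pattern,
  matched = str_detect(membership, paired_pattern)
)
print(pairing)
```

Row 4 is paired with "2 yr" and row 5 with "Period Paid: 1", which is why two rows that differ only in payment ID get different results.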
example$term <- example$membership %>%
str_extract("Period Paid: \\d+|(?<=\\().*(?=\\))")
example
membership term
1 Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
2 Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
3 Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G Period Paid: 1
4 Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
6 Individual (2 yr) 2 yr
7 Individual Producer (Yearly) Yearly
8 Student Membership (Yearly) Yearly