Why does 'str_extract' only capture some of the values?


I have a table with a "membership type" column that contains the countless different membership levels we have used over the years.

example <- data.frame(membership = c(
  "Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N",
  "Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N",
  "Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G",
  "Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N",
  "Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N ",
  "Individual (2 yr)",
  "Individual Producer (Yearly)",
  "Student Membership (Yearly)"
))
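I tried pulling out the term with `str_extract`, but it only captures half of the values, and I cannot find a pattern in what it skips.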

1   Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
2   Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N  Period Paid: 2
3   Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G  NA
4   Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N  NA
5   Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
6   Legacy Payment ID #5238, Payment Record #0, Period Paid: 1 Flag: N  NA
7   Legacy Payment ID #5287, Payment Record #0, Period Paid: 1 Flag: N  NA
8   Legacy Payment ID #5306, Payment Record #0, Period Paid: 1 Flag: N  NA
9   Legacy Payment ID #5739, Payment Record #0, Period Paid: 2 Flag: G  NA
10  Individual (2 yr)                                                   NA
11  Individual Producer (Yearly)                                        Yearly
12  Student Membership (Yearly)                                         NA
The only difference between rows 4 and 5 is the payment ID. Why does it find the search value only in row 5?


How can I fix it? And more importantly, what is the cause?

We can collapse the patterns into a single regular expression with `|`:

library(stringr)
library(dplyr)
pattern_vec <- c("Period Paid: 1","Period Paid: 2","Yearly", "2 yr")
example%>% 
      mutate(term = str_extract(membership,
      str_c(pattern_vec, collapse="|")))
#                                                       membership           term
#1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
#3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G Period Paid: 1
#4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
#6                                                   Individual (2 yr)           2 yr
#7                                        Individual Producer (Yearly)         Yearly
#8                                         Student Membership (Yearly)         Yearly
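
For anyone wondering what actually gets passed to `str_extract` in the code above, here is a small sketch that prints the collapsed pattern. It reuses `pattern_vec` from the answer; the two test strings are taken from the question's data, and the name `single_pattern` is only illustrative.

```r
library(stringr)

pattern_vec <- c("Period Paid: 1", "Period Paid: 2", "Yearly", "2 yr")

# Join the four literals into one regular expression with alternation
single_pattern <- str_c(pattern_vec, collapse = "|")
single_pattern
# "Period Paid: 1|Period Paid: 2|Yearly|2 yr"

# A single pattern is applied to every element of the input
found <- str_extract(
  c("Individual (2 yr)", "Student Membership (Yearly)"),
  single_pattern
)
found
# "2 yr" "Yearly"
```

Because the result is one string, every row is checked against all four alternatives. Note that this treats each entry as a regular expression, so search strings containing characters such as `(` or `.` would need escaping first.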
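Note from the OP:

A post that helped me (the OP) understand the explanation above:

When fed a single pattern, `str_replace_all` compares that pattern against every element. However, if you pass it a vector, it tries to respect the order, so it compares the first pattern with the first object, then the second pattern with the second object.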
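
To make the pairing concrete, here is a minimal sketch on made-up strings (the `x` and `patterns` vectors are illustrative, not from the question). With one pattern per element, each string is tested only against the pattern in the same position, whereas a single alternation is tested against every string.

```r
library(stringr)

# Hypothetical toy data, chosen so the pairing is easy to follow
x <- c("apple pie", "banana split", "apple tart", "banana bread")

# One pattern per element of x: x[i] is only compared with patterns[i]
patterns <- c("apple", "banana", "banana", "apple")

pairwise <- str_extract(x, patterns)
pairwise
# "apple" "banana" NA NA

# A single pattern with alternation is compared with every element
combined <- str_extract(x, "apple|banana")
combined
# "apple" "banana" "apple" "banana"
```

This mirrors rows 3 and 4 of the question: the text is present, but the pattern lined up with that row is a different one. As far as I know, stringr 1.5.0 and later only recycle length-1 vectors, so a pattern vector whose length differs from the number of rows raises an error there instead of silently returning `NA`.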


You can use a more complex regular expression with lookbehind and lookahead:

example$term <-  example$membership %>% 
    str_extract("Period Paid: \\d+|(?<=\\().*(?=\\))")
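
To see how the two halves of that regular expression divide the work, here is a rough sketch that runs each branch separately on three strings from the question. The variable names (`x`, `branch1`, `branch2`, `both`) are only illustrative.

```r
library(stringr)

x <- c(
  "Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G",
  "Individual (2 yr)",
  "Student Membership (Yearly)"
)

# Branch 1: the literal "Period Paid: " followed by one or more digits
branch1 <- str_extract(x, "Period Paid: \\d+")
branch1
# "Period Paid: 1" NA NA

# Branch 2: text preceded by "(" and followed by ")";
# the lookarounds keep the parentheses out of the match
branch2 <- str_extract(x, "(?<=\\().*(?=\\))")
branch2
# NA "2 yr" "Yearly"

# Both branches joined with "|" cover all three strings
both <- str_extract(x, "Period Paid: \\d+|(?<=\\().*(?=\\))")
both
# "Period Paid: 1" "2 yr" "Yearly"
```

One caveat: `.*` is greedy, so a string with several parenthesised groups would match from the first `(` to the last `)`.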

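Can you explain in detail why this one works while the example in the question does not? It is definitely a more elegant solution, but... why does the first version only match some of the results?

My hunch (and it really is only a hunch) is that the first version is a concatenation of as many as four different literal matches, whereas my regular expression targets a single (albeit …), compact, and abstract match. The hunch that this singleness plays a role is supported by @akrun's solution, which also matches everything and also merges the four different patterns into one.

Added a note to @akrun's solution, taken from an RStudio Community post. It explains how the matching works and fully confirms my suspicion that the concatenation is what caused the imperfect matching.
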
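# Recycle pattern_vec explicitly along the rows, one pattern per row
out1 <- example %>%
  mutate(term = str_extract(membership, rep(pattern_vec, length.out = n())))

# Passing pattern_vec directly gives the same pairwise comparison
out2 <- example %>%
  mutate(term = str_extract(membership, pattern_vec))
identical(out1, out2)
#[1] TRUE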



out1
#                                                           membership           term
#1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
#3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G           <NA>
#4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N           <NA>
#5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
#6                                                   Individual (2 yr)           <NA>
#7                                        Individual Producer (Yearly)         Yearly
#8                                         Student Membership (Yearly)           <NA>
example$term <-  example$membership %>% 
    str_extract("Period Paid: \\d+|(?<=\\().*(?=\\))")
example
                                                           membership           term
1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G Period Paid: 1
4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
6                                                   Individual (2 yr)           2 yr
7                                        Individual Producer (Yearly)         Yearly
8                                         Student Membership (Yearly)         Yearly