Why does 'str_extract' only capture some of the values?


I have a table with a "membership type" column that contains the countless different membership levels we have used over the years:

example <-data.frame(membership = c( "Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N", 
                              "Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N", 
                              "Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G",
                              "Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N", 
                              "Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N ", 
                              "Individual (2 yr)",
                              "Individual Producer (Yearly)",
                              "Student Membership (Yearly)"  ))

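I tried to pull the term out with str_extract, but it only captures about half of the values, and I cannot find a pattern in what it skips: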

1   Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
2   Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N  Period Paid: 2
3   Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G  NA
4   Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N  NA
5   Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
6   Legacy Payment ID #5238, Payment Record #0, Period Paid: 1 Flag: N  NA
7   Legacy Payment ID #5287, Payment Record #0, Period Paid: 1 Flag: N  NA
8   Legacy Payment ID #5306, Payment Record #0, Period Paid: 1 Flag: N  NA
9   Legacy Payment ID #5739, Payment Record #0, Period Paid: 2 Flag: G  NA
10  Individual (2 yr)                                                   NA
11  Individual Producer (Yearly)                                        Yearly
12  Student Membership (Yearly)                                         NA

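The only difference between rows 4 and 5 is the payment ID. Why does it find the search value only in row 5?

How can I fix it? And, more importantly, what is the cause?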

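We can collapse the patterns into a single regular expression, separated by |, and pass that to str_extract: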

library(stringr)
library(dplyr)
pattern_vec <- c("Period Paid: 1","Period Paid: 2","Yearly", "2 yr")
example%>% 
      mutate(term = str_extract(membership,
      str_c(pattern_vec, collapse="|")))
#                                                       membership           term
#1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
#3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G Period Paid: 1
#4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
#6                                                   Individual (2 yr)           2 yr
#7                                        Individual Producer (Yearly)         Yearly
#8                                         Student Membership (Yearly)         Yearly
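To see why collapsing helps, it can be useful to look at the pattern that str_c() actually builds. The following is a minimal sketch that reuses pattern_vec from above; the input vector x is made up for illustration and is not part of the original data.

```r
library(stringr)

pattern_vec <- c("Period Paid: 1", "Period Paid: 2", "Yearly", "2 yr")

# collapse = "|" joins the four literals into ONE regular expression
single_pattern <- str_c(pattern_vec, collapse = "|")
print(single_pattern)
# "Period Paid: 1|Period Paid: 2|Yearly|2 yr"

# x is a made-up input, for illustration only
x <- c("Period Paid: 2 Flag: G", "Student Membership (Yearly)", "no term here")

# The single pattern is tried against every element of x
result <- str_extract(x, single_pattern)
print(result)
# "Period Paid: 2" "Yearly" NA
```

Because the pattern has length 1, every alternative is tried against every row, so no row is skipped merely because of its position.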

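Note from the OP:

A post that helped me (the OP) understand the explanation above:

When fed a single pattern, str_replace_all compares that pattern against every element. However, if you pass it a vector, it tries to respect the order, so it compares the first pattern with the first object, then the second pattern with the second object.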
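The pairwise behaviour described in that post can be sketched with str_extract as well. The strings and patterns below are made up for illustration, and the recycling is spelled out with rep() so that the pairing is visible and the snippet does not depend on how a given stringr version handles vectors of different lengths.

```r
library(stringr)

# Made-up strings and patterns, for illustration only
x <- c("Yearly plan", "2 yr plan", "2 yr plan", "Yearly plan")
patterns <- c("Yearly", "2 yr")

# Pair pattern i with element i, recycling the patterns explicitly
paired <- rep(patterns, length.out = length(x))
pairwise <- str_extract(x, paired)
print(data.frame(x, paired, pairwise))
# Elements 3 and 4 give NA: each is compared only with the "wrong" pattern

# One combined pattern is tried against every element instead
combined <- str_extract(x, str_c(patterns, collapse = "|"))
print(combined)
```

Older stringr releases performed this recycling silently. As far as I know, stringr 1.5.0 and later only recycle length-1 vectors and raise an error when the lengths differ, which makes this mistake easier to spot.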

Alternatively, you can use a more sophisticated regular expression with a lookbehind and a lookahead:

example$term <-  example$membership %>% 
    str_extract("Period Paid: \\d+|(?<=\\().*(?=\\))")
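The pieces of that regular expression can be read off one at a time. The sketch below uses made-up input strings; the second pattern, with [^)]* in place of .*, is a hypothetical variant that is not part of the original answer and only matters when a string contains more than one pair of parentheses.

```r
library(stringr)

# Made-up input, for illustration only
x <- c("Individual (2 yr)", "Period Paid: 12 Flag: N", "A (x) and (y)")

# Period Paid: \\d+   literal text followed by one or more digits
# |                   or
# (?<=\\()            lookbehind: the match must be preceded by "("
# .*                  greedy: as many characters as possible
# (?=\\))             lookahead: the match must be followed by ")"
greedy <- str_extract(x, "Period Paid: \\d+|(?<=\\().*(?=\\))")
print(greedy)
# "2 yr" "Period Paid: 12" "x) and (y"

# Hypothetical variant: [^)]* stops at the first closing parenthesis
first_only <- str_extract(x, "Period Paid: \\d+|(?<=\\()[^)]*(?=\\))")
print(first_only)
# "2 yr" "Period Paid: 12" "x"
```

For the membership strings in the question, which contain at most one pair of parentheses, both patterns should return the same result.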

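Can you explain in detail why this works when the example in the question does not? It is definitely a more elegant solution, but... why does the first version only match some of the results?

My intuition (and it really is only an intuition) is that the first version is a concatenation of up to four different literal matches, whereas my regular expression targets a single (albeit with ...), compact, and abstract match. My hunch is that the singularity is what matters, which is supported by @akrun's solution: it also matches everything, and it also merges the four different patterns into one.

I added a note to @akrun's solution, taken from an RStudio Community post, that explains how the matching works. It fully confirms my suspicion that the concatenation is what causes the incomplete matches.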

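# Passing pattern_vec directly pairs pattern i with row i and recycles the
# 4 patterns over the 8 rows, which is exactly what rep() spells out here:
out1 <- example %>% 
      mutate(term = str_extract(membership, rep(pattern_vec, length.out = n())))

# Note: this implicit recycling relies on older stringr behaviour; stringr 1.5.0
# and later are expected to raise a recycling error here instead.
out2 <- example %>% 
            mutate(term = str_extract(membership,  pattern_vec))
identical(out1, out2)
#[1] TRUE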



out1
#                                                           membership           term
#1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
#3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G           <NA>
#4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N           <NA>
#5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
#6                                                   Individual (2 yr)           <NA>
#7                                        Individual Producer (Yearly)         Yearly
#8                                         Student Membership (Yearly)           <NA>

example$term <-  example$membership %>% 
    str_extract("Period Paid: \\d+|(?<=\\().*(?=\\))")

example
                                                           membership           term
1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G Period Paid: 1
4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
6                                                   Individual (2 yr)           2 yr
7                                        Individual Producer (Yearly)         Yearly
8                                         Student Membership (Yearly)         Yearly