R: Extract bigrams with zero-width lookaheads


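I want to extract bigrams from sentences using the regex described, and store the output in a new column that references the original sentence.

So I suspect the problem has to do with using a zero-width lookahead around a capturing group. Is there a valid regex in R that allows these bigrams to be extracted?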

As @WiktorStribiżew suggested, using str_match_all helps here. Here is how to apply it to multiple rows of a data frame. Let

(df <- data.frame(a = c("one two three", "four five six")))
#               a
# 1 one two three
# 2 four five six

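Then

library(dplyr)
library(stringr)

df %>% rowwise() %>% 
  do(data.frame(., b = str_match_all(.$a, "(?=(\\b\\w+\\s+\\w+))")[[1]][, 2], stringsAsFactors = FALSE))
# Source: local data frame [4 x 2]
# Groups: <by row>
#
# A tibble: 4 x 2
#   a             b        
# * <fct>         <chr>    
# 1 one two three one two  
# 2 one two three two three
# 3 four five six four five
# 4 four five six five six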

where stringsAsFactors = FALSE is only there to avoid warnings from binding the rows.
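For what it's worth, do() is superseded in recent dplyr releases, so here is a rough sketch of the same idea using a list column and tidyr::unnest(). This is not from the original answer: the use of tidyr, the name res, and the pattern variable are my own assumptions for illustration.

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame(a = c("one two three", "four five six"),
                 stringsAsFactors = FALSE)

# Same lookahead as above; group 1 holds the bigram
pattern <- "(?=(\\b\\w+\\s+\\w+))"

res <- df %>%
  # str_match_all() returns one matrix per row; keep column 2 (group 1)
  mutate(b = lapply(str_match_all(a, pattern), function(m) m[, 2])) %>%
  # expand the list column so each bigram gets its own row
  unnest(b)

print(res)
```

Rows whose sentence contains no bigram are dropped by unnest() by default; pass keep_empty = TRUE if you want to keep them.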

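str_extract_all will lose all captured submatches. You need str_match_all. Note that the first column is always empty, because the overall match is always empty, but the Group 1 values end up in [, 2].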
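To make the difference concrete, here is a small sketch on a single string. The variable names (x, pattern, extracted, matched, bigrams) are just for illustration.

```r
library(stringr)

x <- "one two three"
pattern <- "(?=(\\b\\w+\\s+\\w+))"

# str_extract_all() only returns the overall match, which is zero-width here
extracted <- str_extract_all(x, pattern)[[1]]

# str_match_all() returns a matrix per input string:
# column 1 is the overall match, column 2 is capture group 1
matched <- str_match_all(x, pattern)[[1]]
bigrams <- matched[, 2]

print(bigrams)
```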

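The failing attempt from the question:

library(dplyr)
library(stringr)
library(splitstackshape)  # for cSplit()

# Bigrams - Fails: both calls return only the overall (zero-width) match,
# so the captured group is lost
df %>%
  # Base R
  mutate(b =  sapply(regmatches(a,gregexpr("(?=(\\b\\w+\\s+\\w+))", a, perl = TRUE)),
                     paste, collapse=";")) %>%
  # Duplicate with Stringr
  mutate(c =  sapply(str_extract_all(a,"(?=(\\b\\w+\\s+\\w+))"),paste, collapse=";")) %>%
  cSplit(., c(2,3), sep = ";", direction = "long")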
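If you would rather stay in base R, the captured group can still be recovered. regmatches() only looks at the overall match, but with perl = TRUE gregexpr() also attaches capture.start and capture.length attributes. A minimal sketch, assuming a hypothetical helper named extract_bigrams (not part of the original answer):

```r
# Hypothetical helper: pull group 1 out of a lookahead match in base R
extract_bigrams <- function(x) {
  m <- gregexpr("(?=(\\b\\w+\\s+\\w+))", x, perl = TRUE)[[1]]
  if (m[1] == -1) return(character(0))  # no match at all
  # With perl = TRUE, start and length of each capture group are attributes
  start <- as.vector(attr(m, "capture.start"))
  len <- as.vector(attr(m, "capture.length"))
  substring(x, start, start + len - 1)
}

a <- c("one two three", "four five six")
bigrams <- lapply(a, extract_bigrams)

# One row per bigram, repeating the original sentence
res <- data.frame(a = rep(a, lengths(bigrams)),
                  b = unlist(bigrams),
                  stringsAsFactors = FALSE)
print(res)
```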
