R: How can I use regular expressions to extract specific string patterns inside a case_when statement?
Consider the following reproducible dataset, which I created from the Donald Trump Twitter dataset.

What I tried:

I did manage to create the desired output with the following code:
df %>%
  mutate(new_var = case_when(
    str_detect(target, "^jeb-[a-z]+$") ~
      str_extract(target, "(?<=[a-z]{3}-)[a-z]+"),
    str_detect(target, "^jeb-[a-z]+-[a-z]+") ~
      str_extract(target, "(?<=[a-z]{3}-[a-z]{4}-)[a-z]+"),
    TRUE ~ "other"))
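However, when I tried the same thing with capture groups via str_match(), I received the following error message: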
Error: Problem with `mutate()` input `new_var`.
x `str_detect(target, "^jeb-[a-z]+$") ~ str_match(target, "jeb-([a-z]+)")`, `str_detect(target, "^jeb-[a-z]+-[a-z]+") ~ str_match(target,
"jeb-[a-z]{4}-([a-z]+)")` must be length 10 or one, not 20.
i Input `new_var` is `case_when(...)`.
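Question:

Ultimately, I would like to know whether there is a concise way to extract specific string patterns inside a case_when statement. How can I solve the problem described here when I can use neither lookarounds (because I cannot define a bounded maximum length) nor capture groups (because str_match returns a result of length 20 rather than a vector of the original size 10 or 1)?

One option is to check in case_when whether the target column starts (^) with the substring "jeb-", then extract the characters that are not a - ([^-]+) at the end ($) of the string, and otherwise (TRUE) return "other":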
library(dplyr)
library(stringr)
df %>%
  mutate(new_var = case_when(
    str_detect(target, '^jeb-') ~ str_extract(target, '[^-]+$'),
    TRUE ~ 'other'))
Output:
# A tibble: 10 x 3
#   target              tweet_id new_var
#   <chr>                  <dbl> <chr>
# 1 jeb-bush                   1 bush
# 2 jeb-bush                   2 bush
# 3 jeb-bush-supporters        3 supporters
# 4 jeb-bush-supporters        4 supporters
# 5 jeb-staffer                5 staffer
# 6 the-media                  6 other
# 7 the-media                  7 other
# 8 the-media                  8 other
# 9 the-media                  9 other
#10 the-media                 10 other
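Since the original dataset is not shown here, the following is a minimal sketch for trying the approach above. The df below is a hypothetical stand-in rebuilt from the printed output, not the asker's actual data, and result is just an illustrative name.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the original data, rebuilt from the printed output
df <- tibble(
  target = c(rep("jeb-bush", 2), rep("jeb-bush-supporters", 2),
             "jeb-staffer", rep("the-media", 5)),
  tweet_id = as.numeric(1:10)
)

result <- df %>%
  mutate(new_var = case_when(
    # rows starting with "jeb-": keep everything after the last hyphen
    str_detect(target, "^jeb-") ~ str_extract(target, "[^-]+$"),
    # every other row
    TRUE ~ "other"
  ))

print(result)
```

Because [^-]+$ anchors at the end of the string, the same pattern works regardless of how many hyphen-separated parts follow "jeb-".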
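Thanks a lot. One question: what does the ? do in the regular expression used in the coalesce() solution?

It makes the preceding .* lazy (non-greedy), so that it matches as few characters as possible. You can look it up.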
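To make the effect of the lazy quantifier concrete, here is a small sketch comparing the lazy and greedy versions of the pattern. The sample vector x and the names lazy and greedy are hypothetical and only serve the illustration.

```r
library(stringr)

# Hypothetical sample values in the same shape as the target column
x <- c("jeb-bush", "jeb-bush-supporters")

# Lazy `.*?` consumes as little as possible, so the capture group
# receives the whole final hyphen-free segment
lazy <- str_match(x, "^jeb-.*?([^-]+)$")[, 2]

# Greedy `.*` consumes as much as possible, leaving only the single
# character that the capture group needs in order to match
greedy <- str_match(x, "^jeb-.*([^-]+)$")[, 2]

print(lazy)    # "bush" "supporters"
print(greedy)  # "h" "s"
```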
For reference, this is the str_match() attempt that triggers the error:

df %>%
  mutate(new_var = case_when(
    str_detect(target, "^jeb-[a-z]+$") ~
      str_match(target, "jeb-([a-z]+)"),
    str_detect(target, "^jeb-[a-z]+-[a-z]+") ~
      str_match(target, "jeb-[a-z]+-([a-z]+)"),
    TRUE ~ "other"))
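The "must be length 10 or one, not 20" message is consistent with str_match() returning a character matrix with one column for the full match and one per capture group, so 10 rows times 2 columns gives 20 elements. A minimal sketch of one possible fix, selecting the capture-group column with [, 2], is shown below. The vector x is a hypothetical sample, and case_when() is called directly on it rather than inside mutate().

```r
library(dplyr)
library(stringr)

# Hypothetical sample values in the same shape as the target column
x <- c("jeb-bush", "jeb-bush-supporters", "the-media")

# str_match() returns a matrix: column 1 is the full match,
# column 2 is the first capture group
m <- str_match(x, "jeb-([a-z]+)")
print(dim(m))  # 3 rows, 2 columns

# Taking column 2 gives a plain vector of the right length for case_when()
out <- case_when(
  str_detect(x, "^jeb-[a-z]+$") ~
    str_match(x, "jeb-([a-z]+)")[, 2],
  str_detect(x, "^jeb-[a-z]+-[a-z]+") ~
    str_match(x, "jeb-[a-z]+-([a-z]+)")[, 2],
  TRUE ~ "other"
)
print(out)  # "bush" "supporters" "other"
```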
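Alternatively, use a capture group with str_match() and let coalesce() fill in "other" for rows that do not match:

df %>%
  mutate(new_var = coalesce(str_match(target, '^jeb-.*?([^-]+)$')[, 2], 'other'))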
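The coalesce() variant can be read in two steps, sketched below on a hypothetical sample vector x: str_match() yields NA for strings that do not match the pattern, and coalesce() then replaces each NA with the fallback value. The intermediate name captured is only for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical sample values in the same shape as the target column
x <- c("jeb-bush", "jeb-staffer", "the-media")

# Step 1: take the capture-group column; non-matching strings give NA
captured <- str_match(x, "^jeb-.*?([^-]+)$")[, 2]
print(captured)  # "bush" "staffer" NA

# Step 2: replace each NA with the fallback label
filled <- coalesce(captured, "other")
print(filled)    # "bush" "staffer" "other"
```

This avoids case_when() entirely when there is only one pattern and one fallback value.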