R tidyr regex: extracting ordered numbers from a character column
Suppose I have a data frame like this:
df <- data.frame(x=c("This script outputs 10 visualizations.",
"This script outputs 1 visualization.",
"This script outputs 5 data files.",
"This script outputs 1 data file.",
"This script doesn't output any visualizations or data files",
"This script outputs 9 visualizations and 28 data files.",
"This script outputs 1 visualization and 1 data file."))
x
1 This script outputs 10 visualizations.
2 This script outputs 1 visualization.
3 This script outputs 5 data files.
4 This script outputs 1 data file.
5 This script doesn't output any visualizations or data files
6 This script outputs 9 visualizations and 28 data files.
7 This script outputs 1 visualization and 1 data file.
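Is there a simple way to use the Tidyverse to extract the number of visualizations and the number of files for each row? When there are no visualizations (or no data files, or neither), I would like to extract 0. Basically, I would like the end result to look like this: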
viz files
1 10 0
2 1 0
3 0 5
4 0 1
5 0 0
6 9 28
7 1 1
I tried using something like this:
str_extract(df$x, "(?<=This script outputs )(.*)(?= visualizatio(n\\.$|ns\\.$))")
We can use a regex lookaround in str_extract to extract one or more digits (\\d+) followed by a space and "vis" or "data file(s)" into two columns:
library(dplyr)
library(stringr)
library(tidyr)  # provides replace_na()
df %>%
  transmute(viz = as.numeric(str_extract(x, "\\d+(?= vis)")),
            files = as.numeric(str_extract(x, "\\d+(?= data files?)"))) %>%
  mutate_all(replace_na, 0)  # rows without a match become 0 instead of NA
#  viz files
#1  10     0
#2   1     0
#3   0     5
#4   0     1
#5   0     0
#6   9    28
#7   1     1
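To see what each lookahead pattern picks out on its own, here is a minimal base R sketch. It uses regexpr()/regmatches() with perl = TRUE as a stand-in for str_extract(), and the helper name first_match is made up for illustration:

```r
# Hypothetical helper: first match of a PCRE pattern, or NA when there is none
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

s <- c("This script outputs 9 visualizations and 28 data files.",
       "This script outputs 1 data file.",
       "This script doesn't output any visualizations or data files")

# Digits that are immediately followed by " vis" (the lookahead is not returned)
viz <- first_match(s, "\\d+(?= vis)")
# Digits that are immediately followed by " data file" or " data files"
files <- first_match(s, "\\d+(?= data files?)")

viz    # "9"  NA  NA
files  # "28" "1" NA
```

The lookahead only checks what follows the digits, so the returned match is the number alone; rows with no match come back as NA, which is why a replace-with-0 step is still needed afterwards.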
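In the first case, the pattern matches one or more digits (\\d+) followed by a regex lookahead ((?=...)) that requires a space followed by the word "vis". For the second column, it extracts the digits that are followed by a space and the words "data file" or "data files".
A base R approach: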
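# Lazy .*? keeps the whole number in the capture group; anchoring on the noun
# keeps "5 data files" from being counted as visualizations.
# Rows without a match are returned unchanged by sub(), so as.numeric() gives NA.
df$viz <- suppressWarnings(as.numeric(sub(".*?(\\d+) visualization.*", "\\1", df$x, perl = TRUE)))
df$files <- suppressWarnings(as.numeric(sub(".*?(\\d+) data file.*", "\\1", df$x, perl = TRUE)))
df[is.na(df)] <- 0
df
#                                                             x viz files
# 1                      This script outputs 10 visualizations.  10     0
# 2                        This script outputs 1 visualization.   1     0
# 3                           This script outputs 5 data files.   0     5
# 4                            This script outputs 1 data file.   0     1
# 5 This script doesn't output any visualizations or data files   0     0
# 6     This script outputs 9 visualizations and 28 data files.   9    28
# 7        This script outputs 1 visualization and 1 data file.   1     1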
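One detail worth checking when using sub() with a capture group is how the leading .* behaves. The sketch below (illustrative only, with made-up variable names) compares a greedy and a lazy prefix on the row that contains a two-digit count:

```r
s <- "This script outputs 9 visualizations and 28 data files."

# Greedy: the leading .* swallows the "2", leaving only "8" for (\\d+)
greedy <- sub(".*(\\d+) data file.*", "\\1", s, perl = TRUE)

# Lazy: .*? stops as early as possible, so (\\d+) captures the whole number
lazy <- sub(".*?(\\d+) data file.*", "\\1", s, perl = TRUE)

greedy  # "8"
lazy    # "28"
```

When the pattern does not match at all, sub() returns the input string unchanged, so converting the result with as.numeric() yields NA (with a coercion warning) for those rows.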
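You can use the unglue package to get a readable solution, since you have a limited number of possible patterns, and then replace the NAs with 0:
library(unglue)
patterns <- …
#> 2 1 0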
#> 3 0 5
#> 4 0 1
#> 5 0 0
#> 6 9 28
#> 7 1 1
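Could you please modify it so that it works on both "data file" and "data files"? Oh, I see, I can just replace "data files" with "data file".
I know very little about regex. Would you mind writing a short explanation of what the symbols are doing?
@Euler_Salter I added s? to make sure it still works when there is no s at the end.
Sure, I'll update!
I would like the NAs to be zero, but other than that this looks good.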