Regex R - Regular-expression text extraction that recognizes numbers with more than one digit


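I am trying to use a combination of gregexpr and substr to extract information from strings. Each string contains a phrase that starts with a word and ends with a number (sometimes greater than 9).

Here is the list of strings: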

y = c("Hearing #3: The document states in Article ABC 3 Section 9 line 10 that...",
  "Hearing #3: The document states in Article ABC 31 Section 9 that...",
  "Hearing #3: The document states in Article ABC 3.1 Section 9 that...")

First, I dropped everything before the word Article, which is where the phrase I am interested in begins:

z = substr(y, gregexpr("Article", y)[[1]][1], nchar(y))

> z
[1] "Article ABC 3 Section 9 line 10 that..."   "Article ABC 31 Section 9 that..."  "Article ABC 3.1 Section 9 that..."

So far so good, but now I need to identify the first whole number (not just the first digit) after the word Article:

> substr(z, 0, regexpr(pattern='[0-9]', z)[1][1])
[1] "Article ABC 3" "Article ABC 3" "Article ABC 3"

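That falls short, since it stops at the first digit, so I tried to find a way to do it by locating the digits with another gregexpr call: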

gregexpr(pattern='[0-9]', z)

I don't know how to proceed from here, and I am not even sure I am going about this the right way.

The desired output is:

[1] "Article ABC 3" "Article ABC 31" "Article ABC 3.1"

We can use str_extract from stringr to extract the substring from "Article" up to and including the numeric part:

 library(stringr)
 str_extract(y, 'Article[^0-9]*[0-9.]+')
 #[1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"

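Alternatively, with sub, we match Article, followed by zero or more non-digit characters ([^0-9]*), followed by one or more digits or dots ([0-9.]+). Wrapping that pattern in parentheses makes it a capture group, which can then be referenced in the replacement as \\1.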
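As a minimal sketch of the sub approach just described, the call might look like the following. The leading and trailing .* are an assumption about how the rest of each string gets discarded, and res_sub is only an illustrative name.

```r
y <- c("Hearing #3: The document states in Article ABC 3 Section 9 line 10 that...",
       "Hearing #3: The document states in Article ABC 31 Section 9 that...",
       "Hearing #3: The document states in Article ABC 3.1 Section 9 that...")

# .* on both sides consumes the text around the capture group, so replacing
# the whole match with \\1 leaves only the captured phrase.
res_sub <- sub(".*(Article[^0-9]*[0-9.]+).*", "\\1", y)
res_sub
# [1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"
```

Note that sub returns a string unchanged when the pattern does not match, so strings without "Article" would pass through as-is.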

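You can fix your own approach by adding a negated character class after the digit you are looking for, so that the match lands on the last digit of the number rather than the first: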

substr(z, 0, regexpr('[0-9][^0-9.]', z))
# [1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"
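To see why this works, it may help to look at the positions regexpr returns. The sketch below rebuilds z by hand so that it runs on its own; end_pos and res_substr are illustrative names.

```r
z <- c("Article ABC 3 Section 9 line 10 that...",
       "Article ABC 31 Section 9 that...",
       "Article ABC 3.1 Section 9 that...")

# Position of the first digit that is followed by something other than
# a digit or a dot, i.e. the last digit of the first number.
end_pos <- as.integer(regexpr("[0-9][^0-9.]", z))
end_pos
# [1] 13 14 15

# Cut each string at its own end position.
res_substr <- substr(z, 1, end_pos)
res_substr
# [1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"
```

One caveat: the pattern needs a character after the number, so a string that ends right after the number would not match and regexpr would return -1 for it.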

For this task, though, sub is much simpler:

sub('.*(Article\\D*[0-9.]+).*', '\\1', y)
# [1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"
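One thing to keep in mind is that [0-9.]+ also accepts a trailing dot, so a phrase such as "Article ABC 3." at the end of a sentence would be captured with the period. A possible variation, sketched here with base R's regexpr and regmatches, only allows a dot when digits follow it. The fourth test string and the names strict and res_strict are made up for illustration.

```r
y <- c("Hearing #3: The document states in Article ABC 3 Section 9 line 10 that...",
       "Hearing #3: The document states in Article ABC 31 Section 9 that...",
       "Hearing #3: The document states in Article ABC 3.1 Section 9 that...",
       "Hearing #3: See Article ABC 3. Then continue.")  # hypothetical extra case

# Digits, optionally followed by a dot and more digits.
strict <- "Article[^0-9]*[0-9]+(\\.[0-9]+)?"

# regmatches returns the matched substrings (non-matching strings are dropped).
res_strict <- regmatches(y, regexpr(strict, y))
res_strict
# [1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1" "Article ABC 3"
```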
