Regex to split a text string in R

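I have a long string, like the example below, and I am struggling to find a regex that splits it into parts according to a pattern such as '1. OAS / AC' and '2. OAS / AD'.

This piece of text has:

1) a varying number at the start

2) two capital letters from A to Z

I tried this: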

x <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")

x

We can do this with a positive lookahead that looks for one or more digits followed by a period:

str_split(have, "(?=\\d+\\.)")

[1] ""                                                             "1. OAS / AC 12345/this is a test string to regex, "          
[3] "2. OAS / AD     79856/this is another test string to regex, " "3. OAS / AE 87987/this is a new test string to regex. "      
[5] "4. OAS / AZ 78798456/this is one mode test string to regex."

We can clean this up further by dropping the empty first element:

str_split(have, "(?=\\d{1,2}\\.)") %>% unlist() %>% .[-1]

[1] "1. OAS / AC 12345/this is a test string to regex, "           "2. OAS / AD     79856/this is another test string to regex, "
[3] "3. OAS / AE 87987/this is a new test string to regex. "       "4. OAS / AZ 78798456/this is one mode test string to regex." 
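
One caveat with a lookahead such as (?=\\d+\\.) is that it can also match between the digits of a multi-digit number such as 12. The following is a minimal base R sketch of one way around that, under the assumption that every piece starts with a label of the form "N. OAS / XX". It finds where each label starts and cuts the string at those positions. The sample string and the names label, starts, ends and parts are made up for illustration and are not the asker's data.

```r
# Hypothetical sample input with two-digit labels (not the asker's data).
have <- "9. OAS / AC 12345/first record, 10. OAS / AD     79856/second record. 11. OAS / AZ 42/last record."

# A label: a run of digits not preceded by another digit, then ". OAS / XX".
# The negative lookbehind keeps "10." from being matched starting at "0.".
label <- "(?<!\\d)\\d+\\. OAS / [A-Z]{2}"

# Start position of every label (assumes at least one label is present).
starts <- as.integer(gregexpr(label, have, perl = TRUE)[[1]])

# Each piece runs from one label start to just before the next label start.
ends <- c(starts[-1] - 1L, nchar(have))
parts <- trimws(substring(have, starts, ends))
print(parts)
```

Because the pieces are cut by position rather than by a zero-width split, there is no empty first element to drop afterwards.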

You can use str_match_all to capture each label and the text that follows it:

library(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD     79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
# Group 1 captures the label, group 2 the text up to the next label or the end of the string
r <- stringr::str_match_all(have, "(\\d+\\. OAS / [A-Z]{2})\\s*(.*?)(?=\\s*\\d+\\. OAS / [A-Z]{2}|\\z)")
# Column 1 of the match matrix is the full match; columns 2 and 3 hold the two groups
res <- r[[1]][,3]
names(res) <- r[[1]][,2]

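Pattern details

  • (\d+\. OAS / [A-Z]{2})
    - Capturing group 1:
    • \d+
      - 1+ digits
    • \.
      - a period
    •  OAS / 
      - a literal " OAS / " substring, including the spaces around it
    • [A-Z]{2}
      - two uppercase letters
  • \s*
    - 0+ whitespaces
  • (.*?)
    - Capturing group 2: any 0+ chars other than line break chars, as few as possible
  • (?=\s*\d+\. OAS / [A-Z]{2}|\z)
    - a positive lookahead: immediately to the right of the current location, there must be
    • \s*\d+\. OAS / [A-Z]{2}
      - 0+ whitespaces, 1+ digits, a period, a space, OAS, a space, /, a space, two uppercase letters
    • |
      - or
    • \z
      - end of string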
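
The same pattern can also be tried without stringr. What follows is only a base R sketch of the idea, not the answerer's code: it takes each whole record with a lazy match that stops at the lookahead, then separates the label from the text. The sample string and the names label, record, records and parsed are hypothetical.

```r
# Hypothetical sample input (shortened; not the asker's data).
have <- "1. OAS / AC 12345/first record, 2. OAS / AD     79856/second record. 13. OAS / AZ 42/last record."

# Label pattern, reused inside the lookahead that ends each record.
label <- "\\d+\\. OAS / [A-Z]{2}"
record <- paste0(label, ".*?(?=\\s*", label, "|\\z)")

# All records: a label plus as little text as possible before the next label or the end.
records <- regmatches(have, gregexpr(record, have, perl = TRUE))[[1]]

# Split every record into its label and the remaining text.
labels <- regmatches(records, regexpr(label, records, perl = TRUE))
texts <- trimws(sub(label, "", records, perl = TRUE))
parsed <- setNames(texts, labels)
print(parsed)
```

The lazy .*? matters here: a greedy .* would run to the end of the string and return everything as a single record.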

Printing res from the str_match_all code with dput shows the result:

dput(res)
# => structure(c("12345/this is a test string to regex,", "79856/this is another test string to regex,", 
#  "87987/this is a new test string to regex.", "78798456/this is one mode test string to regex."
#  ), .Names = c("1. OAS / AC", "2. OAS / AD", "3. OAS / AE", "4. OAS / AZ"
#  ))

Try stringr::str_match_all(have, "(\\d+\\. OAS / [A-Z]{2})\\s*(.*?)(?=\\s*\\d+\\. OAS / [A-Z]{2}|\\z)")

Hi @WiktorStribiżew. I was far from getting to a solution like this. Thank you very much for your help.

Glad it worked for you. Please consider accepting the answer by clicking the ✓ on its left, and upvoting it if it helped you.

The way you describe the problem is a bit unclear, but if you only want to extract "OAS / AC":

library(qdap)
beg2char(have, " ", 4) # looks for the fourth occurrence of \\s and extracts everything before it

For the above function to work, the sentence should be a single string in a character vector.

If your goal is to insert an "=" sign between the two-letter substring and the digits that appear after "OAS":

gsub("([A-Z])\\s*([0-9])", "\\1 = \\2", have, perl = TRUE)
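
The gsub pattern above matches any capital letter followed by a digit, so it may also touch text elsewhere in the string. As a hedged sketch, assuming the "=" should only go after an "OAS / XX" label, the pattern can be tied to that label. The sample string and the names x, loose and strict are made up for illustration.

```r
# Hypothetical sample input; "B2" is there to show the difference.
x <- "1. OAS / AC 12345/test, 2. OAS / AD     79856/test B2 stays"

# Loose pattern: any capital letter, optional whitespace, then a digit.
loose <- gsub("([A-Z])\\s*([0-9])", "\\1 = \\2", x, perl = TRUE)

# Stricter pattern: only whitespace that follows an "OAS / XX" label.
strict <- gsub("(OAS / [A-Z]{2})\\s+(\\d)", "\\1 = \\2", x, perl = TRUE)

print(loose)
print(strict)
```

Both calls use backreferences (\\1 and \\2) to put the captured text back around the inserted " = ". Only the loose version rewrites "B2".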