Regex: Extracting values between special characters using R


regex r


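I want to extract the values between [ and ] and put the extracted values in a new column, col2.

I'm not opposed to using stringr instead of base R.

Sample data: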

df <- structure(list(t = structure(1:2, .Label = c("v1", "v2"), class = "factor"), 
    d = structure(1:2, .Label = c("something[123,894]", "something[456,4834]"
    ), class = "factor")), .Names = c("t", "d"), row.names = c(NA, 
-2L), class = "data.frame")

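Now I want to create a new column, df$r, that holds 123 for v1 and 456 for v2.

I believe this is easy to do with a regular expression that searches for [ and ], but I'm not good with regular expressions.

Thanks for your help.

-cherrytree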

df <- structure(list(t = structure(1:2, .Label = c("v1", "v2"), class = "factor"), 
                     d = structure(1:2, .Label = c("something[123,894]", "something[456,4834]"
                     ), class = "factor")), .Names = c("t", "d"), row.names = c(NA, 
                                                                                -2L), class = "data.frame")

Also, if you want to capture the second string of digits after the comma, this is more useful:

gsub('.*\\[(\\d+),(\\d+).*', '\\1', df$d)
# [1] "123" "456"
gsub('.*\\[(\\d+),(\\d+).*', '\\2', df$d)
# [1] "894"  "4834"
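
To get the column the question asks for, the first capture group can be assigned straight to df$r. This is a minimal sketch, assuming the same sample data as in the question; the as.integer() conversion is an optional addition and not part of the original answer.

```r
df <- data.frame(
  t = factor(c("v1", "v2")),
  d = factor(c("something[123,894]", "something[456,4834]"))
)

# sub() is enough here because each string is matched once, as a whole.
# \\d+ accepts any number of digits, so 1- to 4-digit values work too.
df$r <- sub('.*\\[(\\d+),.*', '\\1', df$d)

# Optional: store the result as a number instead of a character string.
df$r <- as.integer(df$r)
df$r
# [1] 123 456
```

Since only the first number is needed, the second group can be dropped from the pattern, which is why this version stops at the comma.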

Or, if you want to do both at once:

cbind(df, do.call('rbind', lapply(strsplit(as.character(df$d), ','),
                                  function(x) gsub('\\D', '', x))))

#    t                   d   1    2
# 1 v1  something[123,894] 123  894
# 2 v2 something[456,4834] 456 4834
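
The cbind() version leaves the new columns named 1 and 2. As an alternative sketch (not from the original answer), base R's regexec() and regmatches() can pull both capture groups in one pass and let you name the columns yourself; r1 and r2 are hypothetical names chosen for illustration.

```r
df <- data.frame(
  t = factor(c("v1", "v2")),
  d = factor(c("something[123,894]", "something[456,4834]"))
)

s <- as.character(df$d)

# regexec() returns the full match plus one entry per capture group;
# regmatches() turns those positions into the matched substrings.
m <- regmatches(s, regexec('\\[(\\d+),(\\d+)\\]', s))

# Drop the full match (element 1) and keep the two groups as rows.
# A string with no match yields character(0), so x[2:3] becomes NA, NA.
parts <- do.call(rbind, lapply(m, function(x) x[2:3]))

df$r1 <- as.integer(parts[, 1])  # 123, 456
df$r2 <- as.integer(parts[, 2])  # 894, 4834
```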

An explanation of the pattern, better than I could give:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  \[                       '['
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  ,                        ','
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
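
One consequence of the pattern above is worth checking on your own data: because the leading and trailing .* make it cover the whole string, gsub() replaces everything with the captured group when it matches, but returns the input unchanged when it does not. The sketch below uses a hypothetical vector x with one non-matching entry; the grepl() guard is an assumption about how you might want to handle such rows, not part of the original answer.

```r
x <- c("something[7,12]", "something[456,4834]", "no brackets")

# When the pattern does not match, gsub() returns the string unchanged.
gsub('.*\\[(\\d+),(\\d+).*', '\\1', x)
# [1] "7"           "456"         "no brackets"

# One way to get NA instead: keep the result only where grepl() finds a match.
first <- ifelse(grepl('\\[\\d+,\\d+', x),
                sub('.*\\[(\\d+),.*', '\\1', x),
                NA_character_)
first
# [1] "7"   "456" NA
```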

Comments:

Ah, you only want the first string? I did both. I think one of them is easier: df$r …

@rawr, this is a nice solution, though I wonder whether the OP will always have 3 digits in the first number inside the brackets.

Good question @DavidArenburg, it doesn't always have 3 digits... it can vary from 1 to 4 digits.

@cherrytree Then replace {3} with a +.

@rawr, I think you should post it.