Reformatting scraped dates in R with regex


I have scraped some HTML and now have rows like this:

                               rows
1: for the Year Ended 31 March 2013

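I only want to extract the expression "31 March 2013". The text around the expression may vary. I then want to convert the expression to a date format, preferably 31-3-2013. How can I do this?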

If the string contains no other numbers, you can use the following:

string <- "for the Year Ended 31 March 2013"

format(as.Date(sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string), 
               "%d %B %Y"), "%d-%m-%Y")
# [1] "31-03-2013"
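The question asks for 31-3-2013, while `%m` always pads the month with a zero. A minimal sketch of one portable way to drop the padding, assuming the dates have already been parsed into a `Date` vector (the names `d` and `short` are illustrative only):

```r
# Example dates, already parsed (illustrative values)
d <- as.Date(c("2013-03-31", "2011-12-01"))

# as.integer() strips the leading zero that "%d" and "%m" produce
short <- paste(as.integer(format(d, "%d")),
               as.integer(format(d, "%m")),
               format(d, "%Y"),
               sep = "-")
short
# [1] "31-3-2013" "1-12-2011"
```

Building the string from its components avoids platform-specific format flags.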

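Another option:

library(stringr)
library(lubridate)
xx <- "for the Year Ended 31 March 2013"
# Assumes a two-digit day and that the date is the last thing in the string
dmy(str_extract(xx, '[0-9]{2}.*[0-9]{4}$'))
# [1] "2013-03-31 UTC"   (as originally reported; newer lubridate versions print a plain Date)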
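The pattern `[0-9]{2}.*[0-9]{4}$` only matches when the day has two digits and the date ends the string. A base R sketch that relaxes both assumptions and returns `NA` for rows without a date might look like this. The helper name `extract_first_date` and the sample vector `txt` are hypothetical, and `%B` assumes an English `LC_TIME` locale:

```r
# Hypothetical helper: first "day month year" expression in each string
extract_first_date <- function(x) {
  pattern <- "[0-9]{1,2} [A-Za-z]+ [0-9]{4}"
  hit <- regexpr(pattern, x)            # -1 where there is no match
  out <- rep(NA_character_, length(x))
  out[hit > 0] <- regmatches(x, hit)    # regmatches() returns matches only
  # %B parses full month names in the current locale (English assumed)
  as.Date(out, format = "%d %B %Y")
}

txt <- c("for the Year Ended 1 December 2011 (restated)",
         "no date in this row")
dates <- extract_first_date(txt)
format(dates, "%d-%m-%Y")
```

Keeping one result per input row makes it easy to see afterwards which scraped rows had no usable date.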

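Is the date always the last three words? Are there other numbers in the string? Or could you use a regex that gives you two numbers with a word in between?

The problem is that I won't know until I have scraped all the files. But it would be great to allow for the possibility that the date is not the last three words.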
rows <- c("for the Year Ended 31 March 2013 ... 31 March 2013 ...",
          "for the Year Ended 1 December 2011")
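m <- gregexpr("[0-9]+ [A-Za-z]+ [0-9]{4}", rows)  # [A-z] would also match [ \ ] ^ _ `
# If LC_TIME is not English, %B will not match "March"; on Windows you can run:
# Sys.setlocale("LC_TIME", "english")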
(res <- lapply(regmatches(rows, m), as.Date, "%d %B %Y"))
# [[1]]
# [1] "2013-03-31" "2013-03-31"
# 
# [[2]]
# [1] "2011-12-01"
lapply(res, format.Date, "%d-%m-%Y")
# [[1]]
# [1] "31-03-2013" "31-03-2013"
# 
# [[2]]
# [1] "01-12-2011"
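Because `%B` depends on the `LC_TIME` locale, a locale-independent variant can map the month names through base R's `month.name` constant instead. This is only a sketch under the assumption that the scraped text always uses full English month names. The sample vector `txt` is illustrative, and the result is flattened into a single vector rather than kept as one list element per row:

```r
txt <- c("for the Year Ended 31 March 2013 ... 30 June 2012 ...",
         "for the Year Ended 1 December 2011")

pattern <- "[0-9]{1,2} [A-Za-z]+ [0-9]{4}"
found <- unlist(regmatches(txt, gregexpr(pattern, txt)))

# Split each match into day, month name and year (one row per match)
parts <- do.call(rbind, strsplit(found, " "))

# month.name holds the English month names regardless of locale
month_no <- match(parts[, 2], month.name)

# Build ISO strings; unknown month names become NA instead of an error
iso <- sprintf("%s-%02d-%02d", parts[, 3], month_no, as.integer(parts[, 1]))
dates <- as.Date(iso, format = "%Y-%m-%d")
format(dates, "%d-%m-%Y")
# [1] "31-03-2013" "30-06-2012" "01-12-2011"
```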
