R: How do I extract the last 3 elements of a sentence for each row of a data frame?

r regex string dataframe

I have the following data frame:

df <- structure(list(matrix.unlist.all_dates...nrow...230..byrow...T. = c(
"Willem F. Duisenberg, President of the European Central Bank, Christian Noyer,  Vice-President of the European Central Bank,  Frankfurt am Main,  14 December 2000", 
"Willem F. Duisenberg,  President of the European Central  Bank,  Christian Noyer,  Vice-President of the European Central Bank,  Frankfurt am Main,  2 November 2000", 
"Willem F. Duisenberg,  President of the European Central  Bank,  Christian Noyer,  Vice-President of the European Central Bank,  Paris,  19 October 2000", 
"Willem F. Duisenberg,  President of the European Central  Bank,  Christian Noyer,  Vice-President of the European Central Bank,  Frankfurt am Main,  5 October 2000", 
"Willem F. Duisenberg,  President of the European Central Bank,  Christian Noyer,  Vice-President of the European Central Bank,  Frankfurt am Main,  14 September 2000", 
"Willem F. Duisenberg,  President of the European Central Bank,  Lucas Papademos,  Vice-President of the European Central Bank,  Frankfurt,  10 July 2003.", 
"Willem F. Duisenberg,  President of the European Central Bank,  Lucas Papademos,  Vice-President of the European Central Bank,    Frankfurt,  5 June 2003."
)), class = "data.frame", row.names = c(NA, -7L))

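As you can see, the text in every row follows a clear pattern, and the last three words are the date. I just want to extract those three words, i.e. the date, from each row.

How would you do it? I tried substr, but since the rows differ in length I did not succeed.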

You can extract the date with a regular expression:

gsub(".* (\\d+ \\w+ \\d+)\\.?$", "\\1", df[, 1])

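The pattern \\d+ \\w+ \\d+ matches

one or more digits (\\d+), followed by a space, followed by one or more word characters (\\w+: letters, digits, or underscore; here the month name), followed by a space, followed by one or more digits (\\d+). The parentheses around it capture the date.

The entire string is then replaced by the date: \\1 refers to whatever the parentheses matched. The leading .* consumes everything before the date, and \\.?$ allows an optional trailing dot at the end of the string.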
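To make the capture-and-replace step concrete, here is a minimal sketch that applies the same pattern to two shortened strings. The vector `x` is abbreviated sample data made up for illustration (same shape as the rows in the question), and `sub()` is used instead of `gsub()` on the assumption that the pattern matches at most once per string.

```r
# Abbreviated sample rows (illustrative only, same shape as the question's data)
x <- c(
  "Vice-President of the European Central Bank, Frankfurt am Main, 14 December 2000",
  "Vice-President of the European Central Bank, Frankfurt, 10 July 2003."
)

# Same pattern as in the answer: everything up to a space, then the captured
# "digits word digits" group, then an optional trailing dot
pattern <- ".* (\\d+ \\w+ \\d+)\\.?$"

# Replace the whole match with the captured group
dates <- sub(pattern, "\\1", x)
print(dates)
# [1] "14 December 2000" "10 July 2003"
```

Note that a string which does not match the pattern is returned unchanged, so it may be worth checking the result for rows that still contain the full sentence.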

Another option is to use the word function from the stringr package, which is part of the tidyverse, to select the last three words directly:

library(stringr)
str_replace_all(word(df[,1], -3, -1), fixed("."), "")
# [1] "14 December 2000"  "2 November 2000"   "19 October 2000"   "5 October 2000"    "14 September 2000" "10 July 2003"      "5 June 2003"

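The str_replace_all function removes the dot that may appear at the end of the string (more precisely, every literal dot in the three extracted words). The fixed helper indicates that "." is a literal dot, not a regex metacharacter.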
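If stringr is not available, the same "take the last three words" idea can be sketched in base R. The helper name `last_three_words` is hypothetical and not part of either answer, and `sample_rows` is abbreviated sample data; the sketch assumes words are separated by runs of whitespace and that at most one trailing dot needs to be dropped.

```r
# Hypothetical helper (not from the answers above):
# return the last three whitespace-separated words of each string
last_three_words <- function(x) {
  x <- sub("\\.$", "", trimws(x))          # trim outer whitespace, drop one trailing dot
  tokens <- strsplit(x, "[[:space:]]+")    # split on runs of whitespace
  vapply(tokens,
         function(tok) paste(utils::tail(tok, 3), collapse = " "),
         character(1))
}

# Abbreviated sample rows (illustrative only)
sample_rows <- c(
  "Vice-President of the European Central Bank,  Paris,  19 October 2000",
  "Vice-President of the European Central Bank,    Frankfurt,  5 June 2003."
)
res <- last_three_words(sample_rows)
print(res)
# [1] "19 October 2000" "5 June 2003"
```

Unlike the regex approach, this does not check that the three words actually look like a date; it simply returns whatever the last three words are.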
