R: Extracting part of a filename


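I'm trying to extract part of a filename using R. I have a vague idea of how to go about it from here, but I have not been able to make it work on my list of filenames.

Example filenames: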

"Species Count (2011-12-15-07-09-39).xls"
"Species Count 0511.xls"
"Species Count 151112.xls" 
"Species Count1011.xls" 
"Species Count2012-01.xls" 
"Species Count201207.xls" 
"Species Count2013-01-15.xls"

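Some filenames have a space between "Species Count" and the date and some do not; they differ in length, and some contain parentheses. I only want to extract the numeric part of each filename, keeping the -. For example, for the data above I would get:

Expected output: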

Here is one way:

regmatches(tt, regexpr("[0-9].*[0-9]", tt))

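I'm assuming there are no other digits in your filenames. So we simply search for the first digit and use the greedy operator .* so that everything up to the last digit is captured. This is done with regexpr, which returns the position of the match. We then use regmatches to extract the (sub)strings at those matched positions.

where tt is: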

tt <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls", 
        "Species Count 151112.xls", "Species Count1011.xls", 
        "Species Count2012-01.xls", "Species Count201207.xls", 
        "Species Count2013-01-15.xls")
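To make the two steps described above concrete, here is a minimal sketch that inspects what regexpr returns before regmatches is applied. The extra entry "README.txt" is a hypothetical filename added only to show what happens when there is no match: regmatches silently drops non-matching elements, so the result can be shorter than the input.

```r
# Two filenames from the question plus one hypothetical file with no digits
files <- c("Species Count (2011-12-15-07-09-39).xls",
           "Species Count 0511.xls",
           "README.txt")

# regexpr() returns the start of the first match per element (-1 if none);
# the match lengths are stored in the "match.length" attribute
m <- regexpr("[0-9].*[0-9]", files)
starts  <- as.integer(m)
lengths <- attr(m, "match.length")

# regmatches() returns only the matched substrings (non-matches are dropped)
matched <- regmatches(files, m)

# To keep the result aligned with the input, fill non-matches with NA
hit <- starts != -1L
aligned <- rep(NA_character_, length(files))
aligned[hit] <- matched
print(aligned)
```

If every filename is guaranteed to contain a date, the alignment step is unnecessary and the one-liner above is enough.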

Use the function gsub() to remove all letters, spaces, periods and parentheses. You are then left with the digits and hyphens. For example:
x <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls", 
    "Species Count 151112.xls", "Species Count1011.xls", "Species Count2012-01.xls", 
    "Species Count201207.xls", "Species Count2013-01-15.xls")

# [A-Za-z] rather than [A-z]: in ASCII the range A-z also covers [ \ ] ^ _ and `
gsub("[A-Za-z \\.\\(\\)]", "", x)

[1] "2011-12-15-07-09-39" "0511"                "151112"             
[4] "1011"                "2012-01"             "201207"             
[7] "2013-01-15"         
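A variation on the same idea, offered as a sketch rather than as part of the original answer: instead of listing every character to delete, negate the set of characters to keep. The last filename below is hypothetical and is included only to show that underscores and square brackets are removed as well.

```r
x <- c("Species Count (2011-12-15-07-09-39).xls",
       "Species Count 0511.xls",
       "Species Count2012-01.xls",
       "Species_Count[2013-01-15].xls")  # hypothetical name with other punctuation

# Delete everything that is not a digit or a hyphen;
# a hyphen placed last inside the brackets is taken literally
kept <- gsub("[^0-9-]", "", x)
print(kept)
```

Like the gsub() call above, this assumes the only digits and hyphens in a filename belong to the date.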

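Use the stringr package to extract every string that consists only of digits, or of digits followed by -:

library(stringr)
# tt is the vector of filenames defined above
str_extract(tt,'([0-9]|[0-9][-])+')

[1] "2011-12-15-07-09-39" "0511"                "151112"             
[4] "1011"                "2012-01"             "201207"             
[7] "2013-01-15"         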

If you care about speed, you can use sub with a backreference to extract the part you want. Also note that perl=TRUE is often faster (according to ?grep).
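As a small illustration of the backreference approach on a single filename (a sketch; the second input is a hypothetical name with no digits): the pattern consumes the whole string, and \\1 puts back only the captured span from the first digit to the last.

```r
pattern <- "[^0-9]*([0-9].*[0-9])[^0-9]*"

# Leading non-digits, the captured digit-to-digit span, trailing non-digits:
# the entire string is replaced by capture group 1
one <- sub(pattern, "\\1", "Species Count (2011-12-15-07-09-39).xls", perl = TRUE)

# When the pattern does not match, sub() returns the input unchanged
none <- sub(pattern, "\\1", "README.txt", perl = TRUE)

print(c(one, none))
```

Because non-matching names come back unchanged rather than dropped, the output always has the same length as the input, which differs from the regmatches approach.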
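Comments:

OT: you could also try this regex ([a-zA-Z]：（\\w+*\[a-zA-Z0\U 9]+）？.xls, since all of his files are *.xls.

I think your comment should go directly under the OP's question, or be posted as a separate answer.

This is a perfect approach, much cleaner than what I was trying before, thank you!

Note that adding perl=TRUE makes all of these solutions run faster (gsub gets the largest speedup, although it is still slower than the other solutions). No benchmark was run here.

@Hansi, I tried it on R 3.0.1 on Mac Mountain Lion 10.8.3, and now on R 2.15.2 on a Debian Linux cluster. The order does not change (arun=2, agstudy=2.5, jean=5.5 seconds).

+1. There is nothing wrong with this answer, except that I avoid global search whenever I can. For larger data, this will be slower. Try doing tt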

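Benchmark of the sub approach against regmatches:

library(microbenchmark)

# sub() with a backreference: replace the whole string with the captured span
jj <- function() sub("[^0-9]*([0-9].*[0-9])[^0-9]*", "\\1", tt, perl=TRUE)
# regexpr() + regmatches(): locate the match, then extract it
aa <- function() regmatches(tt, regexpr("[0-9].*[0-9]", tt, perl=TRUE))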

# Run on R-2.15.2 on 32-bit Windows
microbenchmark(arun <- aa(), josh <- jj(), times=25)
# Unit: milliseconds
#           expr       min        lq    median        uq       max
# 1 arun <- aa() 2156.5024 2189.5168 2191.9972 2195.4176 2410.3255
# 2 josh <- jj()  390.0142  390.8956  391.6431  394.5439  493.2545
identical(arun, josh)  # TRUE

# Run on R-3.0.1 on 64-bit Ubuntu
microbenchmark(arun <- aa(), josh <- jj(), times=25)
# Unit: seconds
#          expr      min       lq   median       uq      max neval
#  arun <- aa() 1.794522 1.839044 1.858556 1.894946 2.207016    25
#  josh <- jj() 1.003365 1.008424 1.009742 1.059129 1.074057    25
identical(arun, josh)  # still TRUE