R: search multiple columns for values starting with any of several strings; count the matches
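I have a dataset with 2 million observations. I need to search up to 50 character columns to count (and later filter) the observations that start with any of up to 20 strings.
I have written code that returns a count of how often each string is found, but it is too slow. Running it on 100k observations (9 columns, 33 search strings) takes 2 minutes and appears to scale linearly (which implies ≈30 minutes for the full dataset). I can do this in SAS in a few seconds, and I am running on a fast laptop with an SSD, so I assume my code is the problem (not the machine or the problem itself).
Note: I am converting the search strings to a vector so that I can pass them to apply, which gives a 3x speedup (compared with nested sapply). I tried nested apply statements, but they gave no speedup. As part of the regex syntax, I also prefix each search string with ^ to restrict the search to the beginning of the string. I am open to a completely different approach, but I must be able to search the start of the strings for multiple search strings across multiple columns, and return a count for each search string.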
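Edit/Update
These solutions are much faster than mine. Thank you! Unfortunately, my example search strings were (unintentionally) misleading. Apologies. My actual search strings vary in length, from 2 to 5 characters, and are sometimes all digits. I should have used something more like: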
search_strings <- c("64651","BC","654","DEF","EF","G6","F8","25","I9","J7")
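Prefixes of mixed length like these rule out any approach that compares a fixed number of leading characters. The following is a minimal sketch of a length-agnostic count using base R startsWith(); the data frame toy_df and its values are made up for illustration and do not come from the original post.

```r
# Hypothetical toy data (not from the original post)
toy_df <- data.frame(
  a = c("64651X", "BC123", "ZZZ"),
  b = c("654AB", "EF9", "25"),
  stringsAsFactors = FALSE
)
search_strings <- c("64651", "BC", "654", "DEF", "EF", "G6", "F8", "25", "I9", "J7")

# Flatten all columns into one character vector
all_values <- unlist(toy_df, use.names = FALSE)

# For each prefix, count the values that start with it (any prefix length works)
prefix_counts <- vapply(
  search_strings,
  function(p) sum(startsWith(all_values, p)),
  integer(1L)
)
prefix_counts
```

Note that a single value can be counted under more than one search string when one prefix is itself a prefix of another.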
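Since you don't really care about counts per column, one trick is to unlist() your data.frame. This produces a single vector of all the values. On that vector you can use stringr::str_count to count occurrences of each pattern, and then sum the counts. In short, all the "hard" steps are vectorized, and you only "loop" over the entries in the search strings (which here already carry the ^ prefix, as in the question).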
sapply(search_strings, function(i) sum(stringr::str_count(unlist(df_to_search), i)))
# ^AB ^BC ^CD ^DE ^EF ^G6 ^F8 ^H1 ^I9 ^J7
# 394 392 387 389 359 417 397 780 378 382
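If stringr is not installed, the same loop can be sketched in base R with grepl(). This is a swapped-in equivalent rather than the answer's own code, and values is a made-up vector standing in for unlist(df_to_search).

```r
# Made-up values standing in for unlist(df_to_search)
values <- c("AB1", "XAB", "BC2", "AB3", "CD4")

# Patterns anchored with ^ so that only the start of each value can match
patterns <- c("^AB", "^BC", "^DE")

# grepl() is vectorized over the values; the loop runs only over the patterns
counts <- vapply(patterns, function(p) sum(grepl(p, values)), integer(1L))
counts
```

Because every pattern is anchored, a value can contribute at most one match per pattern, so counting matches (str_count) and counting matching values (grepl) give the same totals here.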
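Edit: a fully vectorized approach, about 4-5 times faster than the sapply version
This can be fully vectorized by collapsing all values into a single string in which every entry is preceded by a dummy separator, e.g. --. The code is the vec2() function further down.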
Update
Just use table() on the output of substr(). It is easy to read, and fast.
starts <- c("AB","BC","CD","DE","EF","G6","F8","H1","I9","J7")
table(substr(unlist(df_to_search, use.names = FALSE), 1, 2))[starts]
##
## AB BC CD DE EF G6 F8 H1 I9 J7
## 394 392 387 389 359 417 397 780 378 382
system.time(table(substr(unlist(df_to_search, use.names = FALSE), 1, 2))[starts])
## user system elapsed
## 0.105 0.000 0.105
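Two things are worth keeping in mind with this one-liner: it assumes every prefix has the same width (2 here), and a prefix that never occurs in the data has no entry in the table, so it does not come back as a count of 0. A possible variant, sketched on a made-up values vector, restricts the factor levels to the prefixes of interest:

```r
# Made-up values standing in for unlist(df_to_search, use.names = FALSE)
values <- c("AB1", "AB2", "BC3", "ZZ9")
starts <- c("AB", "BC", "CD")

# Restricting the factor levels to `starts` keeps unseen prefixes as 0
# and drops prefixes that are not of interest (here "ZZ")
counts <- table(factor(substr(values, 1, 2), levels = starts))
counts
```

The result is already in the order of starts, so no further indexing by name is needed.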
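Original answer
I would do it like this: convert the data.frame values from factor to character, then use startsWith(), which was introduced in R 3.3.0. Performance is quite good.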
# vector of starts you want to check
starts <- c("AB","BC","CD","DE","EF","G6","F8","H1","I9","J7")
# converting the data.frame to character
df_to_search[] <- lapply(df_to_search, as.character)
# searching and tabulating
colSums(vapply(starts, function(x) {
vapply(df_to_search, function(y) sum(startsWith(y, x)), integer(1L))
}, integer(ncol(df_to_search))))
# AB BC CD DE EF G6 F8 H1 I9 J7
# 394 392 387 389 359 417 397 780 378 382
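The question also mentions filtering afterwards. Below is a minimal sketch of that step using the same startsWith() building block, keeping the rows in which any column starts with any of the prefixes. The data frame toy_df is made up, and col_hits and row_hits are names introduced here for illustration only.

```r
# Hypothetical toy data (not from the original post)
toy_df <- data.frame(
  a = c("AB1", "XX2", "YY3"),
  b = c("QQ1", "BC2", "ZZ3"),
  stringsAsFactors = FALSE
)
starts <- c("AB", "BC")

# For one column: TRUE where the value starts with any of the prefixes
col_hits <- function(col) {
  Reduce(`|`, lapply(starts, function(p) startsWith(col, p)))
}

# Across columns: TRUE where at least one column in the row has a hit
row_hits <- Reduce(`|`, lapply(toy_df, col_hits))

# Keep only the matching rows
filtered <- toy_df[row_hits, , drop = FALSE]
filtered
```

The logical vector row_hits can be reused for counting matching observations with sum(row_hits), which differs from the per-string counts above when a row matches more than once.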
Code for the fully vectorized approach (entries joined with the -- separator):
search_strings <- c("AB","BC","CD","DE","EF","G6","F8","H1","I9","J7")
vec2 <- function() {
edited <- paste0("--", search_strings)
vec_to_search <- paste0(paste0("--", unlist(df_to_search)), collapse="")
result <- stringr::str_count(vec_to_search, edited)
names(result) <- search_strings
return(result)
}
vec2()
# AB BC CD DE EF G6 F8 H1 I9 J7
# 394 392 387 389 359 417 397 780 378 382
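A base R sketch of the same separator trick follows, mainly to make explicit the assumption it rests on: the separator (-- here) must never occur inside a value, otherwise the text after it is counted as the start of an entry. The sketch uses gregexpr() with fixed = TRUE in place of stringr::str_count; values is made up and count_fixed is a helper name introduced here.

```r
# Made-up values; the last one deliberately contains the separator
values <- c("AB1", "BC2", "AB3", "X--AB")
starts <- c("AB", "BC", "CD")

# One long string in which every entry is prefixed by the separator
collapsed <- paste0("--", values, collapse = "")

# Count literal (non-regex) occurrences of needle in haystack
count_fixed <- function(needle, haystack) {
  # regmatches() yields character(0) when nothing matches, so lengths() gives 0
  lengths(regmatches(haystack, gregexpr(needle, haystack, fixed = TRUE)))
}

counts <- vapply(paste0("--", starts), count_fixed, integer(1L), haystack = collapsed)
names(counts) <- starts
counts  # "AB" is counted 3 times although only 2 values start with it
```

Choosing a separator that cannot appear in the data (or checking for it first) avoids the overcount.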
# Variant of the table() approach: tabulate() on a factor of the 2-character prefixes
x <- factor(substr(unlist(df_to_search, use.names = FALSE), 1, 2))
setNames(tabulate(x), levels(x))[starts]
# Benchmarks: each approach wrapped in a function, followed by its system.time() output
starts <- c("AB","BC","CD","DE","EF","G6","F8","H1","I9","J7")
df_to_search[] <- lapply(df_to_search, as.character)
myfun <- function() {
colSums(vapply(starts, function(x) {
vapply(df_to_search, function(y) sum(startsWith(y, x)), integer(1L))
}, integer(ncol(df_to_search))))
}
# user system elapsed
# 0.199 0.000 0.199
myfun_unlist <- function() {
temp <- unlist(df_to_search, use.names = FALSE)
vapply(starts, function(x) sum(startsWith(temp, x)), integer(1L))
}
# user system elapsed
# 0.245 0.000 0.245
cPakfun <- function() {
sapply(search_strings, function(i) sum(stringr::str_count(unlist(df_to_search), i)))
}
# user system elapsed
# 5.614 0.000 5.613
cPakfun2 <- function() {
edited <- paste0("--", starts)
vec_to_search <- paste0(paste0("--", unlist(df_to_search)), collapse="")
result <- stringr::str_count(vec_to_search, edited)
names(result) <- starts
return(result)
}
# user system elapsed
# 0.902 0.000 0.901
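# opfun: the apply()-based approach described in the question;
# it needs stringr (str_detect) and magrittr (%>%)
library(stringr)
library(magrittr)
opfun <- function() {
  sapply(search_strings, function(y)
    apply(df_to_search, 1, function(x) {
      str_detect(x, y)
    })) %>% colSums()
}
# user system elapsed
# 44.988 0.000 45.078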
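library(microbenchmark)
library(ggplot2)  # provides the autoplot() generic used at the end
## The benchmark calls myfun_table(), but its definition is not shown;
## this is a reconstruction that wraps the table() one-liner from above
myfun_table <- function() {
  table(substr(unlist(df_to_search, use.names = FALSE), 1, 2))[starts]
}
## Add tabulate to the options
myfun_tabulate <- function() {
  df_to_search[] <- lapply(df_to_search, as.character)
  x <- factor(substr(unlist(df_to_search, use.names = FALSE), 1, 2))
  setNames(tabulate(x), levels(x))[starts]
}
res <- microbenchmark(myfun_tabulate(), myfun_table(), myfun(), myfun_unlist(), cPakfun2())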
# Unit: milliseconds
# expr min lq mean median uq max neval
# myfun_tabulate() 90.19794 100.2941 120.5411 102.7271 153.4527 238.6175 100
# myfun_table() 96.87556 110.1965 146.5356 154.3941 168.2660 562.4599 100
# myfun() 125.68799 127.8053 162.0679 130.0665 182.7757 577.3027 100
# myfun_unlist() 136.92772 138.4104 170.4002 140.0188 198.8845 613.7919 100
# cPakfun2() 859.22835 911.5291 940.6695 935.6335 955.3801 1154.5395 100
autoplot(res, log = FALSE)