R 数据帧中特定值之和的高效重采样_R_Dplyr

R 数据帧中特定值之和的高效重采样

R 数据帧中特定值之和的高效重采样,r,dplyr,R,Dplyr,我的数据如下所示： df <- data.frame( x = c("dog", "dog", "dog", "cat", "cat", "fish", "fish", "fish", "squid", "squid", "squid"), y = c(10, 11, 6, 3, 4, 5, 5, 9, 14, 33, 16) ) 如何改进此重采样/伪引导代码？我不确定这是否更好（没有时间进行基准测试），但您可以避免这里的双循环。你可以先通过动物进行筛选（从而处理子集），

我的数据如下所示：

df <- data.frame(
    x = c("dog", "dog", "dog", "cat", "cat", "fish", "fish", "fish", "squid", "squid", "squid"),
    y = c(10, 11, 6, 3, 4, 5, 5, 9, 14, 33, 16)
)

如何改进此重采样/伪引导代码？

我不确定这是否更好（没有时间进行基准测试），但您可以避免这里的双循环。你可以先通过

动物

进行筛选（从而处理子集），然后对

每个组只取样一次。如果您喜欢

dplyr

，这里有一个可能的

dplyr/tidyr

版本

library(tidyr)
library(dplyr)

ani_samp <- function(animals, n){
  df %>%
    filter(x %in% animals) %>% # Work on a subset
    group_by(x) %>%
    sample_n(n, replace = TRUE) %>% # sample only once per each group
    group_by(x) %>%
    mutate(id = row_number()) %>% # Create an index for rowSums
    spread(x, y) %>% # Convert to wide format for rowSums
    mutate(res = rowSums(.[-1])) %>% # Sum everything at once
    .$res # You don't need this if you want a data.frame result instead
} 

set.seed(123) # For reproducible output
ani_samp(animals, 10)
# [1] 18 24 14 24 19 18 19 19 19 14

library（tidyr）
图书馆（dplyr）
ani_samp%
过滤器（x%在%动物中）%>%#处理子集
分组依据（x）%>%
样本n（n，replace=TRUE）%>%#每组只采样一次
分组依据（x）%>%
mutate（id=row_number（））%>%#为行和创建索引
排列（x，y）%>%#将行和转换为宽格式
mutate（res=rowSums（.[1]））%>%#立即对所有内容求和
.$res#如果您想要data.frame结果，则不需要此选项
} 
设定种子（123）#用于可复制输出
ani_samp（动物，10只）
# [1] 18 24 14 24 19 18 19 19 19 14

另一种方法是：

set.seed(123) ## for reproducibility
n <- 1000 ## number of samples for each animal
samps <- do.call(cbind, lapply(animals, function(x) {sample(df$y[df$x == x], n, replace=TRUE)}))
head(samps, 10)
##      [,1] [,2] [,3]
## [1,]   10    3    5
## [2,]    6    4    5
## [3,]   11    3    5
## [4,]    6    4    5
## [5,]    6    4    5
## [6,]   10    3    5
## [7,]   11    4    5
## [8,]    6    3    5
## [9,]   11    3    5
##[10,]   11    3    5
sum <- as.vector(samps %*% rep(1,length(animals)))
head(sum, 10)
##[1] 18 15 19 15 15 18 20 14 19 19

这也应该与样本数成比例

library(tidyr)
library(dplyr)

ani_samp <- function(animals, n){
  df %>%
    filter(x %in% animals) %>% # Work on a subset
    group_by(x) %>%
    sample_n(n, replace = TRUE) %>% # sample only once per each group
    group_by(x) %>%
    mutate(id = row_number()) %>% # Create an index for rowSums
    spread(x, y) %>% # Convert to wide format for rowSums
    mutate(res = rowSums(.[-1])) %>% # Sum everything at once
    .$res # You don't need this if you want a data.frame result instead
} 

set.seed(123) # For reproducible output
ani_samp(animals, 10)
# [1] 18 24 14 24 19 18 19 19 19 14

set.seed(123) ## for reproducibility
n <- 1000 ## number of samples for each animal
samps <- do.call(cbind, lapply(animals, function(x) {sample(df$y[df$x == x], n, replace=TRUE)}))
head(samps, 10)
##      [,1] [,2] [,3]
## [1,]   10    3    5
## [2,]    6    4    5
## [3,]   11    3    5
## [4,]    6    4    5
## [5,]    6    4    5
## [6,]   10    3    5
## [7,]   11    4    5
## [8,]    6    3    5
## [9,]   11    3    5
##[10,]   11    3    5
sum <- as.vector(samps %*% rep(1,length(animals)))
head(sum, 10)
##[1] 18 15 19 15 15 18 20 14 19 19

n <- 1000 ## number of samples for each animal
system.time(as.vector(do.call(cbind, lapply(animals, function(x) {sample(df$y[df$x == x], n, replace=TRUE)})) %*% rep(1,length(animals))))
##   user  system elapsed 
##  0.001   0.000   0.001