Bootstrap sampling based on multiple factors


I have a data frame that looks like this:

   Factor1 Factor2      Value
1        A       1 -0.1169027
2        B       1  0.4153005
3        B       2 -1.8824073
4        B       3  0.2627502
5        C       1  0.8822784
6        C       2  0.5011568
7        C       3  0.2332566
8        C       4  0.1897866
9        C       5 -1.4404080
10       C       6  0.3414159
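For anyone who wants to run the code in this thread, here is a minimal sketch that rebuilds the example data. The name df matches the code below; storing Factor2 as a numeric column is an assumption (the printed tibbles further down show it as character in the asker's own data).

```r
# Rebuild the example data shown above.
# Assumption: Factor2 is numeric; Factor1 is kept as character.
df <- data.frame(
  Factor1 = c("A", "B", "B", "B", "C", "C", "C", "C", "C", "C"),
  Factor2 = c(1, 1, 2, 3, 1, 2, 3, 4, 5, 6),
  Value   = c(-0.1169027, 0.4153005, -1.8824073, 0.2627502, 0.8822784,
              0.5011568, 0.2332566, 0.1897866, -1.4404080, 0.3414159),
  stringsAsFactors = FALSE
)

# Number of distinct Factor2 levels within each Factor1 level.
n_levels <- tapply(df$Factor2, df$Factor1, function(x) length(unique(x)))
max(n_levels)  # 6 for this data (A has 1 level, B has 3, C has 6)
```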
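I would like to write code that, for each level of Factor1, stores samples in a new data frame called bootstrap, where the number of samples per Factor1 level equals the largest number of distinct Factor2 levels found in any Factor1 level.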

library(tidyverse)
sampleGroups <- df %>%
  group_by(Factor1) %>%
  select(Factor1, Factor2) %>%
  summarise(n_distinct(Factor2))
sampleGroups ## max = 6
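In the result I am after, Factor1 = A is repeated 6 times and Factor1 = B is repeated 6 times. For Factor1 = B, every level of Factor2 is selected first, and the remaining rows are filled by bootstrapping (sampling with replacement) from the Factor2 levels of B. Factor1 = C is selected 6 times, because C is where the largest number of unique Factor2 levels occurs.

My real data set has 20 levels of Factor1, with up to 17 unique levels of Factor2 nested within Factor1.

Is something like this easy to do in R, perhaps with dplyr? I have some code that randomly picks one Factor2 sample for each level of Factor1, but I don't know how to force it to select every Factor2 level within each Factor1 level and then sample with replacement where necessary.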


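This should do the trick. The idea is to split the data by Factor1 and then rbind each split with a resample of itself. The size of that resample is the difference between the maximum Factor2 value in the whole data set and the maximum Factor2 value within the split, so every split is padded up to the same number of rows.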

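library(tidyverse)

df %>%
  # overall maximum; assumes Factor2 is numeric and coded 1..k within each Factor1
  mutate(max_n = max(Factor2)) %>%
  split(.$Factor1) %>%
  # pad each split with (max_n - its own maximum) rows drawn with replacement;
  # a size of 0 adds nothing, so the largest group is left unchanged
  map_dfr(~rbind(., sample_n(., mean(.$max_n) - max(.$Factor2), replace = TRUE))) %>%
  select(-max_n)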

#    Factor1 Factor2   Value
# 1        A       1 -0.1169
# 2        A       1 -0.1169
# 3        A       1 -0.1169
# 4        A       1 -0.1169
# 5        A       1 -0.1169
# 6        A       1 -0.1169
# 7        B       1  0.4153
# 8        B       2 -1.8824
# 9        B       3  0.2628
# 10       B       1  0.4153
# 11       B       1  0.4153
# 12       B       1  0.4153
# 13       C       1  0.8823
# 14       C       2  0.5012
# 15       C       3  0.2333
# 16       C       4  0.1898
# 17       C       5 -1.4404
# 18       C       6  0.3414
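The code above uses max(Factor2) as a stand-in for the number of levels, which only works when Factor2 is numeric and coded 1..k inside each group. As a hedged alternative, the sketch below counts rows per group instead. It assumes each Factor1/Factor2 combination occupies exactly one row, and the names target and boot are made up for illustration; it is not the answerer's method, just one way to express the same padding idea.

```r
library(dplyr)

# Example data from the question (Factor2 assumed numeric here).
df <- data.frame(
  Factor1 = c("A", "B", "B", "B", "C", "C", "C", "C", "C", "C"),
  Factor2 = c(1, 1, 2, 3, 1, 2, 3, 4, 5, 6),
  Value   = c(-0.1169027, 0.4153005, -1.8824073, 0.2627502, 0.8822784,
              0.5011568, 0.2332566, 0.1897866, -1.4404080, 0.3414159),
  stringsAsFactors = FALSE
)

set.seed(1)  # only so that repeated runs give the same draw

# Target group size: the largest number of rows in any Factor1 level.
target <- max(table(df$Factor1))

boot <- df %>%
  group_by(Factor1) %>%
  # Keep every original row, then append rows drawn with replacement
  # until the group has `target` rows (zero extra rows for the largest group).
  group_modify(~ bind_rows(
    .x,
    .x[sample.int(nrow(.x), target - nrow(.x), replace = TRUE), ]
  )) %>%
  ungroup()

table(boot$Factor1)  # every Factor1 level now has 6 rows
```

Because the original rows are always kept, every Factor2 level of a group is guaranteed to appear at least once; only the padding rows are random.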

Please always add the libraries you use to your code.
@jaySf, I have added library(tidyverse).
Fantastic, this is exactly what I wanted!
The code referred to in the question, which draws one Factor2 sample per Factor1 level in each of 10 bootstraps:

dfBoot <- tibble(Bootstrap = integer(0), Factor1 = character(0), Factor2 = character(0))
for (i in 1:10) {
    selected <- df %>%
        group_by(Factor1) %>%
        select(Factor1, Factor2) %>%
        sample_n(1) %>%
        mutate(Bootstrap = i)
    dfBoot <- bind_rows(dfBoot, selected)
}
dfBoot
# A tibble: 30 x 3
   Bootstrap Factor1 Factor2
       <int> <chr>   <chr>  
 1         1 A       1      
 2         1 B       2      
 3         1 C       1      
 4         2 A       1      
 5         2 B       1      
 6         2 C       5      
 7         3 A       1      
 8         3 B       2      
 9         3 C       3      
10         4 A       1      
# ... with 20 more rows

A variant of the same loop that draws 6 rows per Factor1 level with replacement. This brings every group up to 6 rows, but it does not guarantee that every Factor2 level of a group is selected:

dfBoot <- tibble(Bootstrap = integer(0), Factor1 = character(0), Factor2 = character(0))
for (i in 1:10) {
    selected <- df %>%
        group_by(Factor1) %>%
        select(Factor1, Factor2) %>%
        # sample with replacement this time
        sample_n(6, replace = TRUE) %>%
        mutate(Bootstrap = i)
    dfBoot <- bind_rows(dfBoot, selected)
}
dfBoot
# A tibble: 180 x 3
   Bootstrap Factor1 Factor2
       <int> <chr>   <chr>  
 1         1 A       1      
 2         1 A       1      
 3         1 A       1      
 4         1 A       1      
 5         1 A       1      
 6         1 A       1      
 7         1 B       1      
 8         1 B       3      
 9         1 B       2      
10         1 B       2      
# ... with 170 more rows
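
To combine the two ideas in this thread (keep every Factor2 level, pad with replacement, repeat for several bootstraps), something along these lines might work. It is only a sketch: pad_group and one_replicate are hypothetical helper names, Factor2 is assumed numeric, and each Factor1/Factor2 combination is assumed to be a single row.

```r
library(dplyr)

# Example data from the question (Factor2 assumed numeric here).
df <- data.frame(
  Factor1 = c("A", "B", "B", "B", "C", "C", "C", "C", "C", "C"),
  Factor2 = c(1, 1, 2, 3, 1, 2, 3, 4, 5, 6),
  Value   = c(-0.1169027, 0.4153005, -1.8824073, 0.2627502, 0.8822784,
              0.5011568, 0.2332566, 0.1897866, -1.4404080, 0.3414159),
  stringsAsFactors = FALSE
)

set.seed(1)  # only so that repeated runs give the same draws

# Largest number of rows in any Factor1 level.
target <- max(table(df$Factor1))

# Hypothetical helper: keep all rows of one group and append rows
# sampled with replacement until the group has `size` rows.
pad_group <- function(g, size) {
  extra <- g[sample.int(nrow(g), size - nrow(g), replace = TRUE), ]
  bind_rows(g, extra)
}

# Hypothetical helper: build one padded replicate and label it with i.
one_replicate <- function(i) {
  df %>%
    group_by(Factor1) %>%
    group_modify(~ pad_group(.x, target)) %>%
    ungroup() %>%
    mutate(Bootstrap = i)
}

# Build all replicates in a list and bind once, instead of growing
# a data frame inside a for loop.
dfBoot <- bind_rows(lapply(1:10, one_replicate))
```

Each of the 10 replicates then contains 6 rows per Factor1 level, with all original Factor2 levels present and only the padding rows differing between replicates.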