Using the boot::boot() function with grouped variables in R


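This is a question about using the boot() function with grouped variables, and also about passing multiple columns of data into boot(). Almost all examples of boot() seem to pass a single column of data to bootstrap a simple mean.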

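My specific analysis uses stats::weighted.mean(x, w), which computes a mean from a vector of values "x" and takes a second vector "w" of weights. The main point is that I need two inputs to this function, and I would like a solution that generalizes to any function taking multiple arguments.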

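I am also looking for a way to use weighted.mean() in a dplyr-style workflow with group_by() variables. If the answer is "this cannot be done with dplyr", that is fine; I am just trying to figure it out.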

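Below I simulate a dataset with three groups (A, B, C), each with a different range of counts. I also attempt a function, "my.function", for bootstrapping the weighted mean. This may be my first mistake: is this how I should set up a function so that the "counts" and "weights" columns of the data are passed into each bootstrap sample? Is there another way to index the data?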

In the summarise() call, I refer to the original data with "."; this may be another mistake.

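The end result shows that I get the correct grouped calculations with mean() and weighted.mean(), but the boot() call returns the same 95% confidence interval for every group, one computed around the global mean of the whole dataset.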

Any suggestions on what I am doing wrong? Why does boot() refer to the entire dataset rather than the grouped subsets?

library(tidyverse)
library(boot)


set.seed(20)

sample.data = data.frame(letter = rep(c('A','B','C'),each = 50) %>% as.factor(),
                         counts = c(runif(50,10,30), runif(50,40,60), runif(50,60,100)),
                         weights = sample(10,150, replace = TRUE))



##Define function to bootstrap
  ##I'm using stats::weighted.mean() which needs to take in two arguments

##############
my.function = function(data,index){

  d = data[index,]  #create bootstrap sample of all columns of original data?
  return(weighted.mean(d$counts, d$weights))  #calculate weighted mean using 'counts' and 'weights' columns
  
}

##############

## group by 'letter' and calculate weighted mean, and upper/lower 95% CI limits

## I pass data to boot using "." thinking that this would only pass each grouped subset of data 
  ##(e.g., only letter "A") to boot, but instead it seems to pass the entire dataset. 

sample.data %>% 
  group_by(letter) %>% 
  summarise(avg = mean(counts),
            wtd.avg = weighted.mean(counts, weights),
            CI.LL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[4],
            CI.UL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[5])

Below, I compute a rough 95% confidence interval for the global mean, to show that this is what boot() is producing in my summarise() call above.

#Here is a rough 95% confidence interval estimate as +/-  1.96* Standard Error


mean(sample.data$counts) + c(-1,1) * 1.96 * sd(sample.data$counts)/sqrt(length(sample.data[,1]))

The base R solution below solves the problem of bootstrapping by group. Note that boot::boot is called only once per group.

library(boot)

sp <- split(sample.data, sample.data$letter)
y <- lapply(sp, function(x){
  wtd.avg <- weighted.mean(x$counts, x$weights)
  basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
  CI.LL <- basic[4]
  CI.UL <- basic[5]
  data.frame(wtd.avg, CI.LL, CI.UL)
})

do.call(rbind, y)
#   wtd.avg    CI.LL    CI.UL
#A 19.49044 17.77139 21.16161
#B 50.49048 48.79029 52.55376
#C 82.36993 78.80352 87.51872
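
The question also asks how to generalize to functions that take several arguments. One possible sketch, not part of the original answer: boot() forwards extra named arguments to the statistic through `...`, so a small wrapper can resample two paired vectors with the same indices. The names `paired.stat`, `case.wts`, and `fun` are hypothetical. Very short names such as `w` are deliberately avoided, because R's partial argument matching would bind `w` to boot()'s own `weights` argument.

```r
library(boot)

set.seed(20)

# Same simulated data as in the question
sample.data <- data.frame(
  letter  = factor(rep(c("A", "B", "C"), each = 50)),
  counts  = c(runif(50, 10, 30), runif(50, 40, 60), runif(50, 60, 100)),
  weights = sample(10, 150, replace = TRUE)
)

# Hypothetical wrapper: 'i' is the vector of resampled indices supplied by
# boot(); 'case.wts' and 'fun' arrive unchanged through boot()'s '...'
paired.stat <- function(x, i, case.wts, fun) fun(x[i], case.wts[i])

# Work on one group only, so the resampling stays within that group
grp.A <- sample.data[sample.data$letter == "A", ]

b <- boot(grp.A$counts, paired.stat, R = 200,
          case.wts = grp.A$weights, fun = weighted.mean)

b$t0                          # statistic on the original, unresampled group
boot.ci(b, type = "basic")    # basic bootstrap interval for group A
```

Passing the whole data frame and indexing its rows, as `my.function` does, generalizes just as well; the wrapper above is only an alternative when the inputs are already separate vectors.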

A dplyr solution could look like the following. It also calls map_dfr from package purrr.

library(boot)
library(dplyr)

sample.data %>%
  group_split(letter) %>% 
  purrr::map_dfr(
    function(x){
      wtd.avg <- weighted.mean(x$counts, x$weights)
      basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
      CI.LL <- basic[4]
      CI.UL <- basic[5]
      data.frame(wtd.avg, CI.LL, CI.UL)
    }
  )
#   wtd.avg    CI.LL    CI.UL
#1 19.49044 17.77139 21.16161
#2 50.49048 48.79029 52.55376
#3 82.36993 78.80352 87.51872
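
As for why the original attempt used the whole dataset: with the magrittr pipe, "." stands for the left-hand side of the pipe, which here is the entire grouped data frame, and summarise() does not rebind it to the current group. A minimal sketch of staying inside group_by() and summarise(), assuming dplyr 1.1.0 or later for pick(), follows. The helper name `boot.wtd.ci` is hypothetical and not part of the original answer.

```r
library(boot)
library(dplyr)  # pick() is assumed to be available (dplyr >= 1.1.0)

set.seed(20)

# Same simulated data and statistic as in the question
sample.data <- data.frame(
  letter  = factor(rep(c("A", "B", "C"), each = 50)),
  counts  = c(runif(50, 10, 30), runif(50, 40, 60), runif(50, 60, 100)),
  weights = sample(10, 150, replace = TRUE)
)

my.function <- function(data, index) {
  d <- data[index, ]                    # bootstrap sample of rows
  weighted.mean(d$counts, d$weights)
}

# Hypothetical helper: runs boot() once and returns both limits as columns
boot.wtd.ci <- function(d, R = 200) {
  basic <- boot.ci(boot(d, my.function, R = R), type = "basic")$basic
  data.frame(CI.LL = basic[4], CI.UL = basic[5])
}

res <- sample.data %>%
  group_by(letter) %>%
  summarise(
    wtd.avg = weighted.mean(counts, weights),
    # pick() returns only the current group's rows, unlike "."
    # an unnamed data frame result is unpacked into separate columns
    boot.wtd.ci(pick(counts, weights))
  )

res
```

The exact limits depend on the seed and on R, so no expected output is shown here.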
