R: automatically compute summary statistics for a data frame and create a new table


I have the following data frame:

col1 <- c("avi","chi","chi","bov","fox","bov","fox","avi","bov",
          "chi","avi","chi","chi","bov","bov","fox","avi","bov","chi")
col2 <- c("low","med","high","high","low","low","med","med","med","high",
          "low","low","high","high","med","med","low","low","med")
col3 <- c(0,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0)

test_data <- cbind(col1, col2, col3)
test_data <- as.data.frame(test_data)
The percent-resistant column I want is based on col3 above, where 1 = resistant and 0 = not resistant. I tried the following:

library(dplyr)
test_data<-test_data %>%
  count(col1,col2,col3) %>%
  group_by(col1, col2) %>%
  mutate(perc_res = prop.table(n)*100)
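# Confidence interval wanted for each group (placeholder variable names):
binom.test(resistant_samples, total_samples)$conf.int * 100
However, I don't know how to combine this with the rest. Is there a simple, quick way to do this?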

This should do it:

library(tidyverse)
library(broom)

test_data %>%
  mutate(col3 = ifelse(col3 == 0, "NonResistant", "Resistant")) %>%
  count(col1, col2, col3) %>%
  spread(col3, n, fill = 0) %>%
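  # Multiply by 100 so the value is a percentage, on the same scale as the CI columns
  mutate(PercentResistant = Resistant / (NonResistant + Resistant) * 100) %>%
  mutate(test = map2(Resistant, NonResistant, ~ binom.test(.x, .x + .y) %>% tidy())) %>%
  # Name the list-column explicitly; a bare unnest() is deprecated in current tidyr
  unnest(test) %>%
  transmute(Species = col1, Pop.density = col2, PercentResistant,
            CI_low = conf.low * 100, CI_high = conf.high * 100,
            TotalSamples = Resistant + NonResistant)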
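  • Recode the 0/1 resistance column so that it has readable values
  • Count the values in each bucket
  • Spread col3/n into two columns, Resistant and NonResistant, and put the counts (n) in those columns. Each row now has everything needed to run the test
  • Compute the percentage resistant
  • Run the test for each bucket and put the result in a nested data frame called test
  • Unnest the test data frame so you can work with the test results
  • Clean up and give everything nice names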
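
To make the per-bucket test concrete, here is a minimal base-R sketch for a single bucket. It uses the counts for chi/high from the question's data (2 resistant samples out of 3); the variable names are illustrative and not part of the pipeline above.

```r
# One bucket only: col1 == "chi", col2 == "high" in the question's data.
resistant <- 2   # rows with col3 == 1
total     <- 3   # all rows in the bucket

# Exact binomial test; conf.int holds a 95% interval by default.
bt  <- binom.test(resistant, total)
pct <- resistant / total * 100
ci  <- as.numeric(bt$conf.int) * 100   # express the interval as percentages

result <- round(c(percent = pct, ci_low = ci[1], ci_high = ci[2]), 2)
print(result)
# percent  ci_low ci_high
#   66.67    9.43   99.16
```

With only three samples the interval is very wide, which is why the table reports the sample count next to each percentage.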

    I would do:

    library(data.table)
    setDT(DT)  # DT is the data frame defined at the end of this answer
    
    DT[, { 
      bt <- binom.test(sum(resists), .N)$conf.int*100
      .(res_rate = mean(resists)*100, res_lo = bt[1], res_hi = bt[2], n = .N)
    }, keyby=.(species, popdens)]
    
        species popdens  res_rate    res_lo    res_hi n
     1:     avi     low   0.00000  0.000000  70.75982 3
     2:     avi     med   0.00000  0.000000  97.50000 1
     3:     bov     low 100.00000 15.811388 100.00000 2
     4:     bov     med  50.00000  1.257912  98.74209 2
     5:     bov    high 100.00000 15.811388 100.00000 2
     6:     chi     low   0.00000  0.000000  97.50000 1
     7:     chi     med  50.00000  1.257912  98.74209 2
     8:     chi    high  66.66667  9.429932  99.15962 3
     9:     fox     low   0.00000  0.000000  97.50000 1
    10:     fox     med  50.00000  1.257912  98.74209 2
    

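    I'd suggest using group_by first and then the summarise function.

    Use data.frame(col1, col2, col3) rather than cbind, which coerces every column to character here.

    The example data has no ("avi", "high") pair. Do you want that row to show up anyway (with NAs and a sample count of zero)?

    If they don't exist, I don't need them to show.

    Great solution! May I ask where conf.low comes from? After unnest() I only see estimate and statistic.

    @PoGibas: conf.low comes from tidy() and is then unnested. If you see estimate, it should be there. Wild guess: your window isn't wide enough, and below the result there is a "… with X more variables" note?

    Ah, it's a tbl_df; piping into %>% data.frame() shows it. I can't get used to tibbles.

    This is great! Which part comes from the "broom" package? Is it the nesting/unnesting and transmute?

    @Haakonkas: broom provides the tidy() method, which converts a model into a data frame.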
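
Two of the comments above (build the data with data.frame() rather than cbind(), and use group_by() followed by summarise()) can be combined into a short dplyr-only sketch. This is one possible reading of those suggestions, not code posted in the thread; the output column names (total, perc_res, ci_low, ci_high) are made up for illustration, and the .groups argument assumes dplyr 1.0 or later.

```r
library(dplyr)

col1 <- c("avi","chi","chi","bov","fox","bov","fox","avi","bov",
          "chi","avi","chi","chi","bov","bov","fox","avi","bov","chi")
col2 <- c("low","med","high","high","low","low","med","med","med","high",
          "low","low","high","high","med","med","low","low","med")
col3 <- c(0,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0)

# data.frame() keeps col3 numeric; cbind() would turn all three columns into character.
test_data <- data.frame(col1, col2, col3)

summary_tbl <- test_data %>%
  group_by(col1, col2) %>%
  summarise(
    total    = n(),
    perc_res = mean(col3) * 100,
    # One exact binomial interval per group, expressed as percentages
    ci_low   = binom.test(sum(col3), n())$conf.int[1] * 100,
    ci_high  = binom.test(sum(col3), n())$conf.int[2] * 100,
    .groups  = "drop"
  )

# as.data.frame() prints every column even in a narrow console
print(as.data.frame(summary_tbl))
```

Because col3 is numeric here, sum(col3) is the number of resistant samples and mean(col3) is the resistant fraction, so no recoding or reshaping step is needed.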
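    With data.table, to also list combinations that have no samples (such as avi/high), join against every species/popdens pair:

    DT[CJ(species = species, popdens = popdens, unique = TRUE), on=.(species, popdens), {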
      bt <- 
        if (.N > 0L) binom.test(sum(resists), .N)$conf.int*100 
        else NA_real_
      .(res_rate = mean(resists)*100, res_lo = bt[1], res_hi = bt[2], n = .N)    
    }, by=.EACHI]
    
        species popdens  res_rate    res_lo    res_hi n
     1:     avi     low   0.00000  0.000000  70.75982 3
     2:     avi     med   0.00000  0.000000  97.50000 1
     3:     avi    high        NA        NA        NA 0
     4:     bov     low 100.00000 15.811388 100.00000 2
     5:     bov     med  50.00000  1.257912  98.74209 2
     6:     bov    high 100.00000 15.811388 100.00000 2
     7:     chi     low   0.00000  0.000000  97.50000 1
     8:     chi     med  50.00000  1.257912  98.74209 2
     9:     chi    high  66.66667  9.429932  99.15962 3
    10:     fox     low   0.00000  0.000000  97.50000 1
    11:     fox     med  50.00000  1.257912  98.74209 2
    12:     fox    high        NA        NA        NA 0
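
As a rough illustration of what the cross join contributes, the sketch below uses a small made-up table (toy, with hypothetical values) rather than the question's data. CJ() with unique = TRUE builds every combination of the distinct values, and joining with by = .EACHI returns one row per combination, with a count of zero where nothing matches.

```r
library(data.table)

# Hypothetical toy data: 3 distinct species, 2 distinct densities, 4 observed rows
toy <- data.table(
  species = c("avi", "chi", "chi", "bov"),
  popdens = c("low", "med", "low", "med"),
  resists = c(0, 1, 1, 0)
)

# Every species/popdens combination, duplicates removed: 3 x 2 = 6 rows
grid <- CJ(species = toy$species, popdens = toy$popdens, unique = TRUE)

# One output row per row of grid; .N is 0 for combinations absent from toy
counts <- toy[grid, on = .(species, popdens), .(n = .N), by = .EACHI]
print(counts)
```

The zero-count rows are the ones that need the .N > 0L guard in the answer's code, since binom.test() cannot be run on zero trials.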
    
    Data used:

    DT = data.frame(
      species = col1, 
      popdens = factor(col2, levels=c("low", "med", "high")), 
      resists = col3
    )