Computing the mean and variance with dplyr without all the data


I have a dataset that looks like this:

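set.seed(50)                 # make the random sample reproducible
n <- 20                      # number of sites
s_num <- c(10, 20, 30)       # possible strata
counts <- c(0, 1, 2, 3, 4)   # possible observed counts

strata <- sample(s_num, n, replace = TRUE)
sites <- seq(1, n, by = 1)
observed <- sample(counts, n, replace = TRUE)

df <- as.data.frame(cbind(strata, sites, observed))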

We can create a logical condition for subsetting:

df %>%
   mutate(ind = observed != 0) %>%
   group_by(strata) %>%
   summarise(mcount = mean(observed[ind]), varcount = var(observed[ind]))
# A tibble: 3 x 3
#  strata mcount varcount
#   <dbl>  <dbl>    <dbl>
#1     10   1.89    0.861
#2     20   1.6     0.8  
#3     30   3       0.667

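Note: as.data.frame(cbind(...)) is not recommended, because cbind converts its arguments to a matrix (a matrix can hold only one class). If any column is character, every column ends up as factor or character once wrapped in as.data.frame. Use data.frame(strata, sites, observed) instead.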
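
To illustrate the coercion described in the note, here is a minimal base R sketch. The vectors id and label are hypothetical and are not part of the question's data. In the question itself all three columns are numeric, so cbind happens to be harmless there.

```r
# Hypothetical toy vectors, not taken from the question's data
id    <- c(1, 2, 3)
label <- c("a", "b", "c")

# cbind() builds a matrix, and a matrix holds a single type,
# so the numeric ids are coerced to character
m <- cbind(id, label)
is.character(m)           # TRUE

# Wrapping the matrix in as.data.frame() does not restore the numeric type
via_cbind <- as.data.frame(m)
is.numeric(via_cbind$id)  # FALSE

# data.frame() keeps each column's own type
direct <- data.frame(id, label)
is.numeric(direct$id)     # TRUE
```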

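Once count_plot has been calculated, you can compute the mean and variance manually from their formulas.

The variance is computed as sum((x - mean(x))^2) / (length(x) - 1)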
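
As a rough check of how that formula extends to the case where the zero counts have been dropped, the base R sketch below rebuilds the mean and variance from the non-zero values and the known number of sites. The vector x_all and the names x_nz, N, n, m and v are assumptions made for illustration only.

```r
# Hypothetical counts for one stratum, including sites with zero observations
x_all <- c(0, 2, 1, 0, 3, 4)
N     <- length(x_all)        # total number of sites, assumed to be known

# Suppose only the non-zero observations are available
x_nz <- x_all[x_all != 0]
n    <- length(x_nz)

# Zeros add nothing to the sum, so the mean only needs N as the denominator
m <- sum(x_nz) / N

# Each dropped zero contributes (0 - m)^2 = m^2 to the sum of squares
v <- (sum((x_nz - m)^2) + (N - n) * m^2) / (N - 1)

# Both agree with the results computed on the full vector
all.equal(m, mean(x_all))
all.equal(v, var(x_all))
```

The same idea carries over to grouped data: the per-stratum site count plays the role of N, and the number of remaining rows plays the role of n.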


You can add a filter to the pipeline:

df2 <- df %>%
 filter(observed != 0) %>%
 group_by(strata) %>%
 summarise(mcount = mean(observed),
          varcount = var(observed))

This way you don't need to create an intermediate data frame.

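This one is more elegant. As I understand it, the question is how to compute the mean and variance without using df.

Yes, sorry, I wasn't clear. In this case I don't have the original "df".

Great, thanks. I wasn't sure sum(()) would work for each row, but it does.
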
df4 <- df3 %>%
  group_by(strata) %>%
  summarise(mcount = mean(observed),
            varcount = var(observed))
site_count <- df %>%
  group_by(strata) %>%
  summarise(count_plot = n_distinct(sites))
# Join the known number of sites per stratum, then account for the dropped
# zeros: each one adds (0 - mcount)^2 = mcount^2 to the sum of squares
df3 %>%
  left_join(site_count, by = "strata") %>%
  group_by(strata) %>%
  summarise(N        = unique(count_plot),
            mcount   = sum(observed)/N,
            varcount = sum((observed - mcount)^2, (N - n())*mcount^2)/(N - 1)) %>%
  select(-N)

# A tibble: 3 x 3
#   strata mcount varcount
#    <dbl>  <dbl>    <dbl>
# 1   10.0   1.89    0.861
# 2   20.0   1.33    1.07
# 3   30.0   2.40    2.30
df2

# A tibble: 3 x 3
  strata mcount varcount
   <dbl>  <dbl>    <dbl>
1   10.0   1.89    0.861
2   20.0   1.33    1.07 
3   30.0   2.40    2.30 