dplyr根据是否在组中进行总结_R_Dplyr_Tidyverse

dplyr根据是否在组中进行总结

dplyr根据是否在组中进行总结,r,dplyr,tidyverse,R,Dplyr,Tidyverse,我正在寻找一种方法来制作一个汇总表，将R中某个因子的每个级别的值与该因子的所有其他级别的值进行比较。iris数据集的一个例子——我想将setosa与所有其他数据集（即，VersionColor和virginica）进行比较，然后将VersionColor与其他数据集（setosa和virginica）进行比较，最后将virignica与其他数据集（VersionColor和setosa）进行比较。在我的实际数据集中，我有许多组，所以我不想硬编码每个级别。如果可能的话，我正在寻找一个整洁的解决方案

我正在寻找一种方法来制作一个汇总表，将R中某个因子的每个级别的值与该因子的所有其他级别的值进行比较。iris数据集的一个例子——我想将setosa与所有其他数据集（即，VersionColor和virginica）进行比较，然后将VersionColor与其他数据集（setosa和virginica）进行比较，最后将virignica与其他数据集（VersionColor和setosa）进行比较。在我的实际数据集中，我有许多组，所以我不想硬编码每个级别。如果可能的话，我正在寻找一个整洁的解决方案。我想要的结果是一个汇总表，如下所示：

这里，“in group”中的“yes”是指该组中的物种（因此对于setosa，它将仅是setosa），而“no”是指不在该组中的物种（因此对于setosa，它将是杂色和维吉尼亚的组合）。

在下面的代码中，我们使用

map

对

物种的每个级别分别进行操作。对于每个迭代，我们在.group

中创建一个分组列

，标记一行是否是给定物种的成员。然后，我们按组返回所有数值列的平均值：
库（tidyverse）
唯一（如字符（虹膜$物种））%>%
设置_名称（）%>%
地图(
~iris%>%
分组依据（in.group=物种==.x）%>%
总结（跨越（其中（是数字）、平均值、.names=“平均值”{col}”），
.id=“物种”
)
#>#tibble:6 x 6
#>组中物种平均萼片长度平均萼片宽度平均花瓣长度
#>                                                  
#>1 setosa FALSE 6.26 2.87 4.91
#>2 setosa TRUE 5.01 3.43 1.46
#>3彩色假5.80 3.20 3.51
#>4彩色真5.94 2.77 4.26
#>5弗吉尼亚假5.47 3.10 2.86
#>6弗吉尼亚州真6.59 2.97 5.55
#>#…还有一个变量：mean_Petal.Width

您还可以将以下内容添加到链中，以使输出更加经济：
  mutate(Species = case_when(in.group ~ Species,
                             !in.group ~ paste("not", Species))) %>% 
  select(-in.group)

其中：
  Species        mean_Sepal.Length mean_Sepal.Width mean_Petal.Length mean_Petal.Width
1 not setosa                  6.26             2.87              4.91            1.68 
2 setosa                      5.01             3.43              1.46            0.246
3 not versicolor              5.80             3.20              3.51            1.14 
4 versicolor                  5.94             2.77              4.26            1.33 
5 not virginica               5.47             3.10              2.86            0.786
6 virginica                   6.59             2.97              5.55            2.03 

您可以将其打包为函数：
compare.groups = function(data, group) {
  
  group = ensym(group)
  
  # Get levels of group
  x = data %>% 
    distinct(!!group) %>% 
    pull(!!group) %>% 
    as.character %>% 
    set_names() 
  
  # Map over each level
  x %>% 
    map_df(
      ~ data %>% 
          group_by(in.group = !!group == .x) %>% 
          summarise(across(where(is.numeric), mean, .names="mean_{col}")),
      .id=as_label(enquo(group))
    ) %>%
    mutate(!!group := case_when(in.group ~ !!group,
                                !in.group ~ paste("not", !!group))) %>% 
    select(-in.group)
}

# Run the function on a couple of data frames
compare.groups(iris, Species)
compare.groups(diamonds, cut)

您还可以使用该函数获取数据框中所有分类列的结果：
diamonds %>% 
  select(where(~!is.numeric(.))) %>% 
  names() %>% 
  set_names() %>% 
  map_df(
    ~compare.groups(diamonds, !!.x) %>% 
      rename(category = .x),
    .id="variable"
  ) 

1）我们可以在dplyr
本身内完成这项工作。按“种类”分组，通过使用cur\u group\u id（）
将列的平均值和完整数据列的子集连接起来，从而总结出我们需要的列，即“萼片长度”（除了dplyr
，没有使用其他包）
-输出
# A tibble: 6 x 6
#  Species    InGroup MeanSepal.Length MeanSepal.Width MeanPetal.Length MeanPetal.Width
#  <fct>      <chr>              <dbl>           <dbl>            <dbl>           <dbl>
#1 setosa     Yes                 5.01            3.43             1.46           0.246
#2 setosa     No                  6.26            2.87             4.91           1.68 
#3 versicolor Yes                 5.94            2.77             4.26           1.33 
#4 versicolor No                  5.80            3.20             3.51           1.14 
#5 virginica  Yes                 6.59            2.97             5.55           2.03 
#6 virginica  No                  5.47            3.10             2.86           0.786


4）或者另一种选择是取总和的差值
除以行数的差值
iris %>% 
    group_by(Species) %>% 
    summarise(InGroup = c('Yes', 'No'), across(where(is.numeric), 
       ~  c(mean(.), (sum(iris[[cur_column()]]) - 
           sum(.))/(nrow(iris) - n())), .names = 'Mean{.col}'), .groups = 'drop')

iris %>% 
    group_by(Species) %>% 
    summarise(InGroup = c('Yes', 'No'), 
       across(where(is.numeric),  ~ c(mean(.),
       mean(iris[[cur_column()]][
         as.numeric(iris$Species) != cur_group_id()])), .names = 'Mean{.col}'),
          .groups = 'drop')

# A tibble: 6 x 6
#  Species    InGroup MeanSepal.Length MeanSepal.Width MeanPetal.Length MeanPetal.Width
#  <fct>      <chr>              <dbl>           <dbl>            <dbl>           <dbl>
#1 setosa     Yes                 5.01            3.43             1.46           0.246
#2 setosa     No                  6.26            2.87             4.91           1.68 
#3 versicolor Yes                 5.94            2.77             4.26           1.33 
#4 versicolor No                  5.80            3.20             3.51           1.14 
#5 virginica  Yes                 6.59            2.97             5.55           2.03 
#6 virginica  No                  5.47            3.10             2.86           0.786

f1 <- function(dat, grp) {
    grp_str <- rlang::as_string(rlang::ensym(grp))
    dat %>%
       group_by({{grp}}) %>%
       summarise(InGroup = c('Yes', 'No'),
            across(where(is.numeric), ~ c(mean(.),
                 mean(dat[[cur_column()]][
                      as.numeric(dat[[grp_str]]) != cur_group_id()])),
                   .names = 'Mean{.col}'), .groups = 'drop')
    }

f1(iris, Species)
# A tibble: 6 x 6
#  Species    InGroup MeanSepal.Length MeanSepal.Width MeanPetal.Length MeanPetal.Width
#  <fct>      <chr>              <dbl>           <dbl>            <dbl>           <dbl>
#1 setosa     Yes                 5.01            3.43             1.46           0.246
#2 setosa     No                  6.26            2.87             4.91           1.68 
#3 versicolor Yes                 5.94            2.77             4.26           1.33 
#4 versicolor No                  5.80            3.20             3.51           1.14 
#5 virginica  Yes                 6.59            2.97             5.55           2.03 
#6 virginica  No                  5.47            3.10             2.86           0.786

f1(diamonds, cut)
# A tibble: 10 x 9
#   cut       InGroup Meancarat Meandepth Meantable Meanprice Meanx Meany Meanz
#   <ord>     <chr>       <dbl>     <dbl>     <dbl>     <dbl> <dbl> <dbl> <dbl>
# 1 Fair      Yes         1.05       64.0      59.1     4359.  6.25  6.18  3.98
# 2 Fair      No          0.790      61.7      57.4     3920.  5.72  5.72  3.53
# 3 Good      Yes         0.849      62.4      58.7     3929.  5.84  5.85  3.64
# 4 Good      No          0.793      61.7      57.3     3933.  5.72  5.72  3.53
# 5 Very Good Yes         0.806      61.8      58.0     3982.  5.74  5.77  3.56
# 6 Very Good No          0.796      61.7      57.3     3919.  5.73  5.72  3.53
# 7 Premium   Yes         0.892      61.3      58.7     4584.  5.97  5.94  3.65
# 8 Premium   No          0.766      61.9      57.0     3709.  5.65  5.66  3.50
# 9 Ideal     Yes         0.703      61.7      56.0     3458.  5.51  5.52  3.40
#10 Ideal     No          0.861      61.8      58.5     4249.  5.88  5.88  3.63

iris %>% 
    group_by(Species) %>% 
    summarise(InGroup = c('Yes', 'No'), across(where(is.numeric), 
       ~  c(mean(.), (sum(iris[[cur_column()]]) - 
           sum(.))/(nrow(iris) - n())), .names = 'Mean{.col}'), .groups = 'drop')