R data.table: apply a function by group, referencing a set value within each group, then pass the resulting column to a function

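I have long-format data grouped by geography. I want to calculate the difference between one variable of interest and every other variable of interest within each group. I couldn't work out how to do this efficiently in a single data.table statement, so I wrote a workaround, which introduced some new errors along the way (I fixed those with more workarounds, but would greatly appreciate help here too!).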

I then want to pass the resulting column to a ggplot function, but I couldn't get the recommended approach to work, so I am using the deprecated one.

library(data.table)
library(ggplot2)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16) )

> results[c(1:4,13:16)]
   geography variable   statistic
1:         1    alpha -0.62645381
2:         1    bravo  0.18364332
3:         1  charlie -0.83562861
4:         1    delta  1.59528080
5:         4    alpha -0.62124058
6:         4    bravo -2.21469989
7:         4  charlie  1.12493092
8:         4    delta -0.04493361

base_variable <- "alpha"

One option is to group by `geography` and subset `statistic` with a logical condition on `variable` that picks out the `base_variable` element:

results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
       by = geography][variable != base_variable]
# geography variable       diff
# 1:         1    bravo  0.8100971
# 2:         1  charlie -0.2091748
# 3:         1    delta  2.2217346
# 4:         2    bravo -1.1499762
# 5:         2  charlie  0.1579213
# 6:         2    delta  0.4088169
# 7:         3    bravo -0.8811697
# 8:         3  charlie  0.9359998
# 9:         3    delta -0.1859381
#10:         4    bravo -1.5934593
#11:         4  charlie  1.7461715
#12:         4    delta  0.5763070

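This kind of thing can also be done with a join. In my experience, the "subset + group" approach is usually faster for smaller tables (like this example), while the join approach is faster when there are millions of rows.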

results[variable != base_variable
        ][results[variable == base_variable], on = 'geography',
          diff := statistic - i.statistic][]

#     geography variable   statistic       diff
#  1:         1    bravo  0.18364332  0.8100971
#  2:         1  charlie -0.83562861 -0.2091748
#  3:         1    delta  1.59528080  2.2217346
#  4:         2    bravo -0.82046838 -1.1499762
#  5:         2  charlie  0.48742905  0.1579213
#  6:         2    delta  0.73832471  0.4088169
#  7:         3    bravo -0.30538839 -0.8811697
#  8:         3  charlie  1.51178117  0.9359998
#  9:         3    delta  0.38984324 -0.1859381
# 10:         4    bravo -2.21469989 -1.5934593
# 11:         4  charlie  1.12493092  1.7461715
# 12:         4    delta -0.04493361  0.5763070
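
As a quick sanity check, the two approaches above can be compared directly. This is a minimal sketch that rebuilds the question's example data; the names `by_group` and `by_join` are made up for illustration. It assumes each geography has exactly one `base_variable` row, which holds for this data.

```r
library(data.table)

# Rebuild the example data from the question
set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16))
base_variable <- "alpha"

# Approach 1: subset within each group
by_group <- results[, .(variable, diff = statistic - statistic[variable == base_variable]),
                    by = geography][variable != base_variable]

# Approach 2: join the non-base rows to the base rows on geography;
# the i. prefix refers to the column from the table passed in i
by_join <- results[variable != base_variable
                   ][results[variable == base_variable], on = "geography",
                     diff := statistic - i.statistic][]

# Both tables keep the same row order, so the diff columns can be compared directly
all.equal(by_group$diff, by_join$diff)
```

The join version also keeps the original `statistic` column, while the grouped version returns only what is listed in `j`.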
Two benchmarks:

library(microbenchmark)
microbenchmark(
use_group = 
  results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
           by = geography][variable != base_variable],
use_join = 
results[variable != base_variable
        ][results[variable == base_variable], on = 'geography',
          diff := statistic - i.statistic][],
times = 10
)

# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval cld
#  use_group 1.624204 1.801434 2.143670 2.212306 2.391793 2.654357    10  a 
#   use_join 6.297842 6.808610 7.626004 7.729634 8.337635 8.708916    10   b

results <- results[rep(1:.N, 1e4)][, geography := rleid(geography)]

microbenchmark(
use_group = 
  results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
           by = geography][variable != base_variable],
use_join = 
results[variable != base_variable
        ][results[variable == base_variable], on = 'geography',
          diff := statistic - i.statistic][],
times = 10
)


# Unit: milliseconds
#       expr      min        lq      mean    median        uq      max neval cld
#  use_group 97.42187 106.80935 132.42537 120.64893 143.03045 208.1996    10   b
#   use_join 19.88511  21.86214  26.22012  25.82972  29.29885  36.0853    10  a 

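Wow, you did in a few minutes what took me more than an hour. Subsetting the statistic by another column within each group makes a lot of sense (and is better than the lag/shift solution I had in mind). Thanks =)
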
plott <- function(dataset, varx, vary, fillby) {
  # varx <- ensym(varx)
  # vary <- ensym(vary)
  # fillby <- ensym(fillby)
  ggplot(dataset, 
         aes_string(x = varx, y = vary, color = fillby)) + 
    geom_point()
}

plott(dataset = final_result,
      varx = "geography",
      vary = "value",
      fillby = "variable")

# Error I get when I try the ensym(...) method in the function:
Don't know how to automatically pick scale for object of type name. Defaulting to continuous. (this message happens 3 times)
Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = varx, y = vary, colour = fillby. 
Did you mistype the name of a data column or forget to add stat()?
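
The error above is consistent with the captured symbols reaching `aes()` without being unquoted. The following is a sketch of one way the `ensym()` route can be made to work, by unquoting each symbol with `!!` inside `aes()`. Since `final_result` and its `value` column are not defined in the excerpt, the sketch plots a hypothetical table `diffs` built with the grouped approach, using its `diff` column instead.

```r
library(data.table)
library(ggplot2)

# Rebuild the example data and compute the per-group differences
set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16))
base_variable <- "alpha"
diffs <- results[, .(variable, diff = statistic - statistic[variable == base_variable]),
                 by = geography][variable != base_variable]

plott <- function(dataset, varx, vary, fillby) {
  # ensym() turns the supplied string (or bare name) into a symbol
  varx   <- ensym(varx)
  vary   <- ensym(vary)
  fillby <- ensym(fillby)
  # !! unquotes each symbol so aes() sees the column name, not "varx"
  ggplot(dataset, aes(x = !!varx, y = !!vary, color = !!fillby)) +
    geom_point()
}

p <- plott(dataset = diffs,
           varx = "geography",
           vary = "diff",
           fillby = "variable")
```

Called this way, `plott()` accepts either quoted or bare column names. If the column names arrive in variables rather than as literals, mapping with `.data[[varx]]` inside `aes()` is an alternative worth considering.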