R data.table: subsetting within groups is slow when grouping


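I am trying to generate several aggregate statistics, some of which need to be computed on a subset of each group. The data.table is fairly large, 10 million rows, but grouping with by and no column subsetting takes about one second. Adding just one extra column that has to be computed on a subset of each group increases the runtime by a factor of 12. Is there a faster way to do this? My full code is below.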

library(data.table)
library(microbenchmark)

N = 10^7

DT = data.table(id1 = sample(1:400, size = N, replace = TRUE),
                id2 = sample(1:100, size = N, replace = TRUE),
                id3 = sample(1:50, size = N, replace = TRUE),
                filter_var = sample(1:10, size = N, replace = TRUE),
                x1 = sample(1:1000, size = N, replace = TRUE),
                x2 = sample(1:1000, size = N, replace = TRUE),
                x3 = sample(1:1000, size = N, replace = TRUE),
                x4 = sample(1:1000, size = N, replace = TRUE),
                x5 = sample(1:1000, size = N, replace = TRUE) )

setkey(DT, id1,id2,id3)

microbenchmark( 
  DT[, .(
    sum_x1 = sum(x1),
    sum_x2 = sum(x2),
    sum_x3 = sum(x3),
    sum_x4 = sum(x4),
    sum_x5 = sum(x5),
    avg_x1 = mean(x1),
    avg_x2 = mean(x2),
    avg_x3 = mean(x3),
    avg_x4 = mean(x4),
    avg_x5 = mean(x5)
  ) , by = c('id1','id2','id3')]  , unit = 's', times = 10L)
      min        lq     mean    median       uq      max neval
 0.942013 0.9566891 1.004134 0.9884895 1.031334 1.165144    10


microbenchmark(    DT[, .(
  sum_x1 = sum(x1),
  sum_x2 = sum(x2),
  sum_x3 = sum(x3),
  sum_x4 = sum(x4),
  sum_x5 = sum(x5),
  avg_x1 = mean(x1),
  avg_x2 = mean(x2),
  avg_x3 = mean(x3),
  avg_x4 = mean(x4),
  avg_x5 = mean(x5),
  sum_x1_F1 = sum(x1[filter_var < 5]) #this line slows everything down
) , by = c('id1','id2','id3')]  , unit = 's', times = 10L)

      min      lq     mean   median       uq      max neval
 12.24046 12.4123 12.83447 12.72026 13.49059 13.61248    10

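GForce makes grouped operations run faster. It applies to expressions of the form list(x = funx(X), y = funy(Y), ...), where X and Y are column names and funx and funy belong to the set of optimized functions.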

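For a full description of how it works, see ?GForce. To test whether an expression is optimized, read the messages from DT[, expr, by=, verbose=TRUE]. In OP's example we have sum_x1_F1 = sum(x1[filter_var < 5]), which is not covered by GForce even though sum(v) is. In this special case, we can build a variable v = x1 * condition and sum that: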

DT[, v := x1*(filter_var < 5)]

system.time(    DT[, .(
  sum_x1 = sum(x1),
  sum_x2 = sum(x2),
  sum_x3 = sum(x3),
  sum_x4 = sum(x4),
  sum_x5 = sum(x5),
  avg_x1 = mean(x1),
  avg_x2 = mean(x2),
  avg_x3 = mean(x3),
  avg_x4 = mean(x4),
  avg_x5 = mean(x5),
  sum_x1_F1 = sum(v)
) , by = c('id1','id2','id3')])
#    user  system elapsed 
#    0.63    0.19    0.81 
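
As a sanity check on the trick above, here is a minimal sketch on a small made-up table (the table `small` and its column `g` are assumptions for illustration, not part of the question). It compares the subset-inside-j form with the precomputed masked column:

library(data.table)

set.seed(42)
# Hypothetical small table, used only to check that both forms agree
small <- data.table(
  g          = sample(1:3, 30, replace = TRUE),
  x1         = sample(1:1000, 30, replace = TRUE),
  filter_var = sample(1:10, 30, replace = TRUE)
)

# Subset inside j (the slow form from the question)
res_subset <- small[, .(s = sum(x1[filter_var < 5])), keyby = g]

# Precompute the masked column, then take a plain sum (the form from the answer)
small[, v := x1 * (filter_var < 5)]
res_masked <- small[, .(s = sum(v)), keyby = g]

# Compare the two results
all.equal(res_subset, res_masked)

The trick relies on zeros not changing a sum, so it would not carry over unchanged to something like mean(x1[filter_var < 5]).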
For comparison, here is the timing of OP's code on my computer:

system.time(    DT[, .(
  sum_x1 = sum(x1),
  sum_x2 = sum(x2),
  sum_x3 = sum(x3),
  sum_x4 = sum(x4),
  sum_x5 = sum(x5),
  avg_x1 = mean(x1),
  avg_x2 = mean(x2),
  avg_x3 = mean(x3),
  avg_x4 = mean(x4),
  avg_x5 = mean(x5),
  sum_x1_F1 = sum(x1[filter_var < 5]) #this line slows everything down
) , by = c('id1','id2','id3')])
#    user  system elapsed 
#    9.00    0.02    9.06 
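
The verbose = TRUE check mentioned above can be tried on a small table before running the full 10-million-row job. This is only a sketch: the table `small` and column `g` are made up for illustration, and the exact wording of the verbose messages depends on the installed data.table version, so none is reproduced here.

library(data.table)

set.seed(1)
# Hypothetical small table, used only to inspect the verbose messages
small <- data.table(
  g          = sample(1:3, 20, replace = TRUE),
  x1         = sample(1:10, 20, replace = TRUE),
  filter_var = sample(1:10, 20, replace = TRUE)
)

# Plain sum of a column: look in the messages for a note about GForce
res_fast <- small[, .(s = sum(x1)), by = g, verbose = TRUE]

# Subset inside sum(): compare the messages printed here with the ones above
res_slow <- small[, .(s = sum(x1[filter_var < 5])), by = g, verbose = TRUE]

Comparing the two sets of messages shows whether j was rewritten to the optimized form for each expression.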


Try adding verbose=TRUE and reading ?GForce. If you have to do this calculation, you could first create v := x1*(filter_var < 5) and then sum that.

@Frank great suggestion, you should make it an answer. My code now runs in one second instead of 12. I did not realize that GForce gets turned off when subsetting. Has it always been this way? I have not used data.table since 2016, and I seem to recall it running as fast as expected in this situation, but I may be wrong.

Ok, cool, done. Yes, I think GForce has never covered this usage. By the way, they have a benchmark in the works, if you are interested; I think it will appear in the next CRAN release.