How to append an aggregated vector over keyed data as a new column in R

r dataframe

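My R question is as follows: I have a data.frame that comes from an SQL database (say, security cash flows: "cf_table"). The primary key consists of three columns, followed by a value column:

    security_id, quote_date, future_cf_date, (and 'x')

On the fourth column ("x") I want to compute a function that returns a vector rather than a single value. In my example it is

    rev(cumsum(rev(x)))

that is, a backward cumulative sum, grouped by the first two columns. In other words: "On a given quote date, what is the backward cumulative sum of a security's future cash flows?" "x" is sparse, mostly NAs. How can I accomplish this? I tried dplyr, data.table, and so on, without success. My goal is to append this new column to the original table.

For reproducibility, see the sample data at the end of my post.

Any ideas? (By the way, is rev(cumsum(rev(x))) efficient or elegant?)

Sample data: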

cf_table <- structure(list(security_id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("a", "b"), class = "factor"), quote_date = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("2014.05.13", "2015.04.13", "2015.04.14", "2015.04.15"), class = "factor"), CF.Dátum = structure(c(3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 1L, 2L, 4L, 1L, 2L, 4L, 1L, 2L, 4L), .Label = c("2014.12.22", "2015.06.22", "2015.06.24", "2015.12.22", "2016.06.24", "2017.06.26", "2018.06.25", "2019.06.24", "2020.06.24", "2021.06.24", "2022.06.24" ), class = "factor"), future_cf_date = c(NA, NA, NA, NA, NA, 2000L, NA, 10000L, NA, NA, NA, NA, NA, 2000L, NA, 10000L, NA, NA, NA, NA, NA, NA, NA, 10000L, NA, 500L, 10000L, NA, NA, 10000L, NA, NA, 10000L), My.desired.output = c(12000L, 12000L, 12000L, 12000L, 12000L, 12000L, 10000L, 10000L, 12000L, 12000L, 12000L, 12000L, 12000L, 12000L, 10000L, 10000L, 10000L, 10000L, 10000L, 10000L, 10000L, 10000L, 10000L, 10000L, 10500L, 10500L, 10000L, 10000L, 10000L, 10000L, 10000L, 10000L, 10000L )), .Names = c("security_id", "quote_date", "future_cf_date", "x", "My.desired.output"), class = "data.frame", row.names = c(NA, -33L))
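As a small illustration of why the NAs matter for the expression in the question, the following sketch applies rev(cumsum(rev(x))) to a single hypothetical four-element vector (not taken from cf_table), once as-is and once with NA replaced by zero. The names `naive` and `filled` are made up for this example.

```r
# Hypothetical toy vector: sparse cash flows with NAs, as in the question.
x <- c(NA, 2000, NA, 10000)

# Naive backward cumulative sum: cumsum() propagates NA, so every position
# before the last NA (in original order) ends up as NA.
naive <- rev(cumsum(rev(x)))

# Treating NA as "no cash flow" (zero) before accumulating avoids that.
filled <- rev(cumsum(rev(replace(x, is.na(x), 0))))

print(naive)   # NA NA NA 10000
print(filled)  # 12000 12000 10000 10000
```

Whether zero is the right stand-in for NA depends on the data; here it is assumed that a missing value means no cash flow on that date.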

You can use the Reduce function and accumulate from the right end of the vector x, which acts like a backward cumsum:

    library(dplyr)
    cf_table_reduce = function() cf_table %>%
      group_by(security_id, quote_date) %>%
      mutate(back_sum = Reduce(function(i, j) sum(i, j, na.rm = TRUE), x,
                               right = TRUE, accumulate = TRUE))

Another option, keeping rev(cumsum(rev(x))), is to replace the NA values in x with zero first, because cumsum propagates NA (every element after the first NA becomes NA):

    cf_table_rev = function() cf_table %>%
      group_by(security_id, quote_date) %>%
      mutate(x = replace(x, is.na(x), 0),
             back_sum = rev(cumsum(rev(x))))

As for speed, the two approaches appear to be very close:

    library(microbenchmark)
    microbenchmark(cf_table_rev(), cf_table_reduce())
    # Unit: milliseconds
    #               expr      min       lq     mean   median       uq      max neval
    #     cf_table_rev() 212.2586 225.9167 332.3184 410.3508 431.9465 452.0192   100
    #  cf_table_reduce() 211.2370 225.0572 331.7268 412.5145 432.1195 453.0889   100

The dimensions of the data I used for the comparison were:

    dim(cf_table)
    # [1] 2162688       5

We can use ave from base R, without any packages:

    with(cf_table, ave(replace(x, is.na(x), 0), security_id, quote_date,
                       FUN = function(x) rev(cumsum(rev(x)))))
    #[1] 12000 12000 12000 12000 12000 12000 10000 10000 12000 12000 12000 12000 12000 12000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000
    #[25] 10500 10500 10000 10000 10000 10000 10000 10000 10000

In data.table, the second approach does not need the revs. Not sure whether it is fast:

    library(data.table)
    setDT(cf_table)[.N:1, v := cumsum(replace(x, is.na(x), 0)), by = .(security_id, quote_date)]
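The grouped approaches above can be tried on a few rows without any packages. The sketch below uses hypothetical toy data (not the question's cf_table) and a made-up helper name, `back_cumsum`, to append the result as a new column with ave. It also shows one case where the Reduce variant appears to differ: accumulating from the right starts from the last element unchanged, so a trailing NA stays NA rather than becoming 0.

```r
# Hypothetical toy data, not the question's cf_table.
toy <- data.frame(
  security_id = c("a", "a", "a", "a", "b", "b", "b"),
  quote_date  = rep("2015.04.13", 7),
  x           = c(NA, 2000, NA, 10000, 500, NA, NA)
)

# Backward cumulative sum with NA treated as zero (illustrative helper).
back_cumsum <- function(v) rev(cumsum(rev(replace(v, is.na(v), 0))))

# ave() applies the function within each group and keeps the original row
# order, so the result can be assigned directly as a new column.
toy$back_sum <- ave(toy$x, toy$security_id, toy$quote_date, FUN = back_cumsum)

# Reduce() from the right uses the last element as the starting value without
# passing it through sum(..., na.rm = TRUE), so a trailing NA is preserved.
reduce_b <- Reduce(function(i, j) sum(i, j, na.rm = TRUE),
                   toy$x[toy$security_id == "b"],
                   accumulate = TRUE, right = TRUE)

print(toy)
print(reduce_b)  # 500 0 NA
```

On the question's sample data every group ends in a non-NA value, so this difference would not show up there. If trailing NAs can occur, replacing NA with zero before accumulating is one way to make both variants agree.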