R: How to compute a per-day rolling quantile on intraday data with data.table
I want to compute a rolling quantile with data.table. The table holds data for several groups; each group spans several days, and each day has several observations. I don't want a rolling quantile at every observation in the table. Instead, for each day I want to take the last, say, 3 days of data, compute one quantile, and move on.

I have data like this:
library(data.table)
library(zoo)  # provides rollapply()

# 2 groups x 10 days x 10 observations per day = 200 rows
test2 <- data.table(group = rep(c("a", "b"), each = 100),
                    date = rep(rep(seq(from = as.Date('2017-01-01'),
                                       as.Date('2017-01-10'),
                                       by = "day"), each = 10), 2),
                    time = rep(rep(seq(from = 1, 10, by = 1), times = 10), 2),
                    some_data = rnorm(200) + c(1:20, 20:1, 30:1, 1:30, 30:1, 1:20, 20:1, 1:30))
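
In theory, I want something like the following, keeping only the last observation of each day:

# 30 observations = 3 days x 10 observations per day
tests_result <- test2[, list(date = date,
                             q_30 = rollapply(some_data,
                                              30, quantile,
                                              probs = 0.3,
                                              fill = NA, align = "right")),
                      by = "group"][seq(from = 10, to = 200, by = 10)]  # every 10th row = last observation of each day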

    group       date      q_30
 1:     a 2017-01-01        NA
 2:     a 2017-01-02        NA
 3:     a 2017-01-03 10.284081
 4:     a 2017-01-04  8.281827
 5:     a 2017-01-05  8.281827
 6:     a 2017-01-06  8.281827
 7:     a 2017-01-07 10.274793
 8:     a 2017-01-08  4.749455
 9:     a 2017-01-09  4.749455
10:     a 2017-01-10  9.246267
11:     b 2017-01-01        NA
12:     b 2017-01-02        NA
13:     b 2017-01-03 10.145996
14:     b 2017-01-04  5.423782
15:     b 2017-01-05  5.423782
16:     b 2017-01-06  9.741683
17:     b 2017-01-07 10.123940
18:     b 2017-01-08  4.347293
19:     b 2017-01-09  4.347293
20:     b 2017-01-10  9.177718
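
The row-index selection above works only because every day has exactly 10 observations. As a rough sketch of the same idea windowed by date instead, the following computes one quantile per group and day over the last 3 observed days. The helper name days_quantile and its arguments n_days and prob are hypothetical, chosen for illustration; it is meant as a small-data illustration rather than an efficient solution.

```r
library(data.table)

# Sample data with the same layout as in the question
test2 <- data.table(group = rep(c("a", "b"), each = 100),
                    date = rep(rep(seq(from = as.Date("2017-01-01"),
                                       to = as.Date("2017-01-10"),
                                       by = "day"), each = 10), 2),
                    time = rep(rep(1:10, times = 10), 2),
                    some_data = rnorm(200) + c(1:20, 20:1, 30:1, 1:30, 30:1, 1:20, 20:1, 1:30))

# Hypothetical helper (an assumption, not from the post): for each group and
# each observed day, pool all observations from that day and the n_days - 1
# previously observed days, then take one quantile of the pooled values.
days_quantile <- function(dt, n_days = 3L, prob = 0.3) {
  dt[, {
    days <- sort(unique(date))
    q <- vapply(seq_along(days), function(i) {
      # Not enough history yet: mirror fill = NA from the rollapply version
      if (i < n_days) return(NA_real_)
      in_window <- date %in% days[(i - n_days + 1L):i]
      unname(quantile(some_data[in_window], probs = prob))
    }, numeric(1))
    list(date = days, q_30 = q)
  }, by = group]
}

res <- days_quantile(test2)
print(res)
```

Because the window is defined by dates rather than by a fixed row count, the number of observations per day may differ between days. On the sample data it should agree with the rollapply result, since each 3-day window there contains exactly 30 rows.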
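
To summarize the challenge: for every group and every date I want a single number, namely the quantile computed from that day's observations plus those of the preceding days, without computing a rolling quantile at every observation.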
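
I came up with an approach that copes with the size of the dataset. I think it can still be improved, though, so if you have any suggestions I would like to hear them.

My approach on the sample dataset is as follows.

First, count the total number of observations in each run of 3 consecutive days, and at the same time record the row number, in the original dataset, of the last observation of each day. These new variables are called in_3 and orig_row:
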
test3 <- test2[, list(.N, orig_row = .I[.N]), by = c("group", "date")][
  , list(date,
         in_3 = rollapply(N, 3, sum, fill = NA, align = "right"),
         orig_row),
  by = "group"]
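
The post does not show how the quantiles vector used in the final step is built. A minimal sketch of one way it could be done, assuming test2 is sorted by group and date so that the in_3 rows ending at orig_row are exactly the 3-day window; window_quantile is a hypothetical helper, not the author's code:

```r
library(data.table)
library(zoo)

# Sample data and per-day summary, as in the post
test2 <- data.table(group = rep(c("a", "b"), each = 100),
                    date = rep(rep(seq(from = as.Date("2017-01-01"),
                                       to = as.Date("2017-01-10"),
                                       by = "day"), each = 10), 2),
                    time = rep(rep(1:10, times = 10), 2),
                    some_data = rnorm(200) + c(1:20, 20:1, 30:1, 1:30, 30:1, 1:20, 20:1, 1:30))

test3 <- test2[, list(.N, orig_row = .I[.N]), by = c("group", "date")][
  , list(date,
         in_3 = rollapply(N, 3, sum, fill = NA, align = "right"),
         orig_row),
  by = "group"]

# Hypothetical helper (an assumption): quantile of the n_obs values of x that
# end at row end_row; NA when the window is not yet complete.
window_quantile <- function(end_row, n_obs, x, prob = 0.3) {
  if (is.na(n_obs)) return(NA_real_)
  unname(quantile(x[(end_row - n_obs + 1L):end_row], probs = prob))
}

# One quantile per (group, date) row of test3
quantiles <- mapply(window_quantile, test3$orig_row, test3$in_3,
                    MoreArgs = list(x = test2$some_data))
```

This evaluates quantile() once per day rather than once per observation, which is the point of aggregating first. The first two days of each group have in_3 = NA, so a window never reaches back into the previous group.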

Finally, assign the quantiles to the aggregated dataset:

test3[, `:=`(q03 = quantiles)]
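
I also tried running this in parallel, but my laptop then ran out of physical memory and started swapping heavily to disk, which was much slower than running on a single core.

Comments:

What is your expected output? What do you mean by a rolling quantile over a specific number of days? That is not what your "theoretical" code does.

That's why, after the code, I said "and then take the last observation of each day". I want one number for every date in the dataset: the quantile computed from that day's observations plus the observations from the previous days.

Ah, you are right. The code I provided as an example of what I want does not actually do quite what I have in mind. An adjustment is coming in a moment.

@mtoto My bad, only now is the expected result correct.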