R/dplyr: Using a loop to create lags and calculate cumulative sums based on column names
I want to loop over a long list of columns in a largish data frame and calculate the cumulative sum of each column's lagged values. In other words, I am calculating how much had been "done" before each observation. Here is a toy data frame to help make this clearer:
id = c("a", "a", "a", "b", "b")
date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by="days")
v1 = sample(seq(1, 20), 5)
v2 = sample(seq(1, 20), 5)
df = data.frame(id, date, v1, v2)
I would like it to look like this:
id date v1 v2 v1Cum v2Cum
a 2015-12-01 1 13 0 0
a 2015-12-02 7 11 1 13
a 2015-12-03 12 2 8 24
b 2015-12-04 18 6 0 0
b 2015-12-05 4 9 18 6
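So it is not the cumulative sum of v1 or v2 within each id group, but the cumulative sum of the lagged values within each id.
I can do this for a single column without any problem, but I can't seem to generalise it with a loop: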
vars = c("v1", "v2")
for (var in vars) {
lagname = paste(var, "Lag", sep="")
cumname = paste(var, "Cum", sep="")
df = arrange(df, id, date)
df = df %>%
group_by(id) %>%
mutate(!!lagname := dplyr::lag(var, n = 1, default = NA))
df[[lagname]] = ifelse(is.na(df[[lagname]]), 0, df[[lagname]])
df = df %>% group_by(id) %>% arrange(date) %>% mutate(!!cumname := cumsum(!!lagname))
}
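As far as I can tell, the problems are:
- the lag variable evaluates only to NA (or to 0 after the ifelse()). I know I haven't quite got the mutate() call right.
- the cumulative sum evaluates to NA
Any ideas? Thanks for your help! (I am trying to get back into coding after a few years off. My main "language" is Stata, though, so I suspect my approach is a bit odd. Happy to revise it completely!)
If I understand correctly, the following should work on reproducible sample data with three variables to sum.
Group by id, arrange by date (in case the rows are not in order), and mutate all variables between two named variables (v1:v3 in this example):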
df %>%
  group_by(id) %>%
  arrange(date) %>%
  mutate_at(vars(v1:v3), funs(Cum = cumsum(lag(., default = 0)))) %>%
  ungroup()
# A tibble: 5 x 8
# Groups: id [2]
  id    date          v1    v2    v3 v1_Cum v2_Cum v3_Cum
1 a     2015-12-01     6     1    20      0      0      0
2 a     2015-12-02    15    11     9      6      1     20
3 a     2015-12-03     8    17    13     21     12     29
4 b     2015-12-04    16    10    10      0      0      0
5 b     2015-12-05    17     8     2     16     10     10
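The answer above relies on mutate_at() and funs(); funs() has been deprecated since dplyr 0.8.0. As a minimal sketch, assuming dplyr 1.0.0 or later is installed, the same idea can be written with across(). The data values below are copied from the question's desired output rather than sampled, and the name res is only illustrative.

```r
library(dplyr)

# Toy data: values taken from the question's desired output (not sampled)
df <- data.frame(
  id   = c("a", "a", "a", "b", "b"),
  date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by = "days"),
  v1   = c(1, 7, 12, 18, 4),
  v2   = c(13, 11, 2, 6, 9)
)

res <- df %>%
  arrange(id, date) %>%   # make sure rows are ordered within each id
  group_by(id) %>%
  # lag each column by one row (filling the first row with 0), then cumulate;
  # .names controls the new column names, here v1Cum and v2Cum
  mutate(across(v1:v2, ~ cumsum(lag(.x, default = 0)), .names = "{.col}Cum")) %>%
  ungroup()

print(res)
```

The column selection v1:v2 can be swapped for all_of(var_list) if the column names are held in a character vector.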
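I used a similar approach to Z.Lin's.
There is one more thing you need to know: because dplyr uses non-standard evaluation, you need syntax such as UQ(rlang::sym(cumname)) to turn a character string into an expression that dplyr can evaluate.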
library(dplyr)
id = c("a", "a", "a", "b", "b")
date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by="days")
set.seed(1)
v1 = sample(seq(1, 20), 5)
set.seed(2)
v2 = sample(seq(1, 20), 5)
df = data.frame(id, date, v1, v2)
var_list <- c("v1","v2")
cumname <- "Cum"
df %>%
group_by(id) %>%
mutate_at(vars(one_of(var_list)),
funs(UQ(rlang::sym(cumname)) := cumsum(lag(.,default = 0)))) %>%
ungroup()
Here is a solution using data.table:
id <- c("a", "a", "a", "b", "b")
date <- seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by="days")
v1 <- sample(seq(1, 20), 5)
v2 <- sample(seq(1, 20), 5)
df <- data.frame(id, date, v1, v2)
df
id date v1 v2
1 a 2015-12-01 19 9
2 a 2015-12-02 3 17
3 a 2015-12-03 7 14
4 b 2015-12-04 10 15
5 b 2015-12-05 8 11
library(data.table)
tab <- as.data.table(df)[, (c("v1Cum", "v2Cum")) := lapply(.SD, function(x) {
# Shift v1 and v2.
xs <- shift(x)
# Cumulate those values, making an allowance for <NA> values created by the
# shift function.
cumsum(ifelse(is.na(xs), 0, xs))
}), by = id, .SDcols = c("v1", "v2")]
tab[]
id date v1 v2 v1Cum v2Cum
1: a 2015-12-01 19 9 0 0
2: a 2015-12-02 3 17 19 9
3: a 2015-12-03 7 14 22 26
4: b 2015-12-04 10 15 0 0
5: b 2015-12-05 8 11 10 15
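A slightly shorter variant is possible because shift() accepts a fill argument, which makes the ifelse() step unnecessary. This is a sketch rather than a replacement for the answer above: the values are hard-coded from the printed output, and the names dt, vars and cum_names are only illustrative.

```r
library(data.table)

# Values copied from the printed data above (hard-coded instead of sampled)
dt <- data.table(
  id   = c("a", "a", "a", "b", "b"),
  date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by = "days"),
  v1   = c(19, 3, 7, 10, 8),
  v2   = c(9, 17, 14, 15, 11)
)

vars      <- c("v1", "v2")
cum_names <- paste0(vars, "Cum")

setorder(dt, id, date)  # sort by id, then date, before lagging

# shift(x, fill = 0) lags by one row within each id and pads with 0, not NA
dt[, (cum_names) := lapply(.SD, function(x) cumsum(shift(x, n = 1, fill = 0))),
   by = id, .SDcols = vars]

print(dt)
```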
Consider a simple base R approach with ave:
set.seed(22)
id = c("a", "a", "a", "b", "b")
date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by="days")
v1 = sample(seq(1, 20), 5)
v2 = sample(seq(1, 20), 5)
df = data.frame(id, date, v1, v2)
for (col in c("v1", "v2")) {
# head(x, -1) drops the last value and also works for groups with a single row
df[[paste0(col, "_cum")]] <- ave(df[[col]], df$id, FUN=function(x)
cumsum(c(0, head(x, -1))))
}
print(df)
# id date v1 v2 v1_cum v2_cum
# a 2015-12-01 7 15 0 0
# a 2015-12-02 10 12 7 15
# a 2015-12-03 18 14 17 27
# b 2015-12-04 9 8 0 0
# b 2015-12-05 14 6 9 8
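The ave() loop assumes the rows are already in date order within each id. If that is not guaranteed, one option is to sort first. The sketch below uses deliberately shuffled, hard-coded rows; lag_cumsum is a hypothetical helper name introduced only for this example.

```r
# Deliberately unsorted toy data (hard-coded, illustrative values)
df <- data.frame(
  id   = c("b", "a", "a", "b", "a"),
  date = as.Date(c("2015-12-05", "2015-12-02", "2015-12-01",
                   "2015-12-04", "2015-12-03")),
  v1   = c(4, 7, 1, 18, 12)
)

# Sort by id and date so "previous row" means "previous date"
df <- df[order(df$id, df$date), ]

# Hypothetical helper: cumulative sum of everything before each element
lag_cumsum <- function(x) cumsum(c(0, head(x, -1)))

df$v1_cum <- ave(df$v1, df$id, FUN = lag_cumsum)
print(df)
```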
You can simply use !!: !!cumname := …
Oh, I didn't know that before. That's more convenient, thanks!
Ah, that makes more sense. Thanks for your help!
df %>%
group_by(id) %>%
mutate_at(vars(one_of(var_list)),
funs(!!cumname := cumsum(lag(.,default = 0)))) %>%
ungroup()
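The same unquoting idea can also repair the loop from the question. There, dplyr::lag(var, ...) lags the one-element character string "v1" rather than the column it names, and cumsum(!!lagname) likewise receives a string, which is why both results come out as NA. A minimal sketch of a corrected loop, assuming a dplyr version that provides the .data pronoun (0.7.0 or later), with hard-coded values taken from the question's desired output:

```r
library(dplyr)

# Values taken from the question's desired output (hard-coded, not sampled)
df <- data.frame(
  id   = c("a", "a", "a", "b", "b"),
  date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by = "days"),
  v1   = c(1, 7, 12, 18, 4),
  v2   = c(13, 11, 2, 6, 9)
)

vars <- c("v1", "v2")

# Sort and group once, outside the loop
df <- df %>% arrange(id, date) %>% group_by(id)

for (var in vars) {
  cumname <- paste0(var, "Cum")
  # .data[[var]] looks up the column named by the string in var;
  # !!cumname := uses the string in cumname as the new column's name
  df <- df %>%
    mutate(!!cumname := cumsum(lag(.data[[var]], default = 0)))
}

df <- ungroup(df)
print(df)
```

Using default = 0 in lag() removes the need for the intermediate Lag column and the ifelse() step.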