R: Calculate the time elapsed since the last event
I have a data frame containing multiple subjects (id) with repeated observations, each recorded at a time point (time). Each time point may or may not be associated with an event (event). An example data frame can be generated as follows:
set.seed(12345)

# Three subjects, nine observations each, with a different spacing per subject
id <- c(rep(1, 9), rep(2, 9), rep(3, 9))
time <- c(seq(from = 0, to = 96, by = 12),
          seq(from = 0, to = 80, by = 10),
          seq(from = 0, to = 112, by = 14))

# Start with event = 100 everywhere, then set it to 0 where random < 0.55
random <- runif(n = 27)
event <- rep(100, 27)
df <- data.frame(cbind(id, time, event, random))
df$event <- ifelse(df$random < 0.55, 0, df$event)
df <- subset(df, select = -c(random))

# Force an event at time 0 for every subject
df$event <- ifelse(df$time == 0, 100, df$event)
In Fortran, I would create the tae variable (time after event) with the following code:
IF(EVENT.GT.0) THEN
TEVENT = TIME
TAE = 0
ENDIF
IF(EVENT.EQ.0) THEN
TAE = TIME - TEVENT
ENDIF
In R, I tried an ifelse solution and a dplyr solution. However, neither produces the output I want:
# Calculate the time since last event (using ifelse)
df$tae <- ifelse(df$event >= 0, df$tevent = df$time & df$tae = 0, df$tae = df$time - df$tevent)
Error: unexpected '=' in "df$tae <- ifelse(df$event >= 0, df$tevent ="
# Calculate the time since last event (using dplyr)
res <- df %>%
arrange(id, time) %>%
group_by(id) %>%
mutate(tae = time - lag(time))
res
id time event tae
1 1 0 100 NA
2 1 12 100 12
3 1 24 100 12
4 1 36 100 12
5 1 48 0 12
6 1 60 0 12
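For reference, the ifelse() attempt does not parse because ifelse() expects value-returning expressions as its arguments, so assignments written with = inside the call are rejected. The lag() attempt runs, but it only returns the gap to the previous row. The following is a minimal base R sketch of the Fortran logic, not a definitive implementation. It assumes rows are sorted by time within each id, and it uses a small made-up data frame rather than the data above. Rows that precede the first event of an id get NA.

```r
# Hypothetical example data with irregular spacing (not the asker's data)
df <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2),
  time  = c(0, 10, 22, 33, 0, 14, 28),
  event = c(100, 0, 0, 100, 0, 100, 0)
)

# Time of the most recent event so far within each id.
# Non-event rows contribute -Inf, so cummax() carries the last event time forward.
tevent <- ave(ifelse(df$event > 0, df$time, -Inf), df$id, FUN = cummax)

# Elapsed time since that event; rows before the first event become NA
df$tae <- df$time - tevent
df$tae[is.infinite(df$tae)] <- NA

print(df)
```

Because the difference is taken on the time column itself, the sketch does not depend on the observations being evenly spaced.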
You were very close with your dplyr implementation. Try this:
df %>%
arrange(id, time) %>%
group_by(id) %>%
mutate(tae = cumsum(event==0)*12)
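Note that cumsum(event == 0) * 12 counts every non-event row within an id and assumes a fixed spacing of 12, so it does not reset after a later event and does not fit irregular spacing. One possible dplyr variant is sketched below, assuming the dplyr package is installed. It numbers the runs that start at each event and measures time from the first row of each run. The data frame and the helper column run are made up for illustration.

```r
library(dplyr)

# Hypothetical example data with irregular spacing (not the asker's data)
df <- data.frame(
  id    = c(1, 1, 1, 1, 1, 2, 2, 2),
  time  = c(0, 10, 22, 33, 45, 0, 14, 28),
  event = c(100, 0, 0, 100, 0, 0, 100, 0)
)

res <- df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  # run increases by one at every event row; it is 0 before the first event
  mutate(run = cumsum(event > 0)) %>%
  group_by(id, run) %>%
  # within a run, the first row is the event itself, so time - first(time)
  # is the time since that event; rows before any event get NA
  mutate(tae = if_else(run == 0, NA_real_, time - first(time))) %>%
  ungroup() %>%
  select(-run)

print(res)
```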
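I can't think of a way to vectorize this right now, but a simple loop should be very fast (O(n)); see loopfun in the benchmark further down.
Here is an approach using dplyr: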
library(dplyr)
df %>%
mutate(tmpG = cumsum(c(FALSE, as.logical(diff(event))))) %>%
group_by(id) %>%
mutate(tmp_a = c(0, diff(time)) * !event,
tmp_b = c(diff(time), 0) * !event) %>%
group_by(tmpG) %>%
mutate(tae = cumsum(tmp_a),
tbe = rev(cumsum(rev(tmp_b)))) %>%
ungroup() %>%
select(-c(tmp_a, tmp_b, tmpG))
The new columns are the time after the event (tae) and the time before the event (tbe).
The result is:
id time event tae tbe
1 1 0 100 0 0
2 1 12 100 0 0
3 1 24 100 0 0
4 1 36 100 0 0
5 1 48 0 12 48
6 1 60 0 24 36
7 1 72 0 36 24
8 1 84 0 48 12
9 1 96 100 0 0
10 2 0 100 0 0
11 2 12 0 12 24
12 2 24 0 24 12
13 2 36 100 0 0
14 2 48 0 12 48
15 2 60 0 24 36
16 2 72 0 36 24
17 2 84 0 48 12
18 2 96 0 60 0
19 3 0 100 0 0
20 3 12 100 0 0
21 3 24 0 12 24
22 3 36 0 24 12
23 3 48 100 0 0
24 3 60 100 0 0
25 3 72 100 0 0
26 3 84 0 12 12
27 3 96 100 0 0
The result for the second example is:
id time event tae tbe
1 1 0 100 0 0
2 1 10 0 10 23
3 1 22 0 22 11
4 1 33 100 0 0
5 1 45 0 12 12
6 1 57 100 0 0
7 1 66 0 9 26
8 1 79 0 22 13
9 1 92 100 0 0
I think you may be impressed by how compact dplyr is, but going through many unnecessary calculations really does hurt run time:
> loopfun <- function(df){
+
+ event <- (df$event == 100)
+ lasttime <- 0
+
+ time <- df$time
+ tae <- rep(0, nrow(df))
+
+ for(i in 1:nrow(df)){
+
+ if(event[i]){
+
+ lasttime <- time[i]
+
+ }else{
+
+ tae[i] <- time[i] - lasttime
+
+ }
+
+ }
+
+ df$tae <- tae
+
+ return(df)
+ }
>
> dplyrfun <- function(df){
+
+ return(df %>%
+ mutate(tmp = c(0, diff(time)) * !event,
+ tmp2 = cumsum(c(FALSE, as.logical(diff(event))))) %>%
+ group_by(tmp2) %>%
+ mutate(tae = cumsum(tmp)) %>%
+ select(-tmp, -tmp2)
+ )
+
+ }
>
> microbenchmark(loopfun(df), dplyrfun(df), times = 10000)
Unit: microseconds
expr min lq mean median uq max neval
loopfun(df) 57.356 70.035 95.89365 82.109 96.599 49001.19 10000
dplyrfun(df) 1494.564 1625.274 1875.85263 1705.722 1877.336 50087.32 10000
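The loopfun above keeps a single lasttime for the whole data frame, which works for the example because every id starts with an event at time 0. The following is a sketch of a variant that resets at each new id and returns NA until the id has had an event. The name loopfun_by_id and the test data are made up for illustration, and the rows are assumed to be sorted by id and then by time.

```r
loopfun_by_id <- function(df) {
  n <- nrow(df)
  tae <- rep(NA_real_, n)
  lasttime <- NA_real_

  for (i in seq_len(n)) {
    # Forget the previous subject's event when a new id starts
    if (i == 1 || df$id[i] != df$id[i - 1]) {
      lasttime <- NA_real_
    }
    # An event row resets the reference time, so its own tae is 0
    if (df$event[i] > 0) {
      lasttime <- df$time[i]
    }
    # NA until this id has had an event
    tae[i] <- df$time[i] - lasttime
  }

  df$tae <- tae
  df
}

# Hypothetical example: the second id has no event at time 0
example <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  time  = c(0, 12, 24, 0, 10, 20),
  event = c(100, 0, 0, 0, 100, 0)
)
out <- loopfun_by_id(example)
print(out)
```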
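Thanks, good answer. The only problem is that my events are not evenly spaced at intervals of XX hours/days/weeks. Sorry for not making this clear in the original question; I have revised the example code above to illustrate it.
Great! Do you think this code could be modified to calculate the time until the next event?
+1 for this very nice solution. If you also want to drop the temporary variable tmp2, insert ungroup() before select(-tmp, -tmp2).
Thanks. I apologize for not realizing this earlier, but if the intervals vary within IDs, the rev(cumsum(…)) function will not produce the correct result. See the edit to my post above for a reproducible example.
In both examples here, there is an event at time = 0. When I run this code, tae gives odd results for the rows before the first event. Likewise, your example has an event in the last row; tbe does not seem to work when the last row has no event.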
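The edge cases raised in these comments (irregular spacing, and a last row without an event) can be handled by looking up the time of the next event directly instead of summing intervals. The following is a minimal base R sketch under stated assumptions: rows are sorted by time within each id, the data frame is made up for illustration, and rows with no later event get NA.

```r
# Hypothetical example data: irregular spacing, no event in the last two rows
df <- data.frame(
  id    = c(1, 1, 1, 1, 1, 1),
  time  = c(0, 10, 22, 33, 45, 57),
  event = c(100, 0, 0, 100, 0, 0)
)

# Time of the next event at or after each row, per id.
# Non-event rows contribute Inf, and a reversed cummin() carries the
# next event time backwards through the rows that precede it.
next_event <- ave(ifelse(df$event > 0, df$time, Inf), df$id,
                  FUN = function(x) rev(cummin(rev(x))))

# Time before the next event; rows with no later event become NA
df$tbe <- next_event - df$time
df$tbe[is.infinite(df$tbe)] <- NA

print(df)
```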