Looping through id within key in R or SQL
Suppose I have the following data:
key  id  value  Class    duration  Cond  Start       End
---  --  -----  -------  --------  ----  ----------  ----------
30   1   A      A,B      NA        NA    2018-02-27  2018-03-07
30   2   B      B        19        20    2018-02-27  2018-03-26
40   1   C      C,D      NA        NA    2018-12-17  2018-12-25
40   2   D      D        168       30    2018-12-17  2019-06-11
50   1   A      A,C,D    NA        NA    2018-04-10  2018-06-21
50   2   C      C,D      16        30    2018-04-10  2018-07-07
50   3   D      D        28        20    2018-04-10  2018-08-04
60   1   B      B,C,D    NA        NA    2016-05-13  2016-05-18
60   2   C      C,D      49        20    2016-05-13  2016-07-06
60   3   D      D        47        30    2016-05-13  2016-08-22
70   1   A      A,C,D    NA        NA    2017-01-09  2017-11-01
70   2   C      C,D      60        5     2017-01-09  2017-12-31
70   3   D      D        17        28    2017-01-09  2018-01-17
80   1   A      A,C,D    NA        NA    2019-09-18  2020-01-07
80   2   C      C,D      2         20    2019-09-18  2020-01-09
80   3   D      D        2         30    2019-09-18  2020-01-11
90   1   A      A,B,C,D  NA        NA    2017-01-17  2017-02-15
90   2   B      B,C,D    21        30    2017-01-17  2017-03-08
90   3   C      C,D      23        20    2017-01-17  2017-03-31
90   4   D      D        299       28    2017-01-17  2018-01-24
The data can be generated with the following code:
df <- as.data.frame(cbind(key = c(30, 30, 40, 40, 50, 50, 50, 60, 60, 60,
70, 70, 70, 80, 80, 80, 90, 90, 90, 90),
id = c(1, 2, 1, 2, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4),
value = c("A", "B", "C", "D", "A", "C", "D", "B", "C", "D", "A", "C", "D","A", "C", "D",
"A", "B", "C", "D"),
Class = c("A,B", "B", "C,D", "D", "A,C,D", "C,D", "D", "B,C,D", "C,D", "D", "A,C,D", "C,D", "D",
"A,C,D", "C,D", "D", "A,B,C,D", "B,C,D", "C,D", "D"),
duration = c(NA, 19, NA, 168, NA, 16, 28, NA, 49, 47,
NA, 60, 17, NA, 2, 2, NA, 21, 23, 299),
Cond = c(NA, 20, NA, 30, NA, 30, 20, NA, 20, 30,
NA, 5, 28, NA, 20, 30, NA, 30, 20, 28),
Start = c("2018-02-27", "2018-02-27", "2018-12-17", "2018-12-17", "2018-04-10", "2018-04-10", "2018-04-10",
"2016-05-13", "2016-05-13", "2016-05-13", "2017-01-09", "2017-01-09", "2017-01-09",
"2020-09-08", "2019-09-18", "2019-09-18", "2017-01-17", "2017-01-17", "2017-01-17", "2017-01-17"),
End = c("2018-03-07", "2018-03-26", "2018-12-25", "2019-06-11", "2018-06-21", "2018-07-07", "2018-08-04",
"2016-05-18", "2016-07-06", "2016-08-22", "2017-11-01", "2017-12-31", "2018-01-17",
"2020-01-07", "2020-01-09", "2020-01-11", "2017-02-15", "2017-03-08", "2017-03-31", "2018-01-24")
))
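One thing worth noting about the code above: cbind() builds a matrix, and a matrix holds a single type, so mixing numbers and strings turns every column into text (factors before R 4.0, character from R 4.0 on). The sketch below illustrates this on a two-row sample and shows one possible alternative, building the columns with data.frame() so that they keep their own types. The names m and demo are only illustrative and are not part of the original question.

```r
# cbind() on mixed vectors yields a character matrix, so numbers become text
m <- cbind(key = c(30, 30), duration = c(NA, 19), value = c("A", "B"))
typeof(m)

# Alternative sketch: data.frame() keeps each column's own type
demo <- data.frame(
  key      = c(30, 30),
  id       = c(1, 2),
  value    = c("A", "B"),
  duration = c(NA, 19),
  Cond     = c(NA, 20),
  Start    = as.Date(c("2018-02-27", "2018-02-27")),
  End      = as.Date(c("2018-03-07", "2018-03-26"))
)
str(demo)
```

With typed columns, a conversion step such as as.integer(as.character(.)) is no longer needed, although it remains harmless.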
Based on this logic, I then want to generate the following new data:
key  id  value  Class    duration  Cond  Start       End
---  --  -----  -------  --------  ----  ----------  ----------
30   1   A      A,B      NA        NA    2018-02-27  2018-03-26
40   1   C      C,D      NA        NA    2018-12-17  2018-12-25
40   2   D      D        168       30    2018-12-26  2019-06-11
50   1   A      A,C,D    NA        NA    2018-04-10  2018-07-07
50   3   D      D        28        20    2018-07-08  2018-08-04
60   1   B      B,C,D    NA        NA    2016-05-13  2016-05-18
60   2   C      C,D      49        20    2016-05-19  2016-07-06
60   3   D      D        47        30    2016-07-07  2016-08-22
70   1   A      A,C,D    NA        NA    2017-01-09  2017-11-01
70   2   C      C,D      60        5     2017-11-02  2018-01-17
80   1   A      A,C,D    NA        NA    2019-09-18  2020-01-11
90   1   A      A,B,C,D  NA        NA    2017-01-17  2017-03-08
90   3   C      C,D      23        20    2017-03-09  2017-03-31
90   4   D      D        299       28    2017-04-01  2018-01-24
You could try:
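library(dplyr)

df %>%
  # cbind() turned every column into text, so restore integer and date types
  mutate(across(duration:Cond, ~ as.integer(as.character(.))),
         across(Start:End, ~ as.Date(as.character(.)))) %>%
  # a new group starts at each row with NA/NA or with duration >= Cond
  group_by(key, idx = cumsum((is.na(duration) & is.na(Cond)) | duration >= Cond)) %>%
  # keep the first row of each group, but take End from its last row
  summarise(across(id:Start, first), End = last(End)) %>%
  # after the first row of a key, Start is the day after the previous End
  mutate(Start = case_when(row_number() == 1 ~ Start, TRUE ~ lag(End) + 1L)) %>%
  ungroup() %>%
  select(-idx)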
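Output (row 11 has Start 2020-09-08 because the generating code uses that value for key 80, whereas the table in the question lists 2019-09-18):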
# A tibble: 14 x 8
   key   id    value Class   duration  Cond Start      End
   <fct> <fct> <fct> <fct>      <int> <int> <date>     <date>
 1 30    1     A     A,B           NA    NA 2018-02-27 2018-03-26
 2 40    1     C     C,D           NA    NA 2018-12-17 2018-12-25
 3 40    2     D     D            168    30 2018-12-26 2019-06-11
 4 50    1     A     A,C,D         NA    NA 2018-04-10 2018-07-07
 5 50    3     D     D             28    20 2018-07-08 2018-08-04
 6 60    1     B     B,C,D         NA    NA 2016-05-13 2016-05-18
 7 60    2     C     C,D           49    20 2016-05-19 2016-07-06
 8 60    3     D     D             47    30 2016-07-07 2016-08-22
 9 70    1     A     A,C,D         NA    NA 2017-01-09 2017-11-01
10 70    2     C     C,D           60     5 2017-11-02 2018-01-17
11 80    1     A     A,C,D         NA    NA 2020-09-08 2020-01-11
12 90    1     A     A,B,C,D       NA    NA 2017-01-17 2017-03-08
13 90    3     C     C,D           23    20 2017-03-09 2017-03-31
14 90    4     D     D            299    28 2017-04-01 2018-01-24
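To make the grouping step easier to follow, here is a base-R sketch that applies the same idea to key 90 alone. It is a simplified illustration, assuming the rule is the one used in the dplyr code (a row starts a new group when both duration and Cond are NA, or when duration >= Cond). The names k90, starts_group, idx and res are made up for this example.

```r
# Rows for key 90 only, with typed columns
k90 <- data.frame(
  id       = 1:4,
  duration = c(NA, 21, 23, 299),
  Cond     = c(NA, 30, 20, 28),
  Start    = as.Date(rep("2017-01-17", 4)),
  End      = as.Date(c("2017-02-15", "2017-03-08", "2017-03-31", "2018-01-24"))
)

# A row opens a new group when both values are NA or when duration >= Cond
starts_group <- (is.na(k90$duration) & is.na(k90$Cond)) | k90$duration >= k90$Cond

# Running count of group starts: rows sharing a number are merged together
idx <- cumsum(starts_group)

# Keep the first row of each group, but take End from the group's last row
first_row <- !duplicated(idx)
last_row  <- !duplicated(idx, fromLast = TRUE)
res <- k90[first_row, ]
res$End <- k90$End[last_row]

# Every row after the first starts the day after the previous End
n <- nrow(res)
if (n > 1) res$Start[-1] <- res$End[-n] + 1

res
```

Here idx comes out as 1, 1, 2, 3, so id 2 (21 < 30) is folded into id 1, while ids 3 and 4 stay as separate rows. The resulting Start dates are 2017-01-17, 2017-03-09 and 2017-04-01, matching rows 12 to 14 of the output.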
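Note, however, that for key 90 there is one more row, since 23 > 20. If that is not correct, you will need to provide some additional explanation.
Thanks! You are right, the last one was a mistake on my side. This solution is great. Thank you very much!!!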