Group a table in R by ID and sequence, without gaps
I have a fictitious hospital data table and need to replace each discharge date with the final discharge date whenever the (non-existent) patients were transferred between hospitals.
rows <- sort(c(which(data$TRANSFER_NUM != 0), which(data$TRANSFER_NUM == 1)-1))
subset <- data[rows,]
This returns the wrong result for person B, whereas the correct result should be:
ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM NEW_DISCHARGE_DATE
A 1992-12-04 3360 0 1993-11-25
A 1993-02-11 3361 1 1993-11-25
A 1993-03-10 3362 2 1993-11-25
A 1993-11-25 3363 3 1993-11-25
B 1987-05-15 3419 0 1987-05-19
B 1987-05-19 3420 1 1987-05-19
B 1990-02-03 3473 0 1990-02-05
B 1990-02-05 3474 1 1990-02-05
I think some extra grouping might help, for example:
ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM GROUP NEW_DISCHARGE_DATE
A 1992-12-04 3360 0 1 1993-11-25
A 1993-02-11 3361 1 1 1993-11-25
A 1993-03-10 3362 2 1 1993-11-25
A 1993-11-25 3363 3 1 1993-11-25
B 1987-05-15 3419 0 1 1987-05-19
B 1987-05-19 3420 1 1 1987-05-19
B 1990-02-03 3473 0 2 1990-02-05
B 1990-02-05 3474 1 2 1990-02-05
Any help would be much appreciated.

Try:
ddply(subset, .(ID,grp=c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))
It assumes the transfer numbers are consecutive, i.e. 1:x.
Following the comments, this is the result I get:
library(plyr)  # provides ddply() and mutate()
subset<-read.table(text="ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM
A 1992-12-04 3360 0
A 1993-02-11 3361 1
A 1993-03-10 3362 2
A 1993-11-25 3363 3
B 1987-05-15 3419 0
B 1987-05-19 3420 1
B 1990-02-03 3473 0
B 1990-02-05 3474 1",header=T)
subset$DISCHARGE_DATE<-as.Date(subset$DISCHARGE_DATE)
ddply(subset, .(ID,grp=c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))
grp ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM max
1 0 A 1992-12-04 3360 0 1993-11-25
2 0 A 1993-02-11 3361 1 1993-11-25
3 0 A 1993-03-10 3362 2 1993-11-25
4 0 A 1993-11-25 3363 3 1993-11-25
5 -6 B 1990-02-03 3473 0 1990-02-05
6 -6 B 1990-02-05 3474 1 1990-02-05
7 -4 B 1987-05-15 3419 0 1987-05-19
8 -4 B 1987-05-19 3420 1 1987-05-19
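
The grp values in that output (0, -6, -4) are only labels; what matters is that they change whenever a new stay begins. As a rough base-R sketch of the same idea, assuming every stay starts with a row where TRANSFER_NUM is 0 and that rows can be ordered by ID and FILE_SEQUENCE, the grouping can also be built without plyr. The names `visits`, `GROUP`, `key` and `NEW_DISCHARGE_DATE` are illustrative choices, not part of the answer above.

```r
# Illustrative data, same values as the table in the question.
visits <- read.table(text = "ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM
A 1992-12-04 3360 0
A 1993-02-11 3361 1
A 1993-03-10 3362 2
A 1993-11-25 3363 3
B 1987-05-15 3419 0
B 1987-05-19 3420 1
B 1990-02-03 3473 0
B 1990-02-05 3474 1", header = TRUE, stringsAsFactors = FALSE)
visits$DISCHARGE_DATE <- as.Date(visits$DISCHARGE_DATE)

# Assumption: file order within each ID is given by FILE_SEQUENCE.
visits <- visits[order(visits$ID, visits$FILE_SEQUENCE), ]

# Within each ID, count the rows where TRANSFER_NUM is 0 so far;
# that running count labels the stay each row belongs to.
visits$GROUP <- ave(as.integer(visits$TRANSFER_NUM == 0), visits$ID, FUN = cumsum)

# One key per ID/stay pair, then the latest discharge date within each key.
# ave() is applied to the numeric day counts, so convert back to Date after.
key <- paste(visits$ID, visits$GROUP)
last_day <- ave(as.numeric(visits$DISCHARGE_DATE), key, FUN = max)
visits$NEW_DISCHARGE_DATE <- as.Date(last_day, origin = "1970-01-01")

print(visits)
```

Counting zeros per ID, rather than differencing across the whole table, keeps the group numbers at 1, 2, ... within each patient, which matches the GROUP column sketched in the question.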
Right, you need an intermediate grouping column. Here is a nested ddply:
ddply(
ddply(df, "ID", mutate, GROUP=cumsum(c(0, diff(TRANSFER_NUM) < 0))),
c("ID", "GROUP"),
mutate, DISCHARGE_NEW=max(as.character(DISCHARGE_DATE))
)
# ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM GROUP DISCHARGE_NEW
# 1 A 1992-12-04 3360 0 0 1993-11-25
# 2 A 1993-02-11 3361 1 0 1993-11-25
# 3 A 1993-03-10 3362 2 0 1993-11-25
# 4 A 1993-11-25 3363 3 0 1993-11-25
# 5 B 1987-05-15 3419 0 0 1987-05-19
# 6 B 1987-05-19 3420 1 0 1987-05-19
# 7 B 1990-02-03 3473 0 1 1990-02-05
# 8 B 1990-02-05 3474 1 1 1990-02-05
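
To see why the inner call yields the GROUP column, it may help to trace the expression step by step on person B's TRANSFER_NUM values. This is only an illustration; `transfer_num`, `step`, `reset` and `group` are stand-in names, not part of the code above.

```r
# Person B's TRANSFER_NUM values, in file order (stand-in vector).
transfer_num <- c(0, 1, 0, 1)

# Change from each row to the next: 1 -1 1
step <- diff(transfer_num)

# A drop in the counter means a new stay began: FALSE TRUE FALSE
reset <- step < 0

# Pad with 0 for the first row, then take a running count of resets: 0 0 1 1
group <- cumsum(c(0, reset))

print(group)
```

One caveat, under the assumption that a stay could consist of a single row: two consecutive rows with TRANSFER_NUM 0 have a difference of 0, so the `< 0` test would not separate them, whereas `<= 0` would.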
Just a gentle hint: data and subset are both commonly used R functions. You might consider not using them as object names.