R: manipulating dates alongside consecutive results
Tags: r, dplyr, data.table

I need some help with consecutive results. Here is my sample data:
df <- structure(list(idno = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2), result = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("Negative", "Positive"
), class = c("ordered", "factor")), samp_date = structure(c(15909,
15938, 15979, 16007, 16041, 16080, 16182, 16504, 16576, 16645,
16721, 16745, 17105, 17281, 17416, 17429), class = "Date")), class = "data.frame", row.names = c(NA,
-16L))
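For each idno, I want the first date that starts a run of consecutive "Negative" results lasting 30 days or more, with no "Positive" result in between.
The expected answer is 2013-10-29 for idno == 1 and 2015-11-06 for idno == 2.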
I tried using rle(as.character(df$result)), but I struggled to understand how to apply it to grouped data.
I would prefer an approach that uses dplyr or data.table.
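Thanks for your help.

A dplyr-based solution: create a group id for each run of consecutive identical values in the result column, then select the first row of each run that meets the criterion: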
library(dplyr)
df %>% mutate(samp_date = as.Date(samp_date)) %>%
group_by(idno) %>%
arrange(samp_date) %>%
mutate(result_grp = cumsum(as.character(result)!=lag(as.character(result),default=""))) %>%
group_by(idno, result_grp) %>%
filter( result == "Negative" & (max(samp_date) - min(samp_date) )>=30) %>%
slice(1) %>%
ungroup() %>%
select(-result_grp)
# # A tibble: 2 x 3
# idno result samp_date
# <dbl> <ord> <date>
# 1 1.00 Negative 2013-10-29
# 2 2.00 Negative 2015-11-06
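The cumsum/lag step above is the part that builds the run ids, so a minimal sketch of it in isolation may help. The vector x is hypothetical toy data, not the question's df, and changed and run_id are names chosen only for this illustration.

```r
library(dplyr)

# Hypothetical toy vector, not the question's data
x <- c("Negative", "Positive", "Positive", "Negative", "Negative")

# TRUE wherever a value differs from the previous one; default = "" means
# the first element is always counted as a change
changed <- x != dplyr::lag(x, default = "")

# A running count of the changes gives one id per run of identical values
run_id <- cumsum(changed)
run_id
```

Rows that share a run_id belong to the same uninterrupted run, which is why grouping by idno and result_grp lets max(samp_date) - min(samp_date) measure the span of each run.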
library(dplyr)
df %>% group_by(idno) %>%
  mutate(time_diff = ifelse(result == "Negative" & lead(result == "Negative"),
                            samp_date - lead(samp_date), 0),
         ConsNegDate = min(samp_date[which(abs(time_diff) > 30)]))
# # A tibble: 16 x 5
# # Groups:   idno [2]
#     idno result   samp_date  time_diff ConsNegDate
#  1     1 Negative 2013-07-23         0 2013-10-29
#  2     1 Positive 2013-08-21         0 2013-10-29
#  3     1 Positive 2013-10-01         0 2013-10-29
#  4     1 Negative 2013-10-29       -34 2013-10-29
#  5     1 Negative 2013-12-02       -39 2013-10-29
#  6     1 Negative 2014-01-10      -102 2013-10-29
#  7     1 Negative 2014-04-22      -322 2013-10-29
#  8     1 Negative 2015-03-10       -72 2013-10-29
#  9     1 Negative 2015-05-21       -69 2013-10-29
# 10     1 Negative 2015-07-29        NA 2013-10-29
# 11     2 Positive 2015-10-13         0 2015-11-06
# 12     2 Negative 2015-11-06      -360 2015-11-06
# 13     2 Negative 2016-10-31         0 2015-11-06
# 14     2 Positive 2017-04-25         0 2015-11-06
# 15     2 Positive 2017-09-07         0 2015-11-06
# 16     2 Positive 2017-09-20         0 2015-11-06
Similar to @MKR's answer, you can create a grouping variable and summarize by it in data.table:
library(data.table)
setDT(df)[, samp_date := as.IDate(samp_date)]
# summarize by grouping var g = rleid(idno, result)
runDT = df[, .(
start = first(samp_date),
end = last(samp_date),
dur = difftime(last(samp_date), first(samp_date), units="days")
), by=.(idno, result, g = rleid(idno, result))]
# idno result g start end dur
# 1: 1 Negative 1 2013-07-23 2013-07-23 0 days
# 2: 1 Positive 2 2013-08-21 2013-10-01 41 days
# 3: 1 Negative 3 2013-10-29 2015-07-29 638 days
# 4: 2 Positive 4 2015-10-13 2015-10-13 0 days
# 5: 2 Negative 5 2015-11-06 2016-10-31 360 days
# 6: 2 Positive 6 2017-04-25 2017-09-20 148 days
# find rows meeting the criterion
w = runDT[.(idno = unique(idno), result = "Negative", min_dur = 30),
on=.(idno, result, dur >= min_dur), mult="first", which=TRUE]
# filter
runDT[w]
# idno result g start end dur
# 1: 1 Negative 3 2013-10-29 2015-07-29 638 days
# 2: 2 Negative 5 2015-11-06 2016-10-31 360 days
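I would rather just do the join instead of using w, but then the contents of dur would be filled in with min_dur, which is not ideal... I am not sure which issue this relates to.

Thanks for the answer, Frank. I chose the dplyr version only because I find data.table harder to understand.

@Frank Using rleid makes the data.table version easier.

@MKR You could also edit your post to illustrate rleid. The OP can use that function without needing any other data.table syntax.

Thanks @Frank. I just prefer to load only one package. Also, to me, a cumsum-based approach to getting something like rleid seems simpler and more intuitive.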
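As suggested in the comments, rleid can be called from inside a dplyr pipeline without using any other data.table syntax. The following is only a sketch under assumptions: toy is hypothetical data rather than the question's df, first_neg is a name made up for this illustration, and data.table has to be installed even though it is never attached.

```r
library(dplyr)

# Hypothetical toy data, not the question's df
toy <- tibble(
  idno      = c(1, 1, 1, 1, 2, 2, 2),
  result    = c("Negative", "Positive", "Negative", "Negative",
                "Positive", "Negative", "Negative"),
  samp_date = as.Date(c("2020-01-01", "2020-01-10", "2020-02-01", "2020-03-15",
                        "2020-01-05", "2020-01-20", "2020-03-01"))
)

first_neg <- toy %>%
  arrange(idno, samp_date) %>%
  # rleid() starts a new id whenever idno or result changes
  mutate(g = data.table::rleid(idno, result)) %>%
  group_by(idno, g) %>%
  # keep only Negative runs that span at least 30 days
  filter(result == "Negative", max(samp_date) - min(samp_date) >= 30) %>%
  slice(1) %>%
  # if an idno has several qualifying runs, keep the earliest one
  group_by(idno) %>%
  slice(1) %>%
  ungroup() %>%
  select(-g)

first_neg
```

Here rleid(idno, result) plays the same role as the cumsum/lag expression, at the cost of depending on a second package.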