R: manipulating dates alongside consecutive results

I need some help working with consecutive results.

Here is my sample data:

df <- structure(list(idno = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 
2, 2, 2), result = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("Negative", "Positive"
), class = c("ordered", "factor")), samp_date = structure(c(15909, 
15938, 15979, 16007, 16041, 16080, 16182, 16504, 16576, 16645, 
16721, 16745, 17105, 17281, 17416, 17429), class = "Date")), class = "data.frame", row.names = c(NA, 
-16L))
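
For each idno, I need the date of the first "Negative" result in a run of consecutive "Negative" results that lasts at least 30 days, with no "Positive" result in between.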

The expected answer for idno == 1 is 2013-10-29, and for idno == 2 it is 2015-11-06.

I tried using rle(as.character(df$result)), but I am struggling to see how to apply it to grouped data.

I would prefer an approach that uses dplyr or data.table.

Thanks for your help.

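A dplyr-based solution: create a group id for each run of consecutive values in the result column, then select the first "Negative" run that meets the criterion: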

library(dplyr)
df %>% mutate(samp_date = as.Date(samp_date)) %>% 
  group_by(idno) %>%
  arrange(samp_date) %>%
  mutate(result_grp = cumsum(as.character(result)!=lag(as.character(result),default=""))) %>%
  group_by(idno, result_grp) %>%
  filter( result == "Negative" & (max(samp_date) - min(samp_date) )>=30) %>%
  slice(1) %>%
  ungroup() %>%
  select(-result_grp) 

# # A tibble: 2 x 3
# idno result   samp_date 
# <dbl> <ord>    <date>    
# 1  1.00 Negative 2013-10-29
# 2  2.00 Negative 2015-11-06
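
To make the run-id step in the pipeline above easier to follow, here is a minimal sketch on a toy vector. The vector res and the names is_new_run and run_id are made up for illustration and are not part of the question's data.

```r
library(dplyr)

# Toy vector of results (hypothetical, not the question's data)
res <- c("Negative", "Positive", "Positive", "Negative", "Negative")

# TRUE wherever a value differs from the previous one; the first element is
# compared with the default "" and therefore always starts a new run
is_new_run <- res != lag(res, default = "")

# The cumulative sum of the change flags gives one id per consecutive run
run_id <- cumsum(is_new_run)
run_id
# [1] 1 2 2 3 3
```

Grouping by this id (together with idno) is what lets max(samp_date) - min(samp_date) measure the span of each run separately.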
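
library(dplyr)

# For each Negative sample that is followed by another Negative sample, compute
# the gap to the next sample date (a negative number of days); otherwise 0.
# Note: this tests the gap between adjacent samples, not the length of the whole run.
df %>% group_by(idno) %>%
  mutate(time_diff = ifelse(result == "Negative" & lead(result == "Negative"),
                            samp_date - lead(samp_date), 0),
         ConsNegDate = min(samp_date[which(abs(time_diff) > 30)]))

# # A tibble: 16 x 5
# # Groups:   idno [2]
#     idno result   samp_date  time_diff ConsNegDate
#  1     1 Negative 2013-07-23         0 2013-10-29
#  2     1 Positive 2013-08-21         0 2013-10-29
#  3     1 Positive 2013-10-01         0 2013-10-29
#  4     1 Negative 2013-10-29       -34 2013-10-29
#  5     1 Negative 2013-12-02       -39 2013-10-29
#  6     1 Negative 2014-01-10      -102 2013-10-29
#  7     1 Negative 2014-04-22      -322 2013-10-29
#  8     1 Negative 2015-03-10       -72 2013-10-29
#  9     1 Negative 2015-05-21       -69 2013-10-29
# 10     1 Negative 2015-07-29        NA 2013-10-29
# 11     2 Positive 2015-10-13         0 2015-11-06
# 12     2 Negative 2015-11-06      -360 2015-11-06
# 13     2 Negative 2016-10-31         0 2015-11-06
# 14     2 Positive 2017-04-25         0 2015-11-06
# 15     2 Positive 2017-09-07         0 2015-11-06
# 16     2 Positive 2017-09-20         0 2015-11-06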

Similar to @MKR's answer, you can create a grouping variable and summarize in data.table:

library(data.table)
setDT(df)[, samp_date := as.IDate(samp_date)]

# summarize by grouping var g = rleid(idno, result)    
runDT = df[, .(
  start = first(samp_date),
  end  = last(samp_date),
  dur  = difftime(last(samp_date), first(samp_date), units="days")
), by=.(idno, result, g = rleid(idno, result))]

#    idno   result g      start        end      dur
# 1:    1 Negative 1 2013-07-23 2013-07-23   0 days
# 2:    1 Positive 2 2013-08-21 2013-10-01  41 days
# 3:    1 Negative 3 2013-10-29 2015-07-29 638 days
# 4:    2 Positive 4 2015-10-13 2015-10-13   0 days
# 5:    2 Negative 5 2015-11-06 2016-10-31 360 days
# 6:    2 Positive 6 2017-04-25 2017-09-20 148 days

# find rows meeting the criterion
w = runDT[.(idno = unique(idno), result = "Negative", min_dur = 30), 
  on=.(idno, result, dur >= min_dur), mult="first", which=TRUE]

# filter
runDT[w]

#    idno   result g      start        end      dur
# 1:    1 Negative 3 2013-10-29 2015-07-29 638 days
# 2:    2 Negative 5 2015-11-06 2016-10-31 360 days
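
As a hedged alternative to the which=TRUE join step above, an ordinary filter followed by taking the first row per idno gives the same two runs. This is only a sketch: the table runs is a hypothetical stand-in for runDT, with dur stored as a plain number of days instead of a difftime.

```r
library(data.table)

# Hypothetical run summary shaped like runDT above (dur in days, as numeric)
runs <- data.table(
  idno   = c(1, 1, 1, 2, 2, 2),
  result = c("Negative", "Positive", "Negative",
             "Positive", "Negative", "Positive"),
  start  = as.IDate(c("2013-07-23", "2013-08-21", "2013-10-29",
                      "2015-10-13", "2015-11-06", "2017-04-25")),
  dur    = c(0, 41, 638, 0, 360, 148)
)

# Keep Negative runs of at least 30 days, then take the first one per idno
first_neg <- runs[result == "Negative" & dur >= 30, .SD[1], by = idno]
first_neg
```

The filter leaves dur untouched, at the cost of not using a join.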

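I would rather do just the join instead of using w, but then the contents of dur would be overwritten with min_dur, which is not ideal... I am not sure which issue this is related to.

Thanks for the answer, Frank. I chose the dplyr version only because I find data.table harder to understand.

@Frank Using rleid makes the data.table version easier.

@MKR You could also edit your post here to show rleid. The OP can use that function without needing any other data.table syntax.

Thanks @Frank. I just wanted to load only one package. Also, to me, a cumsum-based approach to get something like rleid seems simpler and more intuitive.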
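
To illustrate the two options discussed in these comments, here is a small sketch that computes run ids both ways inside a dplyr pipeline. The data frame toy and the column names g_rleid and g_cumsum are made up for illustration; rleid is called with the data.table:: prefix so the package does not have to be attached.

```r
library(dplyr)

# Toy data (hypothetical, not the question's data)
toy <- data.frame(
  idno   = c(1, 1, 1, 2, 2),
  result = c("Negative", "Negative", "Positive", "Positive", "Negative")
)

out <- toy %>%
  # rleid() numbers runs across the whole table, starting a new run
  # whenever idno or result changes
  mutate(g_rleid = data.table::rleid(idno, result)) %>%
  group_by(idno) %>%
  # the cumsum/lag version restarts its numbering within each idno
  mutate(g_cumsum = cumsum(result != lag(result, default = ""))) %>%
  ungroup()

out$g_rleid
# [1] 1 1 2 3 4
out$g_cumsum
# [1] 1 1 2 1 2
```

The numbers differ, but grouping by idno plus either id splits the rows into the same runs.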