R: How to flatten / merge overlapping time periods


I have a large dataset of time periods, defined by a "start" and an "end" column. Some of the periods overlap.

I would like to combine (flatten / merge / collapse) all overlapping time periods into a single "start" value and a single "end" value.

Some example data:

  ID      start        end
1  A 2013-01-01 2013-01-05
2  A 2013-01-01 2013-01-05
3  A 2013-01-02 2013-01-03
4  A 2013-01-04 2013-01-06
5  A 2013-01-07 2013-01-09
6  A 2013-01-08 2013-01-11
7  A 2013-01-12 2013-01-15
Expected result:

  ID      start        end
1  A 2013-01-01 2013-01-06
2  A 2013-01-07 2013-01-11
3  A 2013-01-12 2013-01-15
What I've tried:

require(dplyr)

data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "A"), 
  start = structure(c(1356998400, 1356998400, 1357084800, 1357257600, 
  1357516800, 1357603200, 1357948800), tzone = "UTC", class = c("POSIXct", 
  "POSIXt")), end = structure(c(1357344000, 1357344000, 1357171200, 
  1357430400, 1357689600, 1357862400, 1358208000), tzone = "UTC", class = c("POSIXct", 
  "POSIXt"))), .Names = c("ID", "start", "end"), row.names = c(NA, 
  -7L), class = "data.frame")

remove.overlaps <- function(data) {
  data2 <- data
  for (i in 1:length(unique(data$start))) {
    # rows whose start falls inside the i-th period
    x3 <- filter(data2, start >= data$start[i] & start <= data$end[i])
    # collapse them into a single row covering the latest end
    x4 <- x3[1, ]
    x4$end <- max(x3$end)
    data2 <- filter(data2, start < data$start[i] | start > data$end[i])
    data2 <- rbind(data2, x4)
  }
  data2 <- na.omit(data2)
}

data <- remove.overlaps(data)

Here's a possible solution. The basic idea is to use the cummax function to compare each lagged start date with the maximum end date seen "so far", and to create an index that separates the data into groups:

data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = first(start), end = last(end))

# Source: local data frame [3 x 4]
# Groups: ID
# 
#   ID indx      start        end
# 1  A    0 2013-01-01 2013-01-06
# 2  A    1 2013-01-07 2013-01-11
# 3  A    2 2013-01-12 2013-01-15

@David Arenburg's answer is great - but I ran into an issue where an earlier interval ended after a later interval, and using last in the summarise call then produced the wrong end date. I'd suggest changing first(start) and last(end) to min(start) and max(end).

Also, as @Jonno Bourne mentioned, it is important to sort by start and by any grouping variables before applying the method.
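Putting both suggestions together, the pipeline from the accepted answer would look like this (a sketch; only the summarise line differs from the accepted answer):

library(dplyr)

data %>%
  arrange(ID, start) %>%        # sort by the grouping variable(s), then start
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end))   # min/max instead of first/last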

For the sake of completeness, there are some neat functions available for dealing with date or datetime ranges. One of them is the reduce() function from the IRanges package, which merges overlapping or adjacent ranges.

However, there is a drawback: IRanges works on integer ranges (hence the name), so the convenience of using the IRanges functions comes at the expense of converting the Date or POSIXct objects back and forth.

Also, it seems that dplyr doesn't play well with IRanges (at least judging by my limited experience with dplyr), so I use data.table:

library(data.table)
options(datatable.print.class = TRUE)
library(IRanges)
library(lubridate)

setDT(data)[, {
  ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
  .(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
}, by = ID]
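       ID      start        end
   <fctr>     <POSc>     <POSc>
1:      A 2013-01-01 2013-01-06
2:      A 2013-01-07 2013-01-11
3:      A 2013-01-12 2013-01-15

A code variant is

setDT(data)[, as.data.table(reduce(IRanges(as.numeric(start), as.numeric(end))))[
  , lapply(.SD, as_datetime), .SDcols = -"width"], 
  by = ID]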
In both variants, as_datetime() from the lubridate package is used, which spares having to specify the origin when converting numbers to POSIXct objects.
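As a quick illustration of that conversion (my own example, not from the answer):

library(lubridate)

as_datetime(1356998400)
# [1] "2013-01-01 UTC"
as.POSIXct(1356998400, origin = "1970-01-01", tz = "UTC")  # base equivalent: the origin must be spelled out
# [1] "2013-01-01 UTC"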


It would be interesting to see a benchmark comparison of the IRanges approaches against the other methods.

Looks like I'm a little late to the party, but I took @Zach's code and re-wrote it using data.table. I didn't do comprehensive testing, but this seemed to run about 20% faster than the tidy version. (I couldn't test the IRanges method, because that package is not yet available for R 3.5.1.)

Also, fwiw, the accepted answer doesn't cover the edge case in which one date range falls entirely within another (e.g. 2018-07-07 to 2018-07-14 lies within 2018-05-01 to 2018-12-01). @Zach's answer does capture this edge case.
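A quick toy check of that edge case (my own example, not from either answer): with one range nested inside another, last(end) picks the inner range's end date while max(end) picks the correct one:

library(dplyr)

nested <- tibble(ID = "A",
                 start = as.Date(c("2018-05-01", "2018-07-07")),
                 end   = as.Date(c("2018-12-01", "2018-07-14")))

nested %>%
  arrange(ID, start) %>%
  group_by(ID) %>%
  summarise(last_end = last(end),   # 2018-07-14: wrong, the nested range's end
            max_end  = max(end))    # 2018-12-01: the true end of the merged period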

library(data.table)

start_col = c("2018-01-01","2018-03-01","2018-03-10","2018-03-20","2018-04-10","2018-05-01","2018-05-05","2018-05-10","2018-07-07")
end_col = c("2018-01-21","2018-03-21","2018-03-31","2018-04-09","2018-04-30","2018-05-21","2018-05-26","2018-05-30","2018-07-14")

# create fake data, double it, add ID
# change row 17, such that each ID grouping is a little different
# also adds an edge case in which one date range is totally within another
# (this is the edge case not currently captured by the accepted answer)
d <- data.table(start_col = as.Date(start_col), end_col = as.Date(end_col))
d2<- rbind(d,d)
d2[1:(.N/2), ID := 1]
d2[(.N/2 +1):.N, ID := 2]
d2[17,end_col := as.Date('2018-12-01')]

# set keys (also orders)
setkey(d2, ID, start_col, end_col)

# get rid of overlapping transactions and do the date math
squished <- d2[,.(START_DT = start_col, 
                  END_DT = end_col, 
                  indx = c(0, cumsum(as.numeric(shift(start_col, type = "lead")) > cummax(as.numeric(end_col)))[-.N])),  # shift(type = "lead") is data.table's lead()
               keyby=ID
               ][,.(start=min(START_DT), 
                    end = max(END_DT)),
                 by=c("ID","indx")
                 ]

Thank you for the great answer! But the problem is that when I use the function on my actual dataset, the dates end up being saved in seconds format, and I have to wrap the summarise variables in as.POSIXct() to convert them back. Any idea why?

Not sure what that means... When I save the result into some variable, both start and end are of class POSIXct...

By the way, if you use multiple IDs you have to arrange(data, ID, start), because lead is not affected by grouping, so it can take dates from outside the ID group and mess up the final structure. Not part of the question, but I found that out the hard way.

What does [-n()] do? I was able to adapt this to my needs (a similar situation, but where a gap of fewer than 90 days between dates still counts as "overlapping"), yet I had to copy [-n()] verbatim without really understanding what it does. Aha! I figured it out: it removes the last item of the cumsum to accommodate the 0 added at the beginning of the vector. (See the sketch after this thread.)

In addition to collapsing rows with overlapping intervals, how can we also take the minimum of another column, e.g. a value column in data?

@HNSKD, please post that as a separate question. But a short answer would be:

library(data.table)
setDT(data)[order(start, end),
            grp := cumsum(cummax(shift(as.numeric(end), fill = 0)) < as.numeric(start))
            ][, .(start = min(start), end = max(end), value = min(value)), by = grp]
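To unpack the [-n()] question from the thread above, here is a minimal sketch (my own toy example, with plain numbers standing in for dates); outside a grouped pipe, length() plays the role of n():

library(dplyr)

start <- c(1, 2, 4, 7)    # toy start dates
end   <- c(5, 3, 6, 9)    # toy end dates

flag <- as.numeric(lead(start) > cummax(end))
flag
# [1]  0  0  1 NA          <- trailing NA because lead() runs off the end
c(0, cumsum(flag)[-length(flag)])
# [1] 0 0 0 1              <- drop the trailing slot, prepend 0 to realign with the rows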