R: Find overlapping time intervals within groups and keep the largest non-overlapping intervals
Problem
I have a grouped data frame with overlapping intervals (dates are in ymd format). I want to keep only the largest non-overlapping intervals in each group.
Example data
# Packages
library(tidyverse)
library(lubridate)
# Example data
df <- tibble(
group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
start = as_date(
c("2019-01-10", "2019-02-01", "2019-10-05", "2018-07-01", "2019-01-01", "2019-10-01", "2019-10-01", "2019-11-30","2019-11-20")),
end = as_date(
c("2019-02-07", "2019-05-01", "2019-11-15", "2018-07-31", "2019-05-05", "2019-11-06", "2019-10-07", "2019-12-10","2019-12-31"))) %>%
mutate(intval = interval(start, end),
intval_length = intval / days(1))
df
#> # A tibble: 9 x 5
#> group start end intval intval_length
#> <dbl> <date> <date> <Interval> <dbl>
#> 1 1 2019-01-10 2019-02-07 2019-01-10 UTC--2019-02-07 UTC 28
#> 2 1 2019-02-01 2019-05-01 2019-02-01 UTC--2019-05-01 UTC 89
#> 3 1 2019-10-05 2019-11-15 2019-10-05 UTC--2019-11-15 UTC 41
#> 4 2 2018-07-01 2018-07-31 2018-07-01 UTC--2018-07-31 UTC 30
#> 5 2 2019-01-01 2019-05-05 2019-01-01 UTC--2019-05-05 UTC 124
#> 6 3 2019-10-01 2019-11-06 2019-10-01 UTC--2019-11-06 UTC 36
#> 7 3 2019-10-01 2019-10-07 2019-10-01 UTC--2019-10-07 UTC 6
#> 8 3 2019-11-30 2019-12-10 2019-11-30 UTC--2019-12-10 UTC 10
#> 9 3 2019-11-20 2019-12-31 2019-11-20 UTC--2019-12-31 UTC 41
# Goal
# Row: 1 and 2; 6 to 9 have overlaps; Keep rows with largest intervals (in days)
df1 <- df[-c(1, 7, 8),]
df1
#> # A tibble: 6 x 5
#> group start end intval intval_length
#> <dbl> <date> <date> <Interval> <dbl>
#> 1 1 2019-02-01 2019-05-01 2019-02-01 UTC--2019-05-01 UTC 89
#> 2 1 2019-10-05 2019-11-15 2019-10-05 UTC--2019-11-15 UTC 41
#> 3 2 2018-07-01 2018-07-31 2018-07-01 UTC--2018-07-31 UTC 30
#> 4 2 2019-01-01 2019-05-05 2019-01-01 UTC--2019-05-05 UTC 124
#> 5 3 2019-10-01 2019-11-06 2019-10-01 UTC--2019-11-06 UTC 36
#> 6 3 2019-11-20 2019-12-31 2019-11-20 UTC--2019-12-31 UTC 41
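Current approach
I found a related question in another thread (see: ). However, the corresponding solution flags every overlapping row within a group. With that alone, I cannot tell which rows are the largest non-overlapping intervals.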
df$overlap <- unlist(tapply(df$intval, #loop through intervals
df$group, #grouped by id
function(x) rowSums(outer(x,x,int_overlaps)) > 1))
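To illustrate why the flag alone is not enough, here is a minimal base R sketch of the same pairwise check for group 3 only. It assumes closed intervals and uses a hand-written overlap test instead of `int_overlaps`; the variable names are illustrative and not part of the original code.

```r
# Group 3 from the example data (rows 6 to 9), in their original order
start <- as.Date(c("2019-10-01", "2019-10-01", "2019-11-30", "2019-11-20"))
end   <- as.Date(c("2019-11-06", "2019-10-07", "2019-12-10", "2019-12-31"))

idx <- seq_along(start)

# Pairwise overlap matrix: two closed intervals overlap when each one
# starts no later than the other one ends
overlaps <- outer(idx, idx, function(i, j) start[i] <= end[j] & start[j] <= end[i])

# Every interval overlaps itself, so "> 1" means "overlaps at least one other row"
flag <- rowSums(overlaps) > 1
flag
#> [1] TRUE TRUE TRUE TRUE
```

All four rows are flagged, so the flag by itself does not say which row of each overlapping pair should be kept.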
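As an example, consider group 3 in the example data. Here rows 6 and 7 overlap, and so do rows 8 and 9. Since rows 6 and 9 are the largest non-overlapping periods, I want to remove rows 7 and 8.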
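I would be grateful if someone could point me to a solution.
After searching Stack Overflow for related questions, I found that the two approaches referenced in the code comments below can be adapted to my problem: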
# Solution adapted from:
# here https://stackoverflow.com/questions/53213418/collapse-and-merge-overlapping-time-intervals
# and here: https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods/28938694#28938694
# Note: df and df1 created in the initial reprex (above)
df2 <- df %>%
group_by(group) %>%
arrange(group, start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) > # find overlaps
cummax(as.numeric(end)))[-n()])) %>%
ungroup() %>%
group_by(group, indx) %>%
arrange(desc(intval_length)) %>% # retain largest interval
filter(row_number() == 1) %>%
ungroup() %>%
select(-indx) %>%
arrange(group, start)
# Desired output?
identical(df1, df2)
#> [1] TRUE
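The `indx` step is the core of the solution above, so here is a small base R walk-through of it for group 3 alone, after sorting by `start`. This is only a sketch of that one step; the intermediate names (`running_end`, `next_start`, `new_cluster`) are illustrative and do not appear in the original code.

```r
# Group 3 after arrange(group, start): rows 6, 7, 9, 8 of the example data
start <- as.Date(c("2019-10-01", "2019-10-01", "2019-11-20", "2019-11-30"))
end   <- as.Date(c("2019-11-06", "2019-10-07", "2019-12-31", "2019-12-10"))

# Latest end date seen so far (what cummax(as.numeric(end)) computes)
running_end <- cummax(as.numeric(end))

# Start of the following row, NA for the last row (same idea as lead(start))
next_start <- c(as.numeric(start)[-1], NA)

# A new cluster begins whenever the next row starts after everything seen so far
new_cluster <- next_start > running_end

# Shift by one so each row carries the id of its own cluster
indx <- c(0, cumsum(new_cluster)[-length(start)])
indx
#> [1] 0 0 1 1
```

Rows 6 and 7 land in cluster 0 and rows 9 and 8 in cluster 1; the later `arrange(desc(intval_length))` and `filter(row_number() == 1)` then keep the longest row of each cluster.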
This looks non-trivial. If your data is not too large, you can brute-force it. Otherwise, you need an algorithm. This is not really a question.
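As one possible reading of "you need an algorithm", here is a hedged base R sketch of a simple greedy rule: within each group, visit intervals from longest to shortest and keep an interval only if it does not overlap one that is already kept. This is a different technique from the cluster-based solution above, not something taken from the thread, and `keep_longest_greedy` is a hypothetical helper name. It assumes closed intervals, and it is not guaranteed to maximize the total number of days covered.

```r
# Hypothetical helper: greedy selection within a single group.
# Returns the positions (within the group) of the intervals to keep.
keep_longest_greedy <- function(start, end) {
  ord  <- order(as.numeric(end - start), decreasing = TRUE)  # longest first
  kept <- integer(0)
  for (i in ord) {
    # Closed intervals overlap when each starts no later than the other ends
    clash <- any(start[i] <= end[kept] & start[kept] <= end[i])
    if (!clash) kept <- c(kept, i)
  }
  sort(kept)
}

# Same dates as the example data, as a plain data frame
dat <- data.frame(
  group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  start = as.Date(c("2019-01-10", "2019-02-01", "2019-10-05", "2018-07-01", "2019-01-01",
                    "2019-10-01", "2019-10-01", "2019-11-30", "2019-11-20")),
  end   = as.Date(c("2019-02-07", "2019-05-01", "2019-11-15", "2018-07-31", "2019-05-05",
                    "2019-11-06", "2019-10-07", "2019-12-10", "2019-12-31"))
)

# Apply the helper group by group and collect the original row numbers
rows_by_group <- split(seq_len(nrow(dat)), dat$group)
keep_rows <- sort(unname(unlist(lapply(rows_by_group, function(rows) {
  rows[keep_longest_greedy(dat$start[rows], dat$end[rows])]
}))))

keep_rows
#> [1] 2 3 4 5 6 9
```

On the example data this keeps the same rows as `df1`. Unlike the cluster-based approach, it can keep more than one interval from a chain of overlaps (A overlaps B, B overlaps C, but A and C do not), so which rule is appropriate depends on how "largest non-overlapping" is meant.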