R: Create a new column based on whether a date falls between other dates across multiple time periods


I have a table with several rows per ID, one for each tax year end date:

df1 <- tibble::tribble(~ID,       ~TAX_YEAR_END_DATE,
                       "01",      "2009-04-06",
                       "01",      "2010-04-06",
                       "01",      "2011-04-06",
                       "02",      "2010-04-06",
                       "02",      "2011-04-06",
                       "02",      "2012-04-06")
I found that I can join the tables by ID and then apply a rule while creating a new column with mutate(): if TY_END_DATE falls between START_DATE and END_DATE, use STATUS; if it does not, do not use STATUS.

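Where I get stuck is with borrowers who have more than one employment period in the second table. In that case, when I perform the join, the rows from the first table are duplicated (or worse), and I have not found an alternative approach.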
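The duplication described above can be reproduced with a minimal base R sketch. The employment periods are copied from the sample data given in the answers further down; the names joined, in_period, and status are illustrative assumptions, not part of the original question. The sketch also shows one possible way to collapse the duplicated rows back to one row per ID and tax year end date.

```r
# Sample data (same values as the data listed with the answers below)
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L),
  TAX_YEAR_END_DATE = as.Date(c("2009-04-06", "2010-04-06", "2011-04-06",
                                "2010-04-06", "2011-04-06", "2012-04-06"))
)
df2 <- data.frame(
  ID = c(1L, 2L, 2L),
  START_DATE = as.Date(c("2007-09-11", "2008-06-06", "2011-09-09")),
  END_DATE   = as.Date(c("2010-04-06", "2010-04-06", "2014-04-06"))
)

# Joining on ID alone pairs every tax year row with every period of that ID
joined <- merge(df1, df2, by = "ID")
nrow(df1)     # 6
nrow(joined)  # 9: ID 2 has two periods, so its three rows appear twice

# Flag whether the tax year end date lies inside each joined period
joined$in_period <- joined$TAX_YEAR_END_DATE >= joined$START_DATE &
  joined$TAX_YEAR_END_DATE <= joined$END_DATE

# Collapse back to one row per ID and date: TRUE if any period matched
status <- aggregate(joined["in_period"],
                    by = joined[c("ID", "TAX_YEAR_END_DATE")],
                    FUN = any)
status <- status[order(status$ID, status$TAX_YEAR_END_DATE), ]
```

The key point is that the duplication is harmless as long as the result is grouped by the original key columns and reduced with any().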


I am using R and would prefer data.table because it is usually faster, but dplyr is fine too.

One dplyr and lubridate solution could be:

library(dplyr)
library(lubridate)

df1 %>%
 left_join(df2) %>%
 group_by(ID, TAX_YEAR_END_DATE) %>%
 summarise(STATUS = any(int_overlaps(interval(TAX_YEAR_END_DATE, TAX_YEAR_END_DATE),
                                     interval(START_DATE, END_DATE))))

     ID TAX_YEAR_END_DATE STATUS
  <int> <chr>             <lgl> 
1     1 2009-04-06        TRUE  
2     1 2010-04-06        TRUE  
3     1 2011-04-06        FALSE 
4     2 2010-04-06        TRUE  
5     2 2011-04-06        FALSE 
6     2 2012-04-06        TRUE  

One solution that uses a join to relate the tables, followed by a summary:

df1 %>% left_join(df2, by = "ID") %>% 
  mutate(employed = between(TAX_YEAR_END_DATE, START_DATE, END_DATE)) %>% 
  group_by(ID, TAX_YEAR_END_DATE) %>% 
  summarise(employed = any(employed))

An option using a non-equi join in data.table:

DT1[, status := c("NOT","EMP")[
    DT2[.SD, on=.(ID, START_DATE<=TAX_YEAR_END_DATE, END_DATE>=TAX_YEAR_END_DATE),
        by=.EACHI, .N>0L]$V1 + 1L
]]
A base R solution, followed by the sample data it uses:

# Create a lookup data.frame for the durations in which ID was employed:
# dates_ro => data.frame
dates_ro <- data.frame(do.call("rbind", lapply(split(df2, rownames(df2)), function(x){
      data.frame(id = x$ID, 
                 emp_date = seq.Date(x$START_DATE, x$END_DATE, by = "days"))
    }
  )
),
row.names = NULL)

# Look up whether or not the person is employed at the tax year end date.
# ID and date are matched together so a date only counts for its own ID.
# STATUS => character vector
df1$STATUS <- ifelse(
  paste(df1$ID, df1$TAX_YEAR_END_DATE) %in% paste(dates_ro$id, dates_ro$emp_date),
  "EMPLOYED", "UNEMPLOYED")
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), TAX_YEAR_END_DATE = structure(c(14340, 
14705, 15070, 14705, 15070, 15436), class = "Date")), 
class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(ID = c(1L, 2L, 2L), START_DATE = structure(c(13767, 
14036, 15226), class = "Date"), END_DATE = structure(c(14705, 
14705, 16166), class = "Date")), class = "data.frame", row.names = c(NA, -3L))
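
The day-by-day lookup in the base R solution above grows with the length of every employment period. As a hedged alternative, the sketch below checks each tax year row directly against the periods of its own ID, with no join and no expansion into days. The variable employed is an illustrative name, and the EMPLOYED/UNEMPLOYED labels simply reuse the ones from the base R solution.

```r
# Same sample data as above, rebuilt from date strings for readability
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L),
  TAX_YEAR_END_DATE = as.Date(c("2009-04-06", "2010-04-06", "2011-04-06",
                                "2010-04-06", "2011-04-06", "2012-04-06"))
)
df2 <- data.frame(
  ID = c(1L, 2L, 2L),
  START_DATE = as.Date(c("2007-09-11", "2008-06-06", "2011-09-09")),
  END_DATE   = as.Date(c("2010-04-06", "2010-04-06", "2014-04-06"))
)

# For each row of df1, test the date against every period of the same ID.
# any() over zero periods is FALSE, so IDs missing from df2 count as unemployed.
employed <- vapply(seq_len(nrow(df1)), function(i) {
  periods <- df2[df2$ID == df1$ID[i], ]
  d <- df1$TAX_YEAR_END_DATE[i]
  any(d >= periods$START_DATE & d <= periods$END_DATE)
}, logical(1))

df1$STATUS <- ifelse(employed, "EMPLOYED", "UNEMPLOYED")
```

This keeps df1 at its original six rows throughout. It loops in R, so for large tables the data.table non-equi join is likely the faster choice.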

Output of the data.table non-equi join:

   ID TAX_YEAR_END_DATE status
1:  1        2009-04-06    EMP
2:  1        2010-04-06    EMP
3:  1        2011-04-06    NOT
4:  2        2010-04-06    EMP
5:  2        2011-04-06    NOT
6:  2        2012-04-06    EMP

Data for the data.table option:

library(data.table)
DT1 <- fread("ID      TAX_YEAR_END_DATE
01      2009-04-06
01      2010-04-06
01      2011-04-06
02      2010-04-06
02      2011-04-06
02      2012-04-06")[, 
    TAX_YEAR_END_DATE := as.IDate(TAX_YEAR_END_DATE)]

cols <- c("START_DATE", "END_DATE")
DT2 <- fread("ID    START_DATE    END_DATE
01    2007-09-11    2010-04-06
02    2008-06-06    2010-04-06
02    2011-09-09    2014-04-06")[, 
     (cols) := lapply(.SD, as.IDate), .SDcols=cols]