R: Create a new column based on whether a date falls between other dates across multiple periods
Tags: r, dplyr, data.table, data-manipulation
I have a table with multiple rows, one for each tax year end date:
df1 <- tibble::tribble(~ID, ~TAX_YEAR_END_DATE,
"01", "2009-04-06",
"01", "2010-04-06",
"01", "2011-04-06",
"02", "2010-04-06",
"02", "2011-04-06",
"02", "2012-04-06")
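I found that I can join this table by ID to a second table, df2, which holds a START_DATE and END_DATE for each employment spell, and then apply some rules when creating a new column with mutate(): if TAX_YEAR_END_DATE falls between START_DATE and END_DATE, the status is employed; if not, the status is not employed.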
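Where I get stuck is with borrowers who have more than one employment spell in the second table. In that case, when I perform the join, the rows from the first table are duplicated (or worse), and I have not found an alternative approach.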
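To make the duplication concrete, here is a minimal base R sketch. It assumes df2 looks like the employment data given in the answers further down (three spells, two of them for ID 02) and that all dates are stored as Date; neither table is shown in this form in the question itself.

```r
# Tax year end dates, one row per ID and tax year (values from the post)
df1 <- data.frame(
  ID = c("01", "01", "01", "02", "02", "02"),
  TAX_YEAR_END_DATE = as.Date(c("2009-04-06", "2010-04-06", "2011-04-06",
                                "2010-04-06", "2011-04-06", "2012-04-06"))
)

# Employment spells (assumed layout; values taken from the answers' data)
df2 <- data.frame(
  ID = c("01", "02", "02"),
  START_DATE = as.Date(c("2007-09-11", "2008-06-06", "2011-09-09")),
  END_DATE   = as.Date(c("2010-04-06", "2010-04-06", "2014-04-06"))
)

# A plain left join pairs every df1 row with every spell of the same ID,
# so each row for ID 02 appears twice after the join.
joined <- merge(df1, df2, by = "ID", all.x = TRUE)

nrow(df1)         # rows before the join
nrow(joined)      # rows after the join
table(joined$ID)  # rows per ID after the join
```

The join itself is not wrong; it just has to be followed by a step that collapses the result back to one row per ID and TAX_YEAR_END_DATE.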
I am using R, and I prefer data.table because it is usually faster, but dplyr is fine too.

One dplyr and lubridate solution could be:
df1 %>%
  left_join(df2) %>%
  group_by(ID, TAX_YEAR_END_DATE) %>%
  summarise(STATUS = any(int_overlaps(interval(TAX_YEAR_END_DATE, TAX_YEAR_END_DATE),
                                      interval(START_DATE, END_DATE))))
ID TAX_YEAR_END_DATE STATUS
<int> <chr> <lgl>
1 1 2009-04-06 TRUE
2 1 2010-04-06 TRUE
3 1 2011-04-06 FALSE
4 2 2010-04-06 TRUE
5 2 2011-04-06 FALSE
6 2 2012-04-06 TRUE
One solution: use a join to relate the tables, then summarise.
df1 %>%
  left_join(df2, by = "ID") %>%
  mutate(employed = between(TAX_YEAR_END_DATE, START_DATE, END_DATE)) %>%
  group_by(ID, TAX_YEAR_END_DATE) %>%
  summarise(employed = any(employed))
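As far as I know, dplyr::between() only accepts vector bounds in newer dplyr releases (1.1.0 onwards, if I remember correctly), so the snippet above may fail on older installations. A version-independent sketch of the same join-then-summarise idea, written with explicit comparisons, might look like this. The tibble definitions are my own restatement of the post's data with dates converted to Date, and recent dplyr versions may warn about a many-to-many join here, which is expected.

```r
library(dplyr)

# Data restated from the post, with dates as Date (an assumption of this sketch)
df1 <- tibble(
  ID = c("01", "01", "01", "02", "02", "02"),
  TAX_YEAR_END_DATE = as.Date(c("2009-04-06", "2010-04-06", "2011-04-06",
                                "2010-04-06", "2011-04-06", "2012-04-06"))
)
df2 <- tibble(
  ID = c("01", "02", "02"),
  START_DATE = as.Date(c("2007-09-11", "2008-06-06", "2011-09-09")),
  END_DATE   = as.Date(c("2010-04-06", "2010-04-06", "2014-04-06"))
)

status <- df1 %>%
  # one row per (tax year end date, employment spell) pair
  left_join(df2, by = "ID") %>%
  # inclusive bounds, written out so no vectorised between() is needed
  mutate(employed = TAX_YEAR_END_DATE >= START_DATE &
                    TAX_YEAR_END_DATE <= END_DATE) %>%
  group_by(ID, TAX_YEAR_END_DATE) %>%
  # collapse back to one row per ID and date; na.rm = TRUE makes IDs
  # without any spell in df2 come out as FALSE rather than NA
  summarise(employed = any(employed, na.rm = TRUE), .groups = "drop")

status
```

The summarise() step is what undoes the row duplication: however many spells an ID has, each ID and TAX_YEAR_END_DATE pair ends up as a single row.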
An option using a non-equi join in data.table:
DT1[, status := c("NOT","EMP")[
DT2[.SD, on=.(ID, START_DATE<=TAX_YEAR_END_DATE, END_DATE>=TAX_YEAR_END_DATE),
by=.EACHI, .N>0L]$V1 + 1L
]]
Data: DT1 and DT2 are built with fread(); that code is listed at the end of this page.

A base R alternative builds a per-day lookup table of employment dates:
# Create a lookup data.frame with one row per day on which each ID was employed:
# dates_ro => data.frame
dates_ro <- data.frame(
  do.call("rbind", lapply(split(df2, rownames(df2)), function(x) {
    data.frame(id = x$ID,
               emp_date = seq.Date(x$START_DATE, x$END_DATE, by = "days"))
  })),
  row.names = NULL)
# Look up whether or not the person is employed at the tax year end date.
# ID and date are matched together as a single key, so a date is only
# checked against that ID's own employment spells.
# STATUS => character vector
df1$STATUS <- ifelse(
  paste(df1$ID, df1$TAX_YEAR_END_DATE) %in%
    paste(dates_ro$id, dates_ro$emp_date),
  "EMPLOYED", "UNEMPLOYED")
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), TAX_YEAR_END_DATE = structure(c(14340,
14705, 15070, 14705, 15070, 15436), class = "Date")),
class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = c(1L, 2L, 2L), START_DATE = structure(c(13767,
14036, 15226), class = "Date"), END_DATE = structure(c(14705,
14705, 16166), class = "Date")), class = "data.frame", row.names = c(NA, -3L))
Output of the data.table non-equi join shown earlier, followed by the code that builds DT1 and DT2:
ID TAX_YEAR_END_DATE status
1: 1 2009-04-06 EMP
2: 1 2010-04-06 EMP
3: 1 2011-04-06 NOT
4: 2 2010-04-06 EMP
5: 2 2011-04-06 NOT
6: 2 2012-04-06 EMP
library(data.table)
DT1 <- fread("ID TAX_YEAR_END_DATE
01 2009-04-06
01 2010-04-06
01 2011-04-06
02 2010-04-06
02 2011-04-06
02 2012-04-06")[,
TAX_YEAR_END_DATE := as.IDate(TAX_YEAR_END_DATE)]
cols <- c("START_DATE", "END_DATE")
DT2 <- fread("ID START_DATE END_DATE
01 2007-09-11 2010-04-06
02 2008-06-06 2010-04-06
02 2011-09-09 2014-04-06")[,
(cols) := lapply(.SD, as.IDate), .SDcols=cols]
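The non-equi join answer packs several steps into one expression. The sketch below spells them out one at a time. It is only an illustration: the tables are rebuilt with data.table() and character IDs rather than fread(), and the intermediate name n_spells is mine, not the answerer's.

```r
library(data.table)

# Same values as the post's data, built directly instead of via fread()
DT1 <- data.table(
  ID = c("01", "01", "01", "02", "02", "02"),
  TAX_YEAR_END_DATE = as.IDate(c("2009-04-06", "2010-04-06", "2011-04-06",
                                 "2010-04-06", "2011-04-06", "2012-04-06"))
)
DT2 <- data.table(
  ID = c("01", "02", "02"),
  START_DATE = as.IDate(c("2007-09-11", "2008-06-06", "2011-09-09")),
  END_DATE   = as.IDate(c("2010-04-06", "2010-04-06", "2014-04-06"))
)

# Step 1: for each row of DT1, count the DT2 spells of the same ID whose
# [START_DATE, END_DATE] range contains the tax year end date.
# by = .EACHI evaluates .N once per DT1 row, so the result has exactly
# as many entries as DT1 has rows, in DT1's row order.
n_spells <- DT2[DT1, .N,
                on = .(ID,
                       START_DATE <= TAX_YEAR_END_DATE,
                       END_DATE   >= TAX_YEAR_END_DATE),
                by = .EACHI]$N

# Step 2: turn the counts into labels; FALSE + 1L picks "NOT", TRUE + 1L "EMP"
DT1[, status := c("NOT", "EMP")[(n_spells > 0L) + 1L]]

DT1
```

Because the matching happens inside the join and only a count comes back per DT1 row, DT1 itself is never expanded, which is why the duplicated-rows problem from the question does not arise with this approach.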