R: merging data.tables based on date ranges


I have two tables, policies and claims:

policies<-data.table(policyNumber=c(123,123,124,125), 
                EFDT=as.Date(c("2012-1-1","2013-1-1","2013-1-1","2013-2-1")), 
                EXDT=as.Date(c("2013-1-1","2014-1-1","2014-1-1","2014-2-1")))
> policies
   policyNumber       EFDT       EXDT
1:          123 2012-01-01 2013-01-01
2:          123 2013-01-01 2014-01-01
3:          124 2013-01-01 2014-01-01
4:          125 2013-02-01 2014-02-01


claims<-data.table(claimNumber=c(1,2,3,4), 
                   policyNumber=c(123,123,123,124),
                   lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31")),
                   claimAmount=c(10,20,20,15))
> claims
   claimNumber policyNumber   lossDate claimAmount
1:           1          123 2012-02-01          10
2:           2          123 2012-08-15          20
3:           3          123 2013-01-01          20
4:           4          124 2013-10-31          15
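
The policies table actually contains policy terms, since each row is uniquely identified by a policy number together with an effective date.

I want to merge the two tables so that each claim is associated with a policy term. A claim belongs to a policy term if it has the same policy number and its lossDate falls within the term's effective and expiry dates (the effective date is an inclusive bound, the expiry date an exclusive bound). How do I merge the tables in this way?

This should behave like a left outer join. The result should look like this: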

   policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount
1:          123 2012-01-01 2013-01-01           1 2012-02-01          10
2:          123 2012-01-01 2013-01-01           2 2012-08-15          20
3:          123 2013-01-01 2014-01-01           3 2013-01-01          20
4:          124 2013-01-01 2014-01-01           4 2013-10-31          15
5:          125 2013-02-01 2014-02-01          NA       <NA>          NA

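I think this does most of what you want. I need to run, so I don't have time to add the policies without claims and clean up the columns, but I think the hard part is solved: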

setkey(policies, policyNumber, EXDT)
policies[, EXDT2:=EXDT]
policies[claims[, list( policyNumber, lossDate, lossDate, claimNumber, claimAmount)], roll=-Inf]
#    policyNumber       EXDT       EFDT      EXDT2   lossDate claimNumber claimAmount
# 1:          123 2012-02-01 2012-01-01 2013-01-01 2012-02-01           1          10
# 2:          123 2012-08-15 2012-01-01 2013-01-01 2012-08-15           2          20
# 3:          123 2013-01-01 2012-01-01 2013-01-01 2013-01-01           3          20
# 4:          124 2013-10-31 2013-01-01 2014-01-01 2013-10-31           4          15

Also, note that removing or flagging claims that fall outside the policy dates is straightforward from this result.
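
As an illustration of that last remark, here is a minimal sketch of how the flagging could look. It is not the answer's exact code: it assumes a helper column `lastDay` (EXDT minus one day, a name made up here) as the rolling column so that the expiry date stays exclusive, uses `on=` instead of a key, and adds a hypothetical claim 5 that falls before its policy term.

```r
library(data.table)

policies <- data.table(
  policyNumber = c(123, 123, 124, 125),
  EFDT = as.Date(c("2012-01-01", "2013-01-01", "2013-01-01", "2013-02-01")),
  EXDT = as.Date(c("2013-01-01", "2014-01-01", "2014-01-01", "2014-02-01"))
)
# Claim 5 is hypothetical: its loss date precedes the only term of policy 125.
claims <- data.table(
  claimNumber = c(1, 2, 3, 4, 5),
  policyNumber = c(123, 123, 123, 124, 125),
  lossDate = as.Date(c("2012-02-01", "2012-08-15", "2013-01-01",
                       "2013-10-31", "2012-06-01")),
  claimAmount = c(10, 20, 20, 15, 5)
)

# Assumption: roll on the last covered day rather than on EXDT itself,
# so a loss on an expiry date is not matched to the expiring term.
policies[, lastDay := EXDT - 1L]

# Rolling join: each claim picks the next lastDay on or after its lossDate.
res <- policies[
  claims[, .(policyNumber, lastDay = lossDate, lossDate, claimNumber, claimAmount)],
  on = c("policyNumber", "lastDay"), roll = -Inf
]

# Flag claims whose loss date lies inside [EFDT, EXDT).
res[, inPolicy := !is.na(EFDT) & lossDate >= EFDT & lossDate < EXDT]
res[, lastDay := NULL]

# Claims outside every policy term:
res[inPolicy == FALSE]
```

Dropping the flagged rows instead of listing them is then just `res[inPolicy == TRUE]`.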

Version 1 (updated for data.table v1.9.4+)

Try this:

# Policies table; I've added policyNumber 126:
policies<-data.table(policyNumber=c(123,123,124,125,126), 
                     EFDT=as.Date(c("2012-01-01","2013-01-01","2013-01-01","2013-02-01","2013-02-01")), 
                     EXDT=as.Date(c("2013-01-01","2014-01-01","2014-01-01","2014-02-01","2014-02-01")))

# Claims table; I've added two claims for 126 that are before and after the policy dates:
claims<-data.table(claimNumber=c(1,2,3,4,5,6), 
                   policyNumber=c(123,123,123,124,126,126),
                   lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31","2012-06-01","2014-03-01")),
                   claimAmount=c(10,20,20,15,5,25))

# Set the keys for policies and claims so we can join them:
setkey(policies,policyNumber,EFDT)
setkey(claims,policyNumber,lossDate)

# Join the tables using roll
# ans<-policies[claims,list(EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),roll=T][,EFDT:=NULL] ## This worked with earlier versions of data.table, but broke when they updated the by-without-by behavior...
ans<-policies[claims,list(.EFDT=EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),by=.EACHI,roll=T][,`:=`(EFDT=.EFDT, .EFDT=NULL)]

# The claim should have inPolicy==T where lossDate is between EFDT (inclusive) and EXDT (exclusive):
ans[lossDate>=EFDT & lossDate<EXDT, inPolicy:=T]

# Set the keys again, but this time we'll join on both dates:
setkey(ans,policyNumber,EFDT,EXDT)
setkey(policies,policyNumber,EFDT,EXDT)

# Union the ans table with policies that don't have any claims:
ans<-rbindlist(list(ans, ans[policies][is.na(claimNumber)]))

ans
#   policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount inPolicy
#1:          123 2012-01-01 2013-01-01           1 2012-02-01          10     TRUE
#2:          123 2012-01-01 2013-01-01           2 2012-08-15          20     TRUE
#3:          123 2013-01-01 2014-01-01           3 2013-01-01          20     TRUE
#4:          124 2013-01-01 2014-01-01           4 2013-10-31          15     TRUE
#5:          126       <NA>       <NA>           5 2012-06-01           5    FALSE
#6:          126 2013-02-01 2014-02-01           6 2014-03-01          25    FALSE
#7:          125 2013-02-01 2014-02-01          NA       <NA>          NA       NA

Version 3

Here is another version, using foverlaps():

require(data.table) ## 1.9.4+
setDT(claims)[, lossDate2 := lossDate]
setDT(policies)[, EXDTclosed := EXDT-1L]
setkey(claims, policyNumber, lossDate, lossDate2)
foverlaps(policies, claims, by.x=c("policyNumber", "EFDT", "EXDTclosed"))
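
foverlaps() requires both a start and an end for each range/interval. We therefore duplicate the lossDate column into lossDate2.

Since EXDT is an open (exclusive) bound, we subtract one day from it and store the result in a new column, EXDTclosed.

Now we set the key. foverlaps() requires the last two key columns to be the interval, so they are specified last. We also want the overlap join to match on policyNumber first, so it is specified in the key as well.

We need to set the key on claims (check ?foverlaps). We don't have to set the key on policies, but you can if you wish (then you can skip the by.x argument, as it takes the key by default). Since we don't set a key for policies here, we specify the corresponding columns explicitly in the by.x argument. The overlap type is "any" by default, which we don't need to change (and which is therefore not specified). This results in: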

#    policyNumber claimNumber   lossDate claimAmount  lossDate2       EFDT       EXDT EXDTclosed
# 1:          123           1 2012-02-01          10 2012-02-01 2012-01-01 2013-01-01 2012-12-31
# 2:          123           2 2012-08-15          20 2012-08-15 2012-01-01 2013-01-01 2012-12-31
# 3:          123           3 2013-01-01          20 2013-01-01 2013-01-01 2014-01-01 2013-12-31
# 4:          124           4 2013-10-31          15 2013-10-31 2013-01-01 2014-01-01 2013-12-31
# 5:          125          NA       <NA>          NA       <NA> 2013-02-01 2014-02-01 2014-01-31
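
The remark about by.x can be sketched as follows: if policies is keyed on the same three columns, foverlaps() picks them up from the key and by.x can be left out. This is a minimal, self-contained variant of the code above using the question's data; the name res is introduced here only for illustration.

```r
library(data.table)  # foverlaps() needs 1.9.4+

policies <- data.table(
  policyNumber = c(123, 123, 124, 125),
  EFDT = as.Date(c("2012-01-01", "2013-01-01", "2013-01-01", "2013-02-01")),
  EXDT = as.Date(c("2013-01-01", "2014-01-01", "2014-01-01", "2014-02-01"))
)
claims <- data.table(
  claimNumber = c(1, 2, 3, 4),
  policyNumber = c(123, 123, 123, 124),
  lossDate = as.Date(c("2012-02-01", "2012-08-15", "2013-01-01", "2013-10-31")),
  claimAmount = c(10, 20, 20, 15)
)

claims[, lossDate2 := lossDate]        # point interval: start == end
policies[, EXDTclosed := EXDT - 1L]    # closed end of the half-open term

# Key both tables; the interval columns come last in each key.
setkey(claims, policyNumber, lossDate, lossDate2)
setkey(policies, policyNumber, EFDT, EXDTclosed)

# by.x defaults to key(policies), by.y to key(claims).
res <- foverlaps(policies, claims)
res
```

Whether to key policies or pass by.x is mostly a matter of taste; keying also sorts policies, which may or may not be wanted.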

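I've had problems using roll when what you really want is to match on a range where there may be duplicates. If you have trouble getting the result you expect, one approach is to convert each range into a unique row for every possible value. An example is this question, which, oddly enough, eddi answered.

@DeanMacGregor that was written before I learned how to use roll :) You can see the comments there pointing to another SO post where a very similar question was solved using roll.

@eddi I know roll is supposed to handle these cases, but the "old way" seems to work better for data that overlap in odd ways. I've had other problems where roll should have worked, but perhaps because my dataset had strange overlaps, or for some other reason, it didn't give the results I expected. Long story short, the first attempt should of course be roll, but if you're banging your head against the wall over some mismatches, a pre-transformation may be needed.

@eddi roll seems to delete my lossDate column. Do you know how I can keep that column in the result?

Thanks for the reply. It throws an error the first time ans is set: object 'EFDT' not found... Could the data.table version make a difference (I'm using 1.8.11)?

As a workaround, see what happens if you take the last bit ([,EFDT:=NULL]) off that line of code.

@dnlbrky, yes, in 1.8.11 key columns are also visible in j during by-without-by. It's more or less related to this.

@BenGorman, what column names do you get when you run policies[claims]?

+1 @dnlbrky, this can be simplified using the foverlaps() function in 1.9.4+. Would you like to try it?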
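
For reference, the "one row per possible value" approach mentioned in the first comment could look roughly like the sketch below. It is only an illustration under assumptions, not code from the thread: the names days, claimTerms and result are made up, and the expansion creates one row per covered day, so it only suits reasonably short terms.

```r
library(data.table)

policies <- data.table(
  policyNumber = c(123, 123, 124, 125),
  EFDT = as.Date(c("2012-01-01", "2013-01-01", "2013-01-01", "2013-02-01")),
  EXDT = as.Date(c("2013-01-01", "2014-01-01", "2014-01-01", "2014-02-01"))
)
claims <- data.table(
  claimNumber = c(1, 2, 3, 4),
  policyNumber = c(123, 123, 123, 124),
  lossDate = as.Date(c("2012-02-01", "2012-08-15", "2013-01-01", "2013-10-31")),
  claimAmount = c(10, 20, 20, 15)
)

# Expand every policy term into one row per covered day.
# EXDT is exclusive, so the last covered day is EXDT - 1.
days <- policies[, .(lossDate = seq(EFDT, EXDT - 1L, by = "day")),
                 by = .(policyNumber, EFDT, EXDT)]

# Plain equi-join: each claim finds the term that covers its loss date.
claimTerms <- days[claims, on = c("policyNumber", "lossDate"), nomatch = 0L]

# Left outer join back onto policies so terms without claims are kept.
result <- merge(policies, claimTerms,
                by = c("policyNumber", "EFDT", "EXDT"), all.x = TRUE)
result
```

The trade-off is memory: the expanded table grows with the total number of covered days, whereas roll and foverlaps() work on the ranges directly.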