R: How to match rows by the nearest date across two data frames?


Suppose I have two data frames such as:

set.seed(123)
df1<-data.frame(bmi=rnorm(20, 25, 5),
                date1=sample(seq.Date(as.Date("2014-01-01"),
                             as.Date("2014-02-28"),by="day"), 20))

df2<-data.frame(epi=1:5, 
                date2=as.Date(c("2014-1-8", "2014-1-15", "2014-1-28", 
                                "2014-2-05", "2014-2-24")))

One approach is a rolling join with roll=Inf from the data.table package, as follows:

require(data.table)   ## >= 1.9.2
setDT(df1)            ## convert to data.table by reference
setDT(df2)            ## same

df1[, date := date1]  ## create a duplicate of 'date1'
setkey(df1, date1)    ## set the column to perform the join on
setkey(df2, date2)    ## same as above

ans = df1[df2, roll=Inf] ## perform rolling join

## change names and set column order as required, by reference
setnames(ans, c('date','date1'), c('date1','date2'))
setcolorder(ans, c('epi', 'date1', 'bmi', 'date2'))

> ans
#   epi      date1      bmi      date2
#1:   1 2014-01-08 33.57532 2014-01-08
#2:   2 2014-01-15 22.63604 2014-01-15
#3:   3 2014-01-26 22.22079 2014-01-28
#4:   4 2014-02-01 15.16691 2014-02-05
#5:   5 2014-02-15 27.48925 2014-02-24
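The join above picks the last date1 on or before each date2. As a minimal sketch of how that differs from a true nearest-date match, the example below runs the same join with roll = Inf and with roll = "nearest". The tables measurements and episodes and the column bmi_date are hypothetical names with made-up values, chosen so the result can be checked by hand; they are not the data from the question.

```r
library(data.table)

# Hypothetical data, small enough to verify by eye
measurements <- data.table(
  bmi   = c(21, 24, 27),
  date1 = as.Date(c("2014-01-05", "2014-01-20", "2014-02-10"))
)
episodes <- data.table(
  epi   = 1:2,
  date2 = as.Date(c("2014-01-18", "2014-02-01"))
)

# Keep a copy of the measurement date: after the join, the join column
# 'date1' holds the values of 'date2'
measurements[, bmi_date := date1]

# roll = Inf: last measurement on or before each date2
before <- measurements[episodes, on = .(date1 = date2), roll = Inf]

# roll = "nearest": closest measurement in either direction
nearest <- measurements[episodes, on = .(date1 = date2), roll = "nearest"]

before$bmi   # 21 24 (measured on 2014-01-05 and 2014-01-20)
nearest$bmi  # 24 27 (measured on 2014-01-20 and 2014-02-10)
```

Which variant is right depends on whether a measurement taken after the episode date is acceptable; the question as posted asks for the value from on or before the date, which corresponds to roll = Inf.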
Here is an approach using base R:

# get time differences
temp <- outer(df2$date2, df1$date1,  "-")

# remove where date1 are after date2
temp[temp < 0] <- NA

# find index of minimum
ind <- apply(temp, 1, function(i) which.min(i))

# output
df2 <- cbind(df2,  df1[ind,])
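One caveat with the code above: if some date2 has no date1 on or before it, that row of temp is entirely NA, which.min() returns integer(0) for it, and apply() then returns a list rather than an index vector. A possible alternative, sketched here with findInterval() on hypothetical data (measurements and episodes are illustrative names, not the data frames from the question), handles that case by producing NA:

```r
# Hypothetical data; the third episode has no earlier measurement
measurements <- data.frame(
  bmi   = c(27, 21, 24),
  date1 = as.Date(c("2014-02-10", "2014-01-05", "2014-01-20"))
)
episodes <- data.frame(
  epi   = 1:3,
  date2 = as.Date(c("2014-01-18", "2014-02-01", "2014-01-01"))
)

# findInterval() requires its break points in increasing order
measurements <- measurements[order(measurements$date1), ]

# For each date2: position of the last date1 that is <= date2, or 0 if none
pos <- findInterval(as.numeric(episodes$date2), as.numeric(measurements$date1))

# Turn 0 into NA so unmatched episodes get NA columns instead of being dropped
pos[pos == 0] <- NA

result <- data.frame(episodes, measurements[pos, ], row.names = NULL)
result$bmi  # 21 24 NA
```

This avoids building the full length(date2) by length(date1) difference matrix, which may matter for larger inputs.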

An alternative based on finding the index of the closest date:

library(tidyverse)
# Function to get the index specifying closest or after
Ind_closest_or_after <- function(d1, d2){
  which.min(ifelse(d1 - d2 < 0, Inf, d1 - d2))
}

# Calculate the indices
closest_or_after_ind <- map_int(.x = df2$date2, .f = Ind_closest_or_after, d2 = df1$date1)

# Add index columns to the data frames and join
df1 <- df1 %>% 
  mutate(ind = 1:nrow(df1))

df2 <- df2 %>% 
  mutate(ind = closest_or_after_ind)

left_join(df2, df1, by = 'ind')
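To illustrate what the map_int() step does, here is a rough base R equivalent using vapply(). The helper index_on_or_before and the vectors candidates and targets are hypothetical names with made-up values; the helper mirrors the logic of Ind_closest_or_after but is not the answer's own code.

```r
# Hypothetical helper: index of the candidate date closest to, and not after, the target
index_on_or_before <- function(target, candidates) {
  gap <- as.numeric(target) - as.numeric(candidates)  # days from each candidate to the target
  gap[gap < 0] <- Inf                                 # exclude candidates after the target
  which.min(gap)                                      # position of the smallest remaining gap
}

candidates <- as.Date(c("2014-01-05", "2014-01-20", "2014-02-10"))
targets    <- as.Date(c("2014-01-18", "2014-02-01"))

# vapply() calls the helper once per target and requires each call to return
# exactly one integer, the same contract that map_int() enforces
idx <- vapply(seq_along(targets),
              function(i) index_on_or_before(targets[i], candidates),
              integer(1))

idx              # 1 2
candidates[idx]  # "2014-01-05" "2014-01-20"
```

Note that which.min() returns 1 when every gap is Inf, so with this logic a target that precedes all candidates would silently be matched to the first candidate; a check for that case may be needed on real data.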

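Thank you, Arun! But in my example the bmi is the bmi from before or on date1.
Yes, this is great!
Thanks Arun, +1. I don't like to nitpick, but for the sake of precision: the date column contains values from date1, while the date1 column contains values from date2. So setnames should be more like setnames(ans,c('date','date1'),c('date1','date2'))
@BarbaraBukhvalova, right. Please feel free to edit the code. I'll be happy to approve the edit.
Barbara, approved. Thanks again.
Could you clarify the map_int part?
Thanks for the dplyr answer as a contrast to data.table.
Thanks @Jeff Parker, please see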