R将一个数据帧中的日期时间有条件地匹配到第二个数据帧中最近的日期时间字段

R将一个数据帧中的日期时间有条件地匹配到第二个数据帧中最近的日期时间字段,r,datetime,merge,dataframe,time-series,R,Datetime,Merge,Dataframe,Time Series,我有两个数据帧,df.events和df.activ df.activ具有非常细粒度的分钟级数据,并且比df.events多出一个数量级的记录(1000000+),df.events具有约100000条记录,同样在分钟级粒度。这两个数据帧有两个公共字段DateTime和Geo。这两个DateTime列都是as.POSIXlt,%Y-%m-%d%H:%m:%S格式 df.activ <- read.table(text= '"DateTim

我有两个数据帧,df.events和df.activ

df.activ具有非常细粒度的分钟级数据,并且比df.events多出一个数量级的记录(1000000+),df.events具有约100000条记录,同样在分钟级粒度。这两个数据帧有两个公共字段DateTime和Geo。这两个DateTime列都是as.POSIXlt,%Y-%m-%d%H:%m:%S格式

df.activ <- read.table(text=
                          '"DateTime","Geo","Bin1","Bin2"
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,510,0,1
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:13:00,618,1,1
                        2014-07-01 00:13:00,510,0,1
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0',header=TRUE,sep=",")

df.events <- read.table(text=
                          '"Units","Geo","DateTime"
                        225,999,2014-07-01 00:09:00
                        40,510,2014-07-01 00:12:00
                        5,999,2014-07-01 00:28:00
                        115,999,2014-07-01 00:44:00
                        0,999,2014-07-01 00:47:00',header=TRUE,sep=",")

df.activ有时很难避免循环,尤其是当您遇到类似的情况时。有时我们会花费大量精力来避免它们,而它们可能是我们所能做到的最好的,或者在性能和/或可读性方面并不落后太多。话虽如此,这将起到以下作用:

df.activ$DateTime <- as.POSIXct(df.activ$DateTime)
df.events$DateTime <- as.POSIXct(df.events$DateTime)

results <- df.activ
results$events.Units=NA
results$events.Geo=NA
results$events.Datetime=NA

for(i in seq_len(nrow(df.activ))) {
  diffs <- order(abs(df.activ$DateTime[i] - df.events$DateTime))
  for(j in seq_along(diffs)) {
    if(df.events$Geo[diffs[j]] == 999) {
      results[i, 5:7] <- df.events[diffs[j],]
      break
    } else if(isTRUE(df.events$Geo[diffs[j]] == df.activ$Geo[i])) {
      results[i, 5:7] <- df.events[diffs[j],]
      break
    }
  }
}

results$events.DateTime <- as.POSIXct(results$events.Datetime,origin = "1970-01-01")

results
              DateTime Geo Bin1 Bin2 events.Units events.Geo events.Datetime     events.DateTime
1  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
2  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
3  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
4  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
5  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
6  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
7  2014-07-01 00:12:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
8  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
9  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
10 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
11 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
12 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
13 2014-07-01 00:13:00 618    1    1          225        999      1404187740 2014-07-01 00:09:00
14 2014-07-01 00:13:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
15 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
16 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
17 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
18 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
19 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
20 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
21 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00

df.activ$DateTime我在工作,这似乎相对解决了,所以我会简短地说。您还可以进行一次完整的外部合并,然后简单地计算日期的差异。使用按日期差异的绝对值排序的distinct


这可能是算法上最快的合并方式,但需要比循环更多的内存(完整合并将有n1*n2个观察值)。

我不明白为什么Geo==510在df中出现两次。理想结果你能回答你的问题吗?@实验者A输入错误A假设,因为在数据中第7行应该是507。是的,那是我的错误。我现在已经改正了。为可能造成的混乱道歉。@leaRningR909我知道多米尼克已经给了你一个很好的答案。万一你想提高性能,你可能要考虑从<代码>数据>表< /COD>包或<代码>内NeLyJoin <代码> > <代码> DPLYR 包。谢谢多米尼克。这很有效。只是一个快速跟进的问题:如果我要将此代码扩展到具有数百万条以上记录的数据帧,那么在这种情况下避免循环是R的最佳实践吗?我将尝试修改[this post]()中的函数+sapply方法,并与system.time和loop方法进行正面对话。非常欢迎!对于你的问题,很难找到一个明确的答案。关于这个话题,已经说了很多,现在仍然在说。我不习惯使用大型数据库,所以一般来说我对循环很在行。我不认为一百万行会使它成为一个问题,除非你有一个内存不足的机器(即使这样,问题更多的是速度而不是内存)。但我很想知道你的情况如何;如果您提出了其他解决方案,请随时向我们更新您的结果。
df.activ$DateTime <- as.POSIXct(df.activ$DateTime)
df.events$DateTime <- as.POSIXct(df.events$DateTime)

results <- df.activ
results$events.Units=NA
results$events.Geo=NA
results$events.Datetime=NA

for(i in seq_len(nrow(df.activ))) {
  diffs <- order(abs(df.activ$DateTime[i] - df.events$DateTime))
  for(j in seq_along(diffs)) {
    if(df.events$Geo[diffs[j]] == 999) {
      results[i, 5:7] <- df.events[diffs[j],]
      break
    } else if(isTRUE(df.events$Geo[diffs[j]] == df.activ$Geo[i])) {
      results[i, 5:7] <- df.events[diffs[j],]
      break
    }
  }
}

results$events.DateTime <- as.POSIXct(results$events.Datetime,origin = "1970-01-01")

results
              DateTime Geo Bin1 Bin2 events.Units events.Geo events.Datetime     events.DateTime
1  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
2  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
3  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
4  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
5  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
6  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
7  2014-07-01 00:12:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
8  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
9  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
10 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
11 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
12 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
13 2014-07-01 00:13:00 618    1    1          225        999      1404187740 2014-07-01 00:09:00
14 2014-07-01 00:13:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
15 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
16 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
17 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
18 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
19 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
20 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
21 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00