In R: joining two data frames based on a time-period condition


As a newcomer to R, I am trying to merge two data frames while taking a time-period condition into account.

df1 <- data.frame("first_event" = c("4f7d", "a10a", "e79b"), "second_event" = c("9346","a839", "d939"), "device_serial" = c("123","123","123") , "start_timestamp" = c("2019-12-06 11:47:0", "2019-09-06 11:47:0", "2019-09-05 10:00:00"),"end_timestamp" = c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28"), "exp_id" = NA)

df2 <- data.frame("device_serial" =  c("123","123") , exp_id= c("a","b") ,    start_timestamp = c("2019-12-03 07:12:20", "2019-09-04 10:00:00") ,       end_timestamp = c("2020-01-17 00:05:10", NULL)     ,    current_event_id = c("1", "2")   ,current_event_timestamp= c("2020-01-17 00:05:09", "2020-01-17 00:05:09"))
The result I am looking for is a table like df3 below:


Thank you for reading this question and helping me solve it.

If I understand correctly, here are a few suggestions.

First, your data, which needed some edits:

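# As per @r2evans' comment, I have assumed the NULL should be NA_real_.
# The current_event_timestamp for df2 in the code of your first data block does not
# match what you typed in the second block; I used the date-time from the second
# block because it leads to the answer you are looking for.

library(dplyr)
library(lubridate)

df1 <-
  df1 %>%
  as_tibble() %>%   # convert to tibble; prints the data type of each column
  select(-exp_id, evnt_start = start_timestamp, evnt_end = end_timestamp) %>%   # drop exp_id (not needed, and it messes up the join) and rename the time columns
  mutate(evnt_start = as_datetime(evnt_start),   # convert the time columns to datetime type
         evnt_end = as_datetime(evnt_end))

df1
# A tibble: 3 x 5
  first_event second_event device_serial evnt_start          evnt_end
1 4f7d        9346         123           2019-12-06 11:47:00 2020-01-10 12:59:38
2 a10a        a839         123           2019-09-06 11:47:00 2019-11-22 12:06:28
3 e79b        d939         123           2019-09-05 10:00:00 2019-11-22 12:06:28

df2 <-
  df2 %>%
  as_tibble() %>%   # convert to tibble
  rename(exp_start = start_timestamp, exp_end = end_timestamp) %>%   # rename the time columns
  mutate_at(.vars = c("exp_start", "exp_end", "current_event_timestamp"), ~as_datetime(.))   # convert the time columns from factor to datetime type

df2   # prints the 2 x 6 tibble shown at the end

The joined result:

# A tibble: 3 x 8
  first_event second_event device_serial evnt_start          evnt_end            exp_id exp_start           exp_end_or_current
1 4f7d        9346         123           2019-12-06 11:47:00 2020-01-10 12:59:38 a      2019-12-03 07:12:20 2020-01-17 00:05:10
2 a10a        a839         123           2019-09-06 11:47:00 2019-11-22 12:06:28 b      2019-09-04 10:00:00 2019-11-23 12:06:28
3 e79b        d939         123           2019-09-05 10:00:00 2019-11-22 12:06:28 b      2019-09-04 10:00:00 2019-11-23 12:06:28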
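The join step itself is not shown above. What follows is only a minimal sketch of one way to arrive at a table like the 3 x 8 result, not necessarily the approach the answer used. It assumes that an event belongs to an experiment when it starts on or after exp_start and ends on or before the experiment's end, and that a missing exp_end falls back to current_event_timestamp (the column name exp_end_or_current is taken from the result above; the filter rule is an assumption). The data are rebuilt inline so the block runs on its own.

```r
library(dplyr)
library(lubridate)

# Events (df1 after the clean-up step)
df1 <- tibble(
  first_event   = c("4f7d", "a10a", "e79b"),
  second_event  = c("9346", "a839", "d939"),
  device_serial = "123",
  evnt_start = as_datetime(c("2019-12-06 11:47:00", "2019-09-06 11:47:00", "2019-09-05 10:00:00")),
  evnt_end   = as_datetime(c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28"))
)

# Experiments (df2 after the clean-up step); experiment "b" has no end yet
df2 <- tibble(
  device_serial = "123",
  exp_id    = c("a", "b"),
  exp_start = as_datetime(c("2019-12-03 07:12:20", "2019-09-04 10:00:00")),
  exp_end   = as_datetime(c("2020-01-17 00:05:10", NA)),
  current_event_timestamp = as_datetime(c("2020-01-17 00:05:09", "2019-11-23 12:06:28"))
)

df3 <- df1 %>%
  # pair every event with every experiment on the same device
  # (newer dplyr versions may warn about a many-to-many match; that is expected here)
  left_join(df2, by = "device_serial") %>%
  # assumption: an open-ended experiment runs until its latest known event
  mutate(exp_end_or_current = coalesce(exp_end, current_event_timestamp)) %>%
  # keep only the pairs where the event lies inside the experiment window
  filter(evnt_start >= exp_start, evnt_end <= exp_end_or_current) %>%
  select(first_event, second_event, device_serial, evnt_start, evnt_end,
         exp_id, exp_start, exp_end_or_current)

df3
```

Because the filter runs after the join, an event that falls inside no experiment window would be dropped rather than kept with a missing exp_id.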
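dplyr does not join on time ranges, but data.table does, using foverlaps or non-equi merges. For elegance and reasonable performance I would suggest data.table, at least for this merge.

BTW, you should not have NULL in your df2$end_timestamp. Because c() drops the NULL, that vector now has length 1, and data.frame happily recycles it across both rows of the column, which is almost certainly not what you want. Did you mean to use NA?

This is an elegant solution to my problem, which was not well written. Thanks :)

@Soren, happy to help! Managing, wrangling and interpreting date-time data can be a headache.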
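The data.table route mentioned in the comments could look roughly like the sketch below, which uses a non-equi join rather than foverlaps. The table names events and exps are made up for illustration, and the missing end of experiment "b" is filled with its current_event_timestamp as an assumption (mirroring the exp_end_or_current column in the answer).

```r
library(data.table)

# Events, with timestamps parsed as POSIXct
events <- data.table(
  first_event   = c("4f7d", "a10a", "e79b"),
  second_event  = c("9346", "a839", "d939"),
  device_serial = "123",
  evnt_start = as.POSIXct(c("2019-12-06 11:47:00", "2019-09-06 11:47:00", "2019-09-05 10:00:00"), tz = "UTC"),
  evnt_end   = as.POSIXct(c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28"), tz = "UTC")
)

# Experiments; the end of "b" is assumed to be its current_event_timestamp
exps <- data.table(
  device_serial = "123",
  exp_id    = c("a", "b"),
  exp_start = as.POSIXct(c("2019-12-03 07:12:20", "2019-09-04 10:00:00"), tz = "UTC"),
  exp_end   = as.POSIXct(c("2020-01-17 00:05:10", "2019-11-23 12:06:28"), tz = "UTC")
)

# Non-equi join: for each event, find the experiment on the same device
# whose window [exp_start, exp_end] contains the whole event
res <- exps[events,
            on = .(device_serial, exp_start <= evnt_start, exp_end >= evnt_end),
            .(first_event, second_event, device_serial,
              evnt_start = i.evnt_start, evnt_end = i.evnt_end, exp_id)]

res
```

With the default nomatch setting, an event that matches no experiment is kept with exp_id set to NA, which differs from the join-then-filter approach where such rows disappear.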
> df1
first_event  second_event  device_serial  start_timestamp      end_timestamp        exp_id
4f7d         9346          123            2019-12-06 11:47:0   2020-01-10 12:59:38  NA
a10a         a839          123            2019-09-06 11:47:0   2019-11-22 12:06:28  NA
e79b         d939          123            2019-09-05 10:00:00  2019-11-22 12:06:28  NA

> df2
device_serial  exp_id  start_timestamp      end_timestamp        current_event_id  current_event_timestamp
123            a       2019-12-03 07:12:20  2020-01-17 00:05:10  1                 2020-01-17 00:05:09
123            b       2019-09-04 10:00:00  NULL                 2                 2019-11-23 12:06:28

> df3
first_event  second_event  device_serial  start_timestamp      end_timestamp        exp_id
4f7d         9346          123            2019-12-06 11:47:0   2020-01-10 12:59:38  a
a10a         a839          123            2019-09-06 11:47:0   2019-11-22 12:06:28  b
e79b         d939          123            2019-09-05 10:00:00  2019-11-22 12:06:28  b

df1 <- data.frame("first_event" = c("4f7d", "a10a", "e79b"), 
                  "second_event" = c("9346","a839", "d939"), 
                  "device_serial" = c("123","123","123") , 
                  "start_timestamp" = c("2019-12-06 11:47:0", "2019-09-06 11:47:0", "2019-09-05 10:00:00"),
                  "end_timestamp" = c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28"), 
                  "exp_id" = NA)

df2 <- data.frame("device_serial" =  c("123","123") , 
                  exp_id= c("a","b") ,    
                  start_timestamp = c("2019-12-03 07:12:20", "2019-09-04 10:00:00") ,       
                  end_timestamp = c("2020-01-17 00:05:10", NA_real_)     ,   
                  current_event_id = c("1", "2")   ,
                  current_event_timestamp= c("2020-01-17 00:05:09", "2019-11-23 12:06:28"))
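The NA_real_ in end_timestamp above replaces the NULL of the original df2. A small base-R illustration of why that matters (the variable names with_null, with_na and recycled are made up for this example):

```r
# c() silently drops NULL, so the vector ends up one element short
with_null <- c("2020-01-17 00:05:10", NULL)
with_na   <- c("2020-01-17 00:05:10", NA)

length(with_null)  # 1
length(with_na)    # 2

# data.frame() recycles the length-1 vector, so both rows get the same end time
recycled <- data.frame(exp_id = c("a", "b"), end_timestamp = with_null)
recycled
```

With NA, the second experiment keeps a genuinely missing end time instead of inheriting the first experiment's value.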
# A tibble: 2 x 6
  device_serial exp_id exp_start           exp_end             current_event_id current_event_timestamp
  <fct>         <fct>  <dttm>              <dttm>              <fct>            <dttm>                 
1 123           a      2019-12-03 07:12:20 2020-01-17 00:05:10 1                2020-01-17 00:05:09    
2 123           b      2019-09-04 10:00:00 NA                  2                2019-11-23 12:06:28