R: Subset rows by group and logical expression with data.table

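Suppose I have this data. I am subsetting it so that I keep a row only if its timestamp is more than 5 seconds later than the previous row of the same color. I specifically want to use data.table for speed.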

Sample data

               timestamp  Color   var1
  1: 2015-04-04 16:56:52    red group1
  2: 2015-04-04 16:56:53    red group1
  3: 2015-04-04 16:56:54    red group1
  4: 2015-04-04 16:57:06    red group1
  5: 2015-04-04 16:57:07    red group1
  6: 2015-04-04 16:57:09    red group1
  7: 2015-04-04 16:57:10    red group1
  8: 2015-04-04 16:57:11    red group1
  9: 2015-04-04 16:57:12    red group1
 10: 2015-04-04 16:57:13    red group1
 11: 2015-04-04 16:57:14    red group1
 12: 2015-04-04 16:57:15    red group1
 13: 2015-04-04 16:57:17    red group1
 14: 2015-04-04 16:57:18    red group1
 15: 2015-04-04 16:57:19    red group1
 16: 2015-04-04 16:57:20    red group1
 17: 2015-04-04 16:57:21    red group1
 18: 2015-04-04 16:57:22    red group1
 19: 2015-04-04 16:57:23    red group1
 20: 2015-04-04 16:57:24    red group1
 21: 2015-04-04 16:57:25    red group1
 22: 2015-04-04 16:57:26    red group1
 23: 2015-04-04 16:57:27    red group1
 24: 2015-04-04 16:57:39    red group1
 25: 2015-04-04 16:57:40    red group1
 26: 2015-04-04 16:57:41    red group1
 27: 2015-04-04 16:58:02    red group1
 28: 2015-04-04 16:58:31 yellow group1
 29: 2015-04-04 16:58:31 yellow group1
 30: 2015-04-04 16:58:32 yellow group1
 31: 2015-04-04 16:58:34    red group1
 32: 2015-04-04 16:58:35    red group1
 33: 2015-04-04 16:58:36    red group1
 34: 2015-04-04 16:58:37    red group1
 35: 2015-04-04 16:58:38    red group1
 36: 2015-04-04 16:58:39    red group1
 37: 2015-04-04 16:58:40    red group1
 38: 2015-04-04 16:58:41    red group1
 39: 2015-04-04 16:58:42    red group1
 40: 2015-04-04 16:58:43    red group1
 41: 2015-04-04 16:58:44    red group1
 42: 2015-04-04 16:58:45    red group1
 43: 2015-04-04 16:58:46    red group1
 44: 2015-04-04 16:58:47    red group1
 45: 2015-04-04 16:58:48    red group1
 46: 2015-04-04 16:58:49    red group1
 47: 2015-04-04 16:58:50    red group1
 48: 2015-04-04 16:58:51    red group1
 49: 2015-04-04 16:58:52    red group1
 50: 2015-04-04 16:58:53    red group1
 51: 2015-04-04 16:58:54    red group1
 52: 2015-04-04 16:58:55    red group1
 53: 2015-04-04 16:58:56    red group1
 54: 2015-04-04 16:58:57    red group1
 55: 2015-04-04 16:58:58    red group1
 56: 2015-04-04 16:58:59    red group1
 57: 2015-04-04 16:59:00    red group1
 58: 2015-04-04 16:59:01    red group1
 59: 2015-04-04 16:59:02    red group1
 60: 2015-04-04 16:59:03    red group1
 61: 2015-04-04 16:59:04    red group1
 62: 2015-04-04 16:59:05    red group1
 63: 2015-04-04 16:59:06    red group1
 64: 2015-04-04 16:59:07    red group1
 65: 2015-04-04 16:59:08    red group1
 66: 2015-04-04 16:59:09    red group1
 67: 2015-04-04 16:59:10    red group1
 68: 2015-04-04 16:59:11    red group1
 69: 2015-04-04 16:59:12    red group1
 70: 2015-04-04 16:59:13    red group1
 71: 2015-04-04 16:59:14    red group1
 72: 2015-04-04 16:59:15    red group1
 73: 2015-04-04 16:59:16    red group1
 74: 2015-04-04 16:59:17    red group1
 75: 2015-04-04 16:59:18    red group1
 76: 2015-04-04 16:59:19    red group1
 77: 2015-04-04 16:59:20    red group1
 78: 2015-04-04 16:59:21    red group1
 79: 2015-04-04 16:59:22    red group1
 80: 2015-04-04 16:59:23    red group1
 81: 2015-04-04 16:59:24    red group1
 82: 2015-04-04 16:59:25    red group1
 83: 2015-04-04 16:59:26    red group1
 84: 2015-04-04 16:59:27    red group1
 85: 2015-04-04 16:59:28    red group1
 86: 2015-04-04 16:59:29    red group1
 87: 2015-04-04 16:59:33 yellow group1
 88: 2015-04-04 16:59:59 yellow group1
 89: 2015-04-04 17:00:00 yellow group1
 90: 2015-04-04 17:00:01 yellow group1
 91: 2015-04-04 17:00:02 yellow group1
 92: 2015-04-04 17:00:03 yellow group1
 93: 2015-04-04 17:00:32 yellow group1
 94: 2015-04-04 17:00:33 yellow group1
 95: 2015-04-04 17:00:45    red group1
 96: 2015-04-04 17:00:46    red group1
 97: 2015-04-04 17:00:47 yellow group1
 98: 2015-04-04 17:00:47    red group1
 99: 2015-04-04 17:00:48 yellow group1
100: 2015-04-04 17:00:49 yellow group1
               timestamp  Color   var1
Here is what I have so far:

DT[DT[, .I[timestamp - lag(timestamp)>5], by = Color]$V1]
This gives me:

              timestamp  Color   var1
 1:                <NA>     NA     NA
 2: 2015-04-04 16:57:06    red group1
 3: 2015-04-04 16:57:39    red group1
 4: 2015-04-04 16:58:02    red group1
 5: 2015-04-04 16:58:34    red group1
 6: 2015-04-04 17:00:45    red group1
 7:                <NA>     NA     NA
 8: 2015-04-04 16:59:33 yellow group1
 9: 2015-04-04 16:59:59 yellow group1
10: 2015-04-04 17:00:32 yellow group1
11: 2015-04-04 17:00:47 yellow group1
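This seems to work. However, I also want to keep the first row of each group (Color). Here those rows clearly come back as NA, because that is what the logical expression evaluates to for them. Is there a way to do this and keep the first rows in a single expression, without having to re-insert them afterwards?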
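To illustrate where the NA rows come from, here is a minimal sketch on a hypothetical three-row table (the names `toy`, `gap`, `idx_na` and `idx_clean` are made up for this example and are not part of the question's data). Indexing `.I` with a logical vector that contains NA keeps an NA index, and that index becomes an all-NA row when used to subset. Wrapping the condition in `which()` is one way to drop it, though it does not by itself keep the first row of each group.

```r
library(data.table)

# Hypothetical toy table, not the OP's data
toy <- data.table(x = c(10, 11, 20))

# Difference to the previous value; the first element has no predecessor, so it is NA
gap <- toy$x - shift(toy$x)

# Logical indexing keeps the NA: the result is c(NA, 3)
idx_na <- toy[, .I[gap > 5]]

# which() returns only the positions that are TRUE, so the NA is dropped
idx_clean <- toy[, .I[which(gap > 5)]]

print(idx_na)
print(idx_clean)
```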

Data to reproduce the example

DT <- structure(list(timestamp = structure(c(1428181012, 1428181013, 
1428181014, 1428181026, 1428181027, 1428181029, 1428181030, 1428181031, 
1428181032, 1428181033, 1428181034, 1428181035, 1428181037, 1428181038, 
1428181039, 1428181040, 1428181041, 1428181042, 1428181043, 1428181044, 
1428181045, 1428181046, 1428181047, 1428181059, 1428181060, 1428181061, 
1428181082, 1428181111, 1428181111, 1428181112, 1428181114, 1428181115, 
1428181116, 1428181117, 1428181118, 1428181119, 1428181120, 1428181121, 
1428181122, 1428181123, 1428181124, 1428181125, 1428181126, 1428181127, 
1428181128, 1428181129, 1428181130, 1428181131, 1428181132, 1428181133, 
1428181134, 1428181135, 1428181136, 1428181137, 1428181138, 1428181139, 
1428181140, 1428181141, 1428181142, 1428181143, 1428181144, 1428181145, 
1428181146, 1428181147, 1428181148, 1428181149, 1428181150, 1428181151, 
1428181152, 1428181153, 1428181154, 1428181155, 1428181156, 1428181157, 
1428181158, 1428181159, 1428181160, 1428181161, 1428181162, 1428181163, 
1428181164, 1428181165, 1428181166, 1428181167, 1428181168, 1428181169, 
1428181173, 1428181199, 1428181200, 1428181201, 1428181202, 1428181203, 
1428181232, 1428181233, 1428181245, 1428181246, 1428181247, 1428181247, 
1428181248, 1428181249), class = c("POSIXct", "POSIXt"), tzone = ""), 
    Color = c("red", "red", "red", "red", "red", "red", "red", 
    "red", "red", "red", "red", "red", "red", "red", "red", "red", 
    "red", "red", "red", "red", "red", "red", "red", "red", "red", 
    "red", "red", "yellow", "yellow", "yellow", "red", "red", 
    "red", "red", "red", "red", "red", "red", "red", "red", "red", 
    "red", "red", "red", "red", "red", "red", "red", "red", "red", 
    "red", "red", "red", "red", "red", "red", "red", "red", "red", 
    "red", "red", "red", "red", "red", "red", "red", "red", "red", 
    "red", "red", "red", "red", "red", "red", "red", "red", "red", 
    "red", "red", "red", "red", "red", "red", "red", "red", "red", 
    "yellow", "yellow", "yellow", "yellow", "yellow", "yellow", 
    "yellow", "yellow", "red", "red", "yellow", "red", "yellow", 
    "yellow"), var1 = c("group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1", 
    "group1", "group1", "group1", "group1", "group1", "group1"
    )), .Names = c("timestamp", "Color", "var1"), row.names = c(NA, 
-100L), class = c("data.table", "data.frame"))

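We group by 'Color', take the row index of the first row of each group (.I[1L]), and concatenate it with the row indices where the difference between adjacent timestamps is greater than 5. Note that we use the fill argument of shift so that there are no NA elements. (NA elements do not work with .I and would produce an extra NA row.) We then extract the index column ($V1) and use it to subset the dataset from the OP's post.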

 DT[DT[, c(.I[1L],.I[(timestamp - shift(timestamp, 
             fill = timestamp[1L]) )>5]) , Color]$V1]
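The same idea can be checked on a small table where the expected rows are easy to verify by hand. This is only a sketch: the table `toy` and the names `gap`, `idx` and `res` are hypothetical, and the explicit `difftime(..., units = "secs")` is an addition here, not part of the answer above. It is used because subtracting two POSIXct vectors picks the units automatically, so fixing them to seconds makes the comparison with 5 unambiguous.

```r
library(data.table)

# Hypothetical toy data: offsets in seconds from a fixed start time
toy <- data.table(
  timestamp = as.POSIXct("2015-04-04 16:56:52", tz = "UTC") + c(0, 1, 10, 11, 12, 30),
  Color     = c("red", "red", "red", "yellow", "yellow", "yellow")
)

idx <- toy[, {
  # Gap to the previous row within the group, in seconds.
  # fill = timestamp[1L] makes the first gap 0 instead of NA.
  gap <- as.numeric(difftime(timestamp,
                             shift(timestamp, fill = timestamp[1L]),
                             units = "secs"))
  # Always keep the first row of the group, plus rows with a gap above 5 seconds
  c(.I[1L], .I[gap > 5])
}, by = Color]$V1

res <- toy[idx]
print(res)
```

For red the gaps are 0, 1 and 9 seconds, and for yellow 0, 1 and 18 seconds, so rows 1, 3, 4 and 6 of `toy` are kept.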

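I think it is best to do this in two steps (you can use fill to remove the NA rows): DT1 5],by=Color]$V1];DT2

Interesting. If we had more grouping variables than just 'Color', would it be enough to put list(Color,Var2,Var3) in place of the Color part of each line?

I posted a compact solution below. I think that is what you wanted. For more variables, yes, the rbindlist solution needs more typing, because we have to put them in a list or use (Color, variable, etc.).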
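As a sketch of the multi-variable case raised in these comments, the single-expression approach can take several grouping columns in `by`, written either as `list(Color, var1)` or with the `.()` shorthand. The table `toy` and the names `gap` and `idx` are hypothetical and chosen only for illustration, and the explicit `difftime(..., units = "secs")` is an assumption added here to fix the units.

```r
library(data.table)

# Hypothetical toy data with two grouping columns
toy <- data.table(
  timestamp = as.POSIXct("2015-04-04 16:56:52", tz = "UTC") + c(0, 10, 11, 12, 30, 31),
  Color     = c("red", "red", "red", "red", "yellow", "yellow"),
  var1      = c("group1", "group1", "group2", "group2", "group1", "group1")
)

# Group by both columns; .(Color, var1) is shorthand for list(Color, var1)
idx <- toy[, {
  gap <- as.numeric(difftime(timestamp,
                             shift(timestamp, fill = timestamp[1L]),
                             units = "secs"))
  # First row of each (Color, var1) group, plus rows more than 5 seconds after the previous one
  c(.I[1L], .I[gap > 5])
}, by = .(Color, var1)]$V1

print(toy[idx])
```

Here (red, group1) keeps rows 1 and 2 because the gap is 10 seconds, while (red, group2) and (yellow, group1) keep only their first rows, 3 and 5.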