R: identifying duplicate data using a threshold
I am working with Bluetooth sensor data and need to identify possible duplicate readings for each unique ID. The Bluetooth sensor scans every five seconds, so a device that is not moving quickly (for example, in traffic) may be picked up again in subsequent readings. If the same device makes a round trip there may legitimately be several readings, but those should be several minutes apart. I can't figure out how to eliminate the duplicates. I need to compute a time-difference column whenever the macid matches the previous row. The data is formatted as follows:
macid time
00:03:7A:4D:F3:59 82333
00:03:7A:EF:58:6F 223556
00:03:7A:EF:58:6F 223601
00:03:7A:EF:58:6F 232731
00:03:7A:EF:58:6F 232736
00:05:4F:0B:45:F7 164141
I need to create:
macid time timediff
00:03:7A:4D:F3:59 82333 NA
00:03:7A:EF:58:6F 223556 NA
00:03:7A:EF:58:6F 223601 45
00:03:7A:EF:58:6F 232731 9130
00:03:7A:EF:58:6F 232736 5
00:05:4F:0B:45:F7 164141 NA
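My first attempt is very slow and not really usable:
dedupeIDs <- function (zz) {
  # Order by macid and then time
  zz <- zz[order(zz$macid, zz$time), ]
  # Difference to the previous row; 999999 is a sentinel for "no previous reading"
  zz$timediff <- c(999999, diff(zz$time))
  for (i in 2:nrow(zz)) {
    # A different macid than the previous row starts a new device: reset to the sentinel
    if (zz[i, "macid"] != zz[i - 1, "macid"]) {
      zz[i, "timediff"] <- 999999
    }
  }
  return(zz)
}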
How about:
x <- structure(list(macid= structure(c(1L, 2L, 2L, 2L, 2L, 3L),
.Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
class = "factor"), time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L), class = "data.frame")
# ensure 'x' is ordered properly
x <- x[order(x$macid,x$time),]
# add timediff column by macid
x$timediff <- ave(x$time, x$macid, FUN=function(x) c(NA,diff(x)))
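x
Perfect, I forgot about ave. I had been putting something together with rle along similar lines, but this is more direct and to the point. Thank you very much.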
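To go from the time difference to an actual duplicate flag, each gap can be compared with a cutoff. The following is only a sketch: the `threshold` value of 60 and the column name `is_dup` are assumptions chosen for illustration, not something given in the question, and the cutoff should be tuned to the sensor setup.

```r
# Rebuild the example data from the question
x <- data.frame(
  macid = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F",
            "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
  time  = c(82333, 223556, 223601, 232731, 232736, 164141)
)

# Order by macid, then time, and compute the gap to the previous reading per device
x <- x[order(x$macid, x$time), ]
x$timediff <- ave(x$time, x$macid, FUN = function(t) c(NA, diff(t)))

# Hypothetical cutoff, in the same units as 'time' (an assumption for illustration)
threshold <- 60

# A reading is a likely duplicate if it follows the previous reading of the same
# device by no more than the threshold; first readings (NA gap) are never flagged
x$is_dup <- !is.na(x$timediff) & x$timediff <= threshold

# Keep only the readings that were not flagged
deduped <- x[!x$is_dup, ]
```

With the sample data this flags the second and fourth readings of 00:03:7A:EF:58:6F and keeps the other four rows.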
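One caveat that may be worth checking: the time values look as though they could be clock times encoded as HHMMSS (for example 223556 for 22:35:56). That is my assumption and is not stated in the question. If it holds, subtracting the raw numbers overstates any gap that crosses a minute boundary: 223556 to 223601 would be 5 seconds rather than 45. A minimal sketch that converts to seconds since midnight first, using a hypothetical helper `hhmmss_to_sec`:

```r
# Hypothetical helper: convert a numeric HHMMSS value to seconds since midnight
# (assumes 'time' is encoded as HHMMSS, which the question does not confirm)
hhmmss_to_sec <- function(t) {
  h <- t %/% 10000          # hours
  m <- (t %/% 100) %% 100   # minutes
  s <- t %% 100             # seconds
  h * 3600 + m * 60 + s
}

# Example data from the question
x <- data.frame(
  macid = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F",
            "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
  time  = c(82333, 223556, 223601, 232731, 232736, 164141)
)

# Order by macid and time, then take per-device differences in real seconds
x <- x[order(x$macid, x$time), ]
x$secs <- hhmmss_to_sec(x$time)
x$timediff <- ave(x$secs, x$macid, FUN = function(s) c(NA, diff(s)))
```

Under this assumption the gaps for 00:03:7A:EF:58:6F become 5, 3090 and 5 seconds, which matches the five-second scan interval described in the question.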