R: Duplicate records to indicate start and end times, and label them sequentially in a new column
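Tags: r, time, duplicates, timestamp, row, tidyverse, data.table

I have two data manipulation requests for the table below. I would like to:

1. Duplicate each sub-work record and label the two copies as start and end in a new variable called status. To do this, the timestamp of the following sub-work must be coded as the end time of the preceding sub-work. For each work, the start and end timestamps of the last sub-work will be identical, because no sub-work follows it.
2. Create a column called subWorkInstanceID that indicates the order of the sub-works within each distinct workID.

Note: the original table has millions of records, so I would appreciate a fast solution if possible.

Thanks in advance.

The sample table is named dt.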
Here is a solution using data.table. Explanations are inline.
library(data.table)
setDT(dt)

# create an end time and subWorkInstanceID within each workID
wideDT <- dt[, list(subWorkID = subWorkID,
                    subWorkInstanceID = seq_len(.N),
                    start = timeStamp,
                    end = shift(timeStamp, fill = timeStamp[.N], type = "lead")),
             by = .(workID)]

# melt into OP's desired long format
res <- melt(wideDT, measure.vars = c("start", "end"),
            variable.name = "status", value.name = "timeStamp")

# order by sequence number rather than subWorkID, so repeated or unsorted IDs keep
# their position; status is a factor whose levels follow measure.vars (start, then end)
setorder(res, workID, subWorkInstanceID, status)
res
#     workID subWorkID subWorkInstanceID status           timeStamp
#  1:     w1         a                 1  start 2015-01-08 13:27:14
#  2:     w1         a                 1    end 2015-01-08 15:45:43
#  3:     w1         b                 2  start 2015-01-08 15:45:43
#  4:     w1         b                 2    end 2015-01-08 15:53:36
#  5:     w1         c                 3  start 2015-01-08 15:53:36
#  6:     w1         c                 3    end 2015-01-08 16:15:08
#  7:     w1         e                 4  start 2015-01-08 16:15:08
#  8:     w1         e                 4    end 2015-01-08 16:15:08
#  9:     w2         a                 1  start 2015-04-13 13:34:33
# 10:     w2         a                 1    end 2015-04-13 13:36:13
# 11:     w2         b                 2  start 2015-04-13 13:36:13
# 12:     w2         b                 2    end 2015-04-13 13:39:20
# 13:     w2         k                 3  start 2015-04-13 13:39:20
# 14:     w2         k                 3    end 2015-04-13 13:39:20
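The key step in this answer is the lead shift that turns each sub-work's start into the previous sub-work's end. The following is a minimal sketch of that step alone, assuming a hypothetical three-element vector `starts` that stands in for the timestamps of a single workID; it is an illustration, not part of the original answer.

```r
library(data.table)

# Hypothetical timestamps for one workID (illustrative values only)
starts <- as.POSIXct(c("2015-01-08 13:27:14",
                       "2015-01-08 15:45:43",
                       "2015-01-08 15:53:36"), tz = "UTC")

# Move every value one position earlier; the vacated last slot is filled
# with the last start, so the final sub-work starts and ends at the same time
ends <- shift(starts, n = 1L, fill = starts[length(starts)], type = "lead")

toy <- data.table(start = starts, end = ends)
toy
```

Within `by = .(workID)`, the same call runs once per group, which is why `timeStamp[.N]` (the last timestamp of the group) serves as the fill value there.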
You can use data.table:
library(data.table)
# per workID: duplicate each row (numbered by subWorkInstanceID), alternate the
# start/end labels, and lead the duplicated timestamps by one row; [, -3] then
# drops the original timeStamp column
setDT(dt)[, c(s <- cbind(.SD, subWorkInstanceID = 1:.N)[rep(1:.N, each = 2)],
              status = list(rep(c("start", "end"), length = nrow(s))),
              timestamp = shift(s[, "timeStamp"], , s[.N, "timeStamp"], "lead")),
          by = workID][, -3]
    workID subWorkID subWorkInstanceID status           timestamp
 1:     w1         a                 1  start 2015-01-08 13:27:14
 2:     w1         a                 1    end 2015-01-08 15:45:43
 3:     w1         b                 2  start 2015-01-08 15:45:43
 4:     w1         b                 2    end 2015-01-08 15:53:36
 5:     w1         c                 3  start 2015-01-08 15:53:36
 6:     w1         c                 3    end 2015-01-08 16:15:08
 7:     w1         e                 4  start 2015-01-08 16:15:08
 8:     w1         e                 4    end 2015-01-08 16:15:08
 9:     w2         a                 1  start 2015-04-13 13:34:33
10:     w2         a                 1    end 2015-04-13 13:36:13
11:     w2         b                 2  start 2015-04-13 13:36:13
12:     w2         b                 2    end 2015-04-13 13:39:20
13:     w2         k                 3  start 2015-04-13 13:39:20
14:     w2         k                 3    end 2015-04-13 13:39:20
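The row-duplication idea behind this one-liner can be looked at in isolation. The sketch below assumes a hypothetical three-row table `toy` representing one group; it only shows how `rep(..., each = 2)` doubles the rows and how the two labels are recycled, and it leaves the timestamp shift out.

```r
library(data.table)

# Hypothetical single group with three sub-works
toy <- data.table(subWorkID = c("a", "b", "c"))
toy[, subWorkInstanceID := seq_len(.N)]

# Repeat each row index twice (1 1 2 2 3 3) to duplicate every record
dup <- toy[rep(seq_len(nrow(toy)), each = 2)]

# Recycle the two labels down the duplicated rows
dup[, status := rep(c("start", "end"), length.out = .N)]
dup
```

Because the sequence number is assigned before the rows are duplicated, both copies of a sub-work share the same subWorkInstanceID.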
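Comments:

See my answer. Your original format differs from the dataset created by your code, because your code results in 7 rows. But I think my solution still works.

You are right; I will correct this. I will check both answers and let you know.

Thank you both for your answers. Could you also consider Edit 2 above in the main question?

Update: COMP deleted.

Works great! Finally, I would appreciate it if you could briefly explain your calculation logic.

@kzmlbyrk I'm glad it works. I added some brief explanations.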
Sample table from the question:
dt <- structure(list(workID = c("w1", "w1", "w1", "w1", "w2", "w2", "w2"),
                     subWorkID = c("a", "b", "c", "a", "a", "b", "k"),
                     timeStamp = structure(c(1420741634, 1420749943, 1420750416, 1420751708,
                                             1428946473, 1428946573, 1428946760),
                                           class = c("POSIXct", "POSIXt"), tzone = "")),
                .Names = c("workID", "subWorkID", "timeStamp"),
                class = "data.frame", row.names = c(NA, -7L))