R: Duplicating records to indicate start and end times, and labeling them sequentially in a new column


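I have two data-manipulation requests for the table below.

I would like to:

  • Duplicate each sub-work record and label the two copies as start and end in a new variable named status. To do this, the timestamp of the following sub-work must be used as the end time of the preceding sub-work. For the last sub-work of each work, the start and end timestamps will be identical, because there is no subsequent sub-work.

  • Create a column named subWorkInstanceID that indicates the order of the sub-works within each distinct workID.

  • Note: the original table has millions of records, so I would appreciate a fast solution if possible.

    Thanks in advance.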

    Original format:

    Desired output:

    Create the sample table:



    Here is a solution using data.table. Explanations are inline.

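    library(data.table)
    setDT(dt)  # convert the data.frame to a data.table by reference
    
    # per workID: number the sub-works and use the next timestamp as the end time;
    # the last sub-work has no successor, so its end is filled with its own timestamp
    # (assumes the rows are already in time order within each workID)
    wideDT <- dt[, list(subWorkID=subWorkID,
            subWorkInstanceID=seq_len(.N),
            start=timeStamp,
            end=shift(timeStamp, fill=timeStamp[.N], type="lead")), 
        by=.(workID)]
    
    # melt into OP's desired long format
    res <- melt(wideDT, measure.vars=c("start", "end"), variable.name="status", value.name="timeStamp")
    # order by instance rather than by subWorkID, so a repeated subWorkID stays in sequence;
    # status is a factor with levels start, end, so start sorts before end
    setorder(res, workID, subWorkInstanceID, status)
    res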
    
    #    workID subWorkID subWorkInstanceID status           timeStamp
    # 1:     w1         a                 1  start 2015-01-08 13:27:14
    # 2:     w1         a                 1    end 2015-01-08 15:45:43
    # 3:     w1         b                 2  start 2015-01-08 15:45:43
    # 4:     w1         b                 2    end 2015-01-08 15:53:36
    # 5:     w1         c                 3  start 2015-01-08 15:53:36
    # 6:     w1         c                 3    end 2015-01-08 16:15:08
    # 7:     w1         e                 4  start 2015-01-08 16:15:08
    # 8:     w1         e                 4    end 2015-01-08 16:15:08
    # 9:     w2         a                 1  start 2015-04-13 13:34:33
    #10:     w2         a                 1    end 2015-04-13 13:36:13
    #11:     w2         b                 2  start 2015-04-13 13:36:13
    #12:     w2         b                 2    end 2015-04-13 13:39:20
    #13:     w2         k                 3  start 2015-04-13 13:39:20
    #14:     w2         k                 3    end 2015-04-13 13:39:20
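
    The end-time step in the answer above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical table named toy whose integer timeStamp values stand in for real POSIXct timestamps; it is an illustration, not part of the original answer.

```r
library(data.table)

# Hypothetical toy data: integer "times" stand in for POSIXct timestamps.
toy <- data.table(
  workID    = c("w1", "w1", "w1", "w2", "w2"),
  subWorkID = c("a", "b", "c", "a", "b"),
  timeStamp = c(10L, 20L, 35L, 100L, 130L)
)

# Within each workID, the end time is the next row's timestamp;
# the last row of a group falls back to its own timestamp.
toy[, end := shift(timeStamp, type = "lead", fill = timeStamp[.N]), by = workID]

# Number the sub-works in their existing row order within each workID.
toy[, subWorkInstanceID := seq_len(.N), by = workID]

print(toy)
```

    Here the end column comes out as 20, 35, 35 for w1 and 130, 130 for w2, so the last sub-work of each work has the same start and end.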
    
    You can use data.table:

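    library(data.table)
    # per workID: s holds every row twice, with subWorkInstanceID attached;
    # status alternates start/end, and timestamp is the timeStamp column of s
    # shifted up by one row (the last row keeps its own value);
    # [,-3] then drops the original timeStamp column
    setDT(dt)[,c(s<-cbind(.SD,subWorkInstanceID=1:.N)[rep(1:.N,each=2)],
              status=list(rep(c("start","end"),length=nrow(s))),
              timestamp=shift(s[,"timeStamp"],,s[.N,"timeStamp"],"lead")),
              by=workID][,-3]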
    
    
       workID subWorkID subWorkInstanceID status           timestamp
     1:     w1         a                 1  start 2015-01-08 13:27:14
     2:     w1         a                 1    end 2015-01-08 15:45:43
     3:     w1         b                 2  start 2015-01-08 15:45:43
     4:     w1         b                 2    end 2015-01-08 15:53:36
     5:     w1         c                 3  start 2015-01-08 15:53:36
     6:     w1         c                 3    end 2015-01-08 16:15:08
     7:     w1         e                 4  start 2015-01-08 16:15:08
     8:     w1         e                 4    end 2015-01-08 16:15:08
     9:     w2         a                 1  start 2015-04-13 13:34:33
    10:     w2         a                 1    end 2015-04-13 13:36:13
    11:     w2         b                 2  start 2015-04-13 13:36:13
    12:     w2         b                 2    end 2015-04-13 13:39:20
    13:     w2         k                 3  start 2015-04-13 13:39:20
    14:     w2         k                 3    end 2015-04-13 13:39:20
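
    With millions of rows it may be worth checking the result mechanically rather than by eye. The sketch below is one possible check and is not part of either answer: the names long, wide and ok and the integer timestamps are hypothetical, and it assumes the result has the columns workID, subWorkInstanceID, status and a timestamp column (called timeStamp here).

```r
library(data.table)

# Hypothetical long-format result for a single work with two sub-works.
long <- data.table(
  workID            = "w1",
  subWorkInstanceID = c(1L, 1L, 2L, 2L),
  status            = c("start", "end", "start", "end"),
  timeStamp         = c(10L, 20L, 20L, 20L)
)

# One row per sub-work instance, with start and end side by side.
wide <- dcast(long, workID + subWorkInstanceID ~ status, value.var = "timeStamp")

# chained: each end equals the next start within the work;
# closed: the last sub-work's start equals its end.
ok <- wide[, .(chained = all(end[-.N] == start[-1L]),
               closed  = start[.N] == end[.N]),
           by = workID]
print(ok)
```

    Any workID with chained or closed equal to FALSE would point to rows that are out of order or to a missing start/end pair.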
    
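    See my answer. Your original format differs from the dataset created by your code, because your code produces 7 rows. I think my solution still works, though.

    You are right; I will correct that. I will check both answers and let you know.

    Thanks for both answers. Could you also take Edit 2 in the main question above into account?

    Updated. Comment deleted.

    Works great! Finally, I would appreciate it if you could briefly explain the logic of your computation.

    @kzmlbyrk I'm glad it works. I added some simple explanations.
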
    # sample data (7 rows); note that the fourth w1 record is "a" here,
    # while the printed outputs show "e"
    dt <- structure(list(workID = c("w1", "w1", "w1", "w1", "w2", "w2", 
    "w2"), subWorkID = c("a", "b", "c", "a", "a", "b", "k"), timeStamp = structure(c(1420741634, 
    1420749943, 1420750416, 1420751708, 1428946473, 1428946573, 1428946760
    ), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("workID", 
    "subWorkID", "timeStamp"), class = "data.frame", row.names = c(NA, 
    -7L))
    