R: Generate a sub data frame that groups dates into ranges

I have a data frame with two columns, an identifier and a date. The code below creates a sample data frame:

    x <- c(rep(c("a", "b"), each = 10), rep(c("c", "d"), each = 5))
    y <- c(seq(as.Date("2014-01-01"), as.Date("2014-01-05"), by = 1),
           as.Date("2014-03-12"),
           as.Date("2014-03-15"),
           seq(as.Date("2014-05-11"), as.Date("2014-05-13"), by = 1),
           seq(as.Date("2014-06-11"), as.Date("2014-06-14"), by = 1),
           seq(as.Date("2014-06-01"), as.Date("2014-06-20"), by = 2),
           seq(as.Date("2014-07-31"), as.Date("2014-08-05"), by = 1))

    df <- data.frame(x = x, y = y)
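Try this, using `data.table`:

    library(data.table)

    # Start a new run whenever the gap to the previous date is not exactly one day,
    # then summarise each run by its first date, last date, and length.
    # Note that setDT() converts `df` to a data.table by reference.
    DT1 <- setDT(df)[, indx := cumsum(c(TRUE, diff(y) != 1)), by = x][,
               list(start.rng = y[1], end.rng = y[.N], days.rng = .N),
               by = list(x, indx)][, indx := NULL]

    head(DT1)
    #    x  start.rng    end.rng days.rng
    # 1: a 2014-01-01 2014-01-05        5
    # 2: a 2014-03-12 2014-03-12        1
    # 3: a 2014-03-15 2014-03-15        1
    # 4: a 2014-05-11 2014-05-13        3
    # 5: b 2014-06-11 2014-06-14        4
    # 6: b 2014-06-01 2014-06-01        1

Explanation

I will try to explain the `data.table` code by splitting it into steps.
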
  • Compute the difference between consecutive `y` values within each `x` group.

        setDT(df)[,                # convert `df` from a `data.frame` to a `data.table`
          indx := c(0, diff(y)),   # gap between consecutive `y` elements, stored as an index column
          by = x]                  # computed separately for each `x` group
        # `diff` returns one element fewer than the length of each `x` group, so `0`
        # is prepended. Any value other than `1` works, because the next step uses
        # it to create a grouping index.
    
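  • Create a grouping index, `indx1`, from the `indx` column of the previous step.

        # The cumulative sum increments whenever the gap is not 1, so every run of
        # consecutive dates shares one `indx1` value. Inspect `df` after this step
        # to follow the process.
        df[, indx1 := cumsum(indx != 1), by = x]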
    
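  • Using `indx1` as a grouping variable in addition to `x`, take the first and last `y` value of each group.

        df1 <- df[,
                  list(start.rng = y[1],   # first `y` value
                       end.rng = y[.N],    # last `y` value; `.N` is the number of rows in the group
                       days.rng = .N),     # group length, i.e. `.N`
                  by = list(x, indx1)]     # grouped by `x` and `indx1`
        df1[, indx1 := NULL]               # drop the helper index column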
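
The index construction in the steps above can be checked on a toy vector without `data.table`. The following is a minimal base R sketch, assuming a made-up date vector `d` that is not part of the question. It uses `duplicated` and `tabulate` as a stand-in for the grouped `data.table` summary, so it illustrates the idea rather than reproducing the answer's code.

```r
# Hypothetical toy dates (not from the question): two runs of consecutive days
d <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-03",
               "2014-03-12", "2014-03-13"))

gap <- c(0, diff(d))     # days since the previous date; 0 pads the first element
run <- cumsum(gap != 1)  # increments whenever the gap is not exactly one day

print(gap)  # 0 1 1 68 1
print(run)  # 1 1 1 2 2

# Summarise each run: first date, last date, and number of days
starts <- d[!duplicated(run)]                   # first row of each run
ends   <- d[!duplicated(run, fromLast = TRUE)]  # last row of each run
lens   <- tabulate(run)                         # rows per run

print(data.frame(start.rng = starts, end.rng = ends, days.rng = lens))
```

The padding value only has to differ from `1`, so that the first row of a group always opens a new run.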
    

Thanks, this works well. Still trying to figure out how you did it :-)

@chribonn I updated the solution with some explanation. Hope it helps.