R: collapse multiple rows, keeping one variable's value from some rows and another variable's value from another row


I have a data frame that reports the start and end dates of contracts, as follows:



df <- structure(list(dyadID = c(2, 3, 4, 2, 2, 5, 5, 1, 13765, 13765, 13765, 13765, 43164, 43164, 43164), 
                     employeesID = c("Alf", "Alf","Alf", "Alf", "Alf", "Alf", "Alf", "Alf", "Bet", "Bet", "Bet", "Bet", "Gam", "Gam", "Gam"), 
                     employersID = c("31974", "32009", "32040", "31974", "31974", "358291", "358291", "31665", "31345", "31345", "31345", "31345", "363109", "363109", "363109"), 
                     start_date = structure(c(15613, 15863, 15937, 16295, 16299, 17037, 17045, 17136, 15692, 16097, 16141, 16513, 17116, 17554, 17913), class = "Date"), 
                     end_date = structure(c(15862, 15937, 16295, 16297, 17036, 17044, 17136, NA, 16067, 16141, 16505, NA, 17543, 17907, 18272), class = "Date")), 
                row.names = c(NA,-15L), class = c("data.table", "data.frame"))

    dyadID employeesID employersID start_date   end_date
 1:      2        Alf      31974 2012-09-30 2013-06-06
 2:      3        Alf      32009 2013-06-07 2013-08-20
 3:      4        Alf      32040 2013-08-20 2014-08-13
 4:      2        Alf      31974 2014-08-13 2014-08-15
 5:      2        Alf      31974 2014-08-17 2016-08-23
 6:      5        Alf     358291 2016-08-24 2016-08-31
 7:      5        Alf     358291 2016-09-01 2016-12-01
 8:      1        Alf      31665 2016-12-01       <NA>
 9:  13765        Bet      31345 2012-12-18 2013-12-28
10:  13765        Bet      31345 2014-01-27 2014-03-12
11:  13765        Bet      31345 2014-03-12 2015-03-11
12:  13765        Bet      31345 2015-03-19       <NA>
13:  43164        Gam     363109 2016-11-11 2018-01-12
14:  43164        Gam     363109 2018-01-23 2019-01-11
15:  43164        Gam     363109 2019-01-17 2020-01-11
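
I would like to collapse consecutive rows that share the same dyadID, keeping the start_date of the first collapsed row and the end_date of the last one, so that the result looks like this:

    dyadID employeesID employersID start_date   end_date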
 1:      2        Alf      31974 2012-09-30 2013-06-06
 2:      3        Alf      32009 2013-06-07 2013-08-20
 3:      4        Alf      32040 2013-08-20 2014-08-13
 5:      2        Alf      31974 2014-08-13 2016-08-23 # collapsed observation (one time), keeping start_date of the first collapsed observation and end_date of the last collapsed observation
 6:      5        Alf     358291 2016-08-24 2016-12-01 # collapsed one time
 7:      1        Alf      31665 2016-12-01       <NA>
 8:  13765        Bet      31345 2012-12-18       <NA> # collapsed observation (3 times),keeping start_date of the first collapsed observation and end_date of the last collapsed observation
13:  43164        Gam     363109 2016-11-11 2020-01-11 # collapsed observation (2 times),keeping start_date of the first collapsed observation and end_date of the last collapsed observation
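
To do this I tried the following, but it does not look very straightforward, and it does not work when the dates need to be adjusted more than once: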

# Note: if_else() is from dplyr, and lag() is assumed to be dplyr::lag()
df <- setDT(df)[order(employeesID, start_date),
                same_dyd := ifelse(dyadID == lag(dyadID), 1, 0),
                by = .(employeesID)  # this identifies the observations I need to collapse
               ][is.na(same_dyd), same_dyd := 0
               ][order(employeesID, start_date),
                 new_start_date := if_else(same_dyd == 1, lag(start_date), start_date)]  # this creates a new variable with the correct date when there is only one new contract

We can group by the run-length id of 'dyadID' (via rleid), together with 'dyadID', 'employeesID' and 'employersID', and summarise by taking the first element of 'start_date' and the last element of 'end_date':

library(data.table)
df[, .(start_date = first(start_date),
       end_date   = last(end_date)),
   by = .(grp = rleid(dyadID), dyadID, employeesID, employersID)]
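
To see why this works, it may help to look at what rleid() returns on its own. The following is only a small illustrative sketch: the vector `ids` is Alf's dyadID values copied from the question, and the names `ids` and `grp` are made up for this example.

```r
library(data.table)

# Alf's dyadID values, in the row order of the question's data
ids <- c(2, 3, 4, 2, 2, 5, 5, 1)

# rleid() increments its counter every time the value changes,
# so only *consecutive* equal values end up sharing an id
grp <- rleid(ids)
grp
# [1] 1 2 3 4 4 5 5 6
```

The first dyadID 2 (id 1) and the later pair of 2s (id 4) get different ids, so they are not merged with each other, which matches the desired output in the question.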

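If you want to keep the column values from the first row of each group, create a row index with .I and use it to pull from the original data the columns that are not already in the summary: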

out <- df[, .(start_date = first(start_date),
              end_date   = last(end_date),
              rn         = .I[1]),
          by = .(grp = rleid(dyadID), dyadID, employeesID, employersID)]
cbind(out, df[out$rn, setdiff(names(df), names(out)), with = FALSE])
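
Both snippets rely on the rows already being sorted by employee and start date, as they are in the question's data. If that cannot be guaranteed, one possible variation (an assumption on my part, not something stated in the thread) is to sort first and pass both columns to rleid(), so that a new run starts whenever either the employee or the dyad changes. The table `toy` and its values below are hypothetical and deliberately unsorted:

```r
library(data.table)

# Hypothetical toy data (not from the question); rows are out of order on purpose
toy <- data.table(
  dyadID      = c(7, 7, 7, 9),
  employeesID = c("A", "B", "A", "A"),
  start_date  = as.Date(c("2020-03-01", "2020-01-01", "2020-01-01", "2020-06-01")),
  end_date    = as.Date(c("2020-05-31", "2020-12-31", "2020-02-28", NA))
)

# Sort in place so each employee's contracts are contiguous and chronological
setorder(toy, employeesID, start_date)

# rleid() on two columns starts a new run when the employee OR the dyad changes
res <- toy[, .(start_date = first(start_date),
               end_date   = last(end_date)),
           by = .(grp = rleid(employeesID, dyadID), employeesID, dyadID)]
res
```

Here the two contracts of employee "A" in dyad 7 collapse into one row, while employee "B", who shares the same dyadID value, stays separate. An open-ended contract keeps its NA end_date because last() simply returns the final element.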

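Wow!!!! This is amazingly concise and efficient!!! Thank you @akrun! I would never have dreamed of something like this! Would you mind explaining the logic behind it in detail?

@Alex There is one row that does not match. Is it because of shift(start_date, type = 'lead') == end_date?

Do you mean row 7 of the original df? You are right, I left it uncollapsed by mistake. Your output is correct. I would also like to know whether there is a way to use the same approach while keeping the other variables in df (which should be identical within the same dyad)?

@Alex That can be done, but I have one doubt: which row do you want to keep? With start_date we keep the first element, and with end_date the last element.

Very useful!! Thank you very much! Yes, I edited this.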