R: collapse multiple rows, keeping the value from some rows in one variable and the value from another row in another variable
I have a data frame that reports the start and end dates of contracts, as shown below:
df <- structure(list(dyadID = c(2, 3, 4, 2, 2, 5, 5, 1, 13765, 13765, 13765, 13765, 43164, 43164, 43164),
employeesID = c("Alf", "Alf","Alf", "Alf", "Alf", "Alf", "Alf", "Alf", "Bet", "Bet", "Bet", "Bet", "Gam", "Gam", "Gam"),
employersID = c("31974", "32009", "32040", "31974", "31974", "358291", "358291", "31665", "31345", "31345", "31345", "31345", "363109", "363109", "363109"),
start_date = structure(c(15613, 15863, 15937, 16295, 16299, 17037, 17045, 17136, 15692, 16097, 16141, 16513, 17116, 17554, 17913), class = "Date"),
end_date = structure(c(15862, 15937, 16295, 16297, 17036, 17044, 17136, NA, 16067, 16141, 16505, NA, 17543, 17907, 18272), class = "Date")),
row.names = c(NA,-15L), class = c("data.table", "data.frame"))
dyadID employeesID employersID start_date end_date
1: 2 Alf 31974 2012-09-30 2013-06-06
2: 3 Alf 32009 2013-06-07 2013-08-20
3: 4 Alf 32040 2013-08-20 2014-08-13
4: 2 Alf 31974 2014-08-13 2014-08-15
5: 2 Alf 31974 2014-08-17 2016-08-23
6: 5 Alf 358291 2016-08-24 2016-08-31
7: 5 Alf 358291 2016-09-01 2016-12-01
8: 1 Alf 31665 2016-12-01 <NA>
9: 13765 Bet 31345 2012-12-18 2013-12-28
10: 13765 Bet 31345 2014-01-27 2014-03-12
11: 13765 Bet 31345 2014-03-12 2015-03-11
12: 13765 Bet 31345 2015-03-19 <NA>
13: 43164 Gam 363109 2016-11-11 2018-01-12
14: 43164 Gam 363109 2018-01-23 2019-01-11
15: 43164 Gam 363109 2019-01-17 2020-01-11
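I would like to collapse consecutive rows that belong to the same dyad, keeping the start_date of the first collapsed row and the end_date of the last collapsed row, to obtain:
dyadID employeesID employersID start_date end_date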
1: 2 Alf 31974 2012-09-30 2013-06-06
2: 3 Alf 32009 2013-06-07 2013-08-20
3: 4 Alf 32040 2013-08-20 2014-08-13
5: 2 Alf 31974 2014-08-13 2016-08-23 # collapsed observation (one time), keeping start_date of the first collapsed observation and end_date of the last collapsed observation
6: 5 Alf 358291 2016-08-24 2016-12-01 # collapsed one time
7: 1 Alf 31665 2016-12-01 <NA>
8: 13765 Bet 31345 2012-12-18 <NA> # collapsed observation (3 times),keeping start_date of the first collapsed observation and end_date of the last collapsed observation
13: 43164 Gam 363109 2016-11-11 2020-01-11 # collapsed observation (2 times),keeping start_date of the first collapsed observation and end_date of the last collapsed observation
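To do this I tried the following, but it does not look very straightforward, and it does not work when the dates have to be adjusted more than once (that is, when more than two consecutive rows need to be collapsed):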
library(data.table)
library(dplyr) # lag() and if_else() used below come from dplyr
df <- setDT(df)[order(employeesID,start_date), same_dyd := ifelse(dyadID==lag(dyadID),1,0),
by=.(employeesID) # this identifies the observations I need to collapse
][is.na(same_dyd),same_dyd:=0
][order(employeesID,start_date),
new_start_date:=if_else(same_dyd==1,lag(start_date),start_date)] # this creates a new variable with the correct date when there is only one new contract.
We can group by the run-length id of 'dyadID' together with 'dyadID', 'employeesID' and 'employersID', and summarise by taking the first element of 'start_date' and the last element of 'end_date':
library(data.table)
df[, .(start_date = first(start_date),
end_date = last(end_date)),
.(grp = rleid(dyadID), dyadID, employeesID, employersID)]
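Since the logic behind the grouping is asked about in the comments, here is a minimal sketch of what rleid() does, using only the dyadID values of the first eight rows (employee "Alf"). The names dyad, run_id, toy and runs are illustrative and not part of the original answer.

```r
library(data.table)

# dyadID values for the first eight rows of the example data (employee "Alf")
dyad <- c(2, 3, 4, 2, 2, 5, 5, 1)

# rleid() gives a new id every time the value changes from one row to the next,
# so equal values only share an id when they are adjacent
run_id <- rleid(dyad)
print(run_id)

# Grouping by the run id keeps the first spell with dyadID 2 (row 1)
# separate from the later spell with dyadID 2 (rows 4 and 5)
toy <- data.table(dyadID = dyad, run_id = run_id)
runs <- toy[, .(n_rows = .N), by = .(run_id, dyadID)]
print(runs)
```

Because the run id depends on row order, the data should already be sorted by employee and start date, as it is in the example df.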
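If the other columns should be kept with the values from the first row of each group, create a row index with .I and use it to pull, from the original data, the columns that are not in the summary: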
out <- df[, .(start_date = first(start_date),
end_date = last(end_date), rn = .I[1]),
.(grp = rleid(dyadID), dyadID, employeesID, employersID)]
cbind(out, df[out$rn, setdiff(names(df), names(out)), with = FALSE])
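In the example df every column already appears in the summary, so the setdiff() above selects nothing. The sketch below uses a toy table with one extra column to show what the row index does. The sector column and the names toy, extra and res are hypothetical, added only for illustration.

```r
library(data.table)

# Toy data: `sector` is a hypothetical extra column, not present in the original df
toy <- data.table(
  dyadID     = c(2, 2, 5),
  start_date = as.Date(c("2014-08-13", "2014-08-17", "2016-08-24")),
  end_date   = as.Date(c("2014-08-15", "2016-08-23", "2016-08-31")),
  sector     = c("retail", "retail", "transport")
)

out <- toy[, .(start_date = first(start_date),
               end_date   = last(end_date),
               rn         = .I[1]),   # row number (in toy) of the first row of each run
           by = .(grp = rleid(dyadID), dyadID)]

# Take the columns that the summary dropped from those first rows and bind them on
extra <- toy[out$rn, setdiff(names(toy), names(out)), with = FALSE]
res <- cbind(out, extra)
print(res)
```

The helper columns grp and rn can be removed afterwards, for example with res[, c("grp", "rn") := NULL].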
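Comments:

Wow! This is amazingly concise and efficient! Thank you @akrun! I would never have dreamed of something like this. Would you mind explaining the logic behind it in more detail?

@Alex There is one row that does not match. Is it because of shift(start_date, type = 'lead') == end_date?

Do you mean row 7 of the original df? You're right, I left it uncollapsed by mistake. Your output is correct. I would also like to know whether the same approach can be used while keeping the other variables in df (which should be identical within the same dyad)?

@Alex That can be done, but I have one doubt: which row do you want to keep here? For start_date we keep the first element, and for end_date the last element.

Very useful! Thank you very much! Yes, I have edited this.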