R: How can I group data when the group headers are interspersed with the data in the same column?
Basically, my data is grouped by day, with an inconsistent number of rows in between:
16-Oct-16
Name1
Name2
Name3
17-Oct-16
Name1
Name2
Name3
Name4
Name5
19-Oct-16
etc.
I need to be able to take the group data and apply it to the child records.
The expected result should look like this:
Name1 16-Oct-16
Name2 16-Oct-16
Name3 16-Oct-16
Name1 17-Oct-16
Name2 17-Oct-16
Name3 17-Oct-16
Name4 17-Oct-16
Name5 17-Oct-16
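I am using data.table, but so far I can't think of any way to do this other than a loop.
The following script generates the kind of dataset I am looking at: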
data.table(c('October 16, 2016', paste0('Name',1:4),
'October 17, 2016', paste0('Name',1:12),
'October 20, 2016', paste0('Name',1:2),
'October 25, 2016', paste0('Name',1:6)))
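I just want to copy the appropriate date field to every name row, ending up with a tidy dataset in which every row has a name and a date.
There is a data.table solution that I have used in similar situations. (I have tested it with data.table version 1.9.7, but it should also work with CRAN version 1.9.6.)
Read data
(I guess there is a faster version using a rolling join instead of na.locf)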
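As a rough illustration of the rolling-join idea mentioned above, here is a minimal sketch; it is not the answerer's code and makes no claim about speed. The column names `idx` and `day` and the small sample table are assumptions made for the example, and `%b` assumes an English `LC_TIME` locale.

```r
library(data.table)

# Sample data in the same shape as the question: date headers followed by names.
dt <- data.table(V1 = c("16-Oct-16", "Name1", "Name2", "Name3",
                        "17-Oct-16", "Name1", "Name2"))

# Remember the original row position; the join rolls along this key.
dt[, idx := .I]

# Parse the headers; non-date rows become NA.
# "%b" assumes an English LC_TIME locale (assumption).
dt[, tmp := as.Date(V1, format = "%d-%b-%y")]

headers <- dt[!is.na(tmp), .(idx, day = tmp)]  # one row per group header
members <- dt[is.na(tmp), .(idx, V1)]          # the name rows

# roll = TRUE matches each member row to the last header at or before it.
res <- headers[members, on = "idx", roll = TRUE][, .(V1, day)]
print(res)
```

Because the header rows never appear in `members`, no separate step is needed to drop them afterwards.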
Remove group header rows
To remove the group headers, we need to keep a temporary column (tmp) that marks them:
dt[, tmp := lubridate::dmy(V1)][, day := zoo::na.locf(tmp)]
print(dt)
V1 tmp day
1: 16-Oct-16 2016-10-16 2016-10-16
2: Name1 <NA> 2016-10-16
3: Name2 <NA> 2016-10-16
4: Name3 <NA> 2016-10-16
5: 17-Oct-16 2016-10-17 2016-10-17
6: Name1 <NA> 2016-10-17
7: Name2 <NA> 2016-10-17
8: Name3 <NA> 2016-10-17
9: Name4 <NA> 2016-10-17
10: Name5 <NA> 2016-10-17
11: 19-Oct-16 2016-10-19 2016-10-19
dt <- dt[is.na(tmp)]
print(dt)
V1 tmp day
1: Name1 <NA> 2016-10-16
2: Name2 <NA> 2016-10-16
3: Name3 <NA> 2016-10-16
4: Name1 <NA> 2016-10-17
5: Name2 <NA> 2016-10-17
6: Name3 <NA> 2016-10-17
7: Name4 <NA> 2016-10-17
8: Name5 <NA> 2016-10-17
dt[, tmp := NULL]
print(dt)
V1 day
1: Name1 2016-10-16
2: Name2 2016-10-16
3: Name3 2016-10-16
4: Name1 2016-10-17
5: Name2 2016-10-17
6: Name3 2016-10-17
7: Name4 2016-10-17
8: Name5 2016-10-17
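The same steps can also be sketched without zoo or lubridate, using `cumsum()` to number the groups. This is an alternative technique, not the method shown above; `%b` again assumes an English `LC_TIME` locale. One thing to keep in mind with the `na.locf` version: by default `zoo::na.locf` drops leading `NA`s (`na.rm = TRUE`), so if the first row is not a header, `na.rm = FALSE` is likely needed to keep the lengths aligned.

```r
library(data.table)

dt <- data.table(V1 = c("16-Oct-16", "Name1", "Name2", "Name3",
                        "17-Oct-16", "Name1", "Name2", "Name3", "Name4", "Name5",
                        "19-Oct-16"))

# Parse the headers; name rows become NA.
# "%b" assumes an English LC_TIME locale (assumption).
dt[, tmp := as.Date(V1, format = "%d-%b-%y")]

# cumsum() increases by one at every header row, so it numbers the groups.
# Within each group the first row is the header, which carries the date.
dt[, day := tmp[1L], by = cumsum(!is.na(tmp))]

# Keep only the name rows, then drop the helper column.
res <- dt[is.na(tmp)][, tmp := NULL]
print(res)
```

Rows that appear before the first header fall into group 0 and simply get `NA` as their day.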
Another option is to use a regular expression pattern. For the first example dataset:
library(data.table)
library(zoo)
dt1[grep('([0-9]{1,2})-([A-Za-z]+)-(\\d{2})', V1), V2 := V1
][, V2 := na.locf(V2)][V1!=V2]
which gives:
V1 V2
1: Name1 16-Oct-16
2: Name2 16-Oct-16
3: Name3 16-Oct-16
4: Name1 17-Oct-16
5: Name2 17-Oct-16
6: Name3 17-Oct-16
7: Name4 17-Oct-16
8: Name5 17-Oct-16
For the second dataset, you can use:
dt2[grep('([A-Za-z]+ )([0-9]{1,2}[,] )(\\d{4})', V1), V2 := V1
][, V2 := na.locf(V2)][V1!=V2]
which gives:
V1 V2
1: Name1 October 16, 2016
2: Name2 October 16, 2016
3: Name3 October 16, 2016
4: Name4 October 16, 2016
5: Name1 October 17, 2016
6: Name2 October 17, 2016
7: Name3 October 17, 2016
8: Name4 October 17, 2016
9: Name5 October 17, 2016
10: Name6 October 17, 2016
11: Name7 October 17, 2016
12: Name8 October 17, 2016
13: Name9 October 17, 2016
14: Name10 October 17, 2016
15: Name11 October 17, 2016
16: Name12 October 17, 2016
17: Name1 October 20, 2016
18: Name2 October 20, 2016
19: Name1 October 25, 2016
20: Name2 October 25, 2016
21: Name3 October 25, 2016
22: Name4 October 25, 2016
23: Name5 October 25, 2016
24: Name6 October 25, 2016
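Since the two regex calls differ only in the pattern, they could be wrapped in one function. The following is a sketch under assumptions: `spread_headers` is a hypothetical helper (not part of data.table or zoo), the patterns are anchored variants of the ones above, and the small tables are cut-down versions of the sample data. It filters on the header match itself rather than on `V1 != V2`.

```r
library(data.table)

# Hypothetical helper: rows of column V1 that match `pattern` are treated
# as group headers and copied into V2 for the rows that follow them.
spread_headers <- function(dt, pattern) {
  out <- copy(dt)                      # do not modify the input by reference
  is_header <- grepl(pattern, out$V1)
  grp <- cumsum(is_header)             # group number; 0 means "before any header"
  headers <- out$V1[is_header]
  # Look up each row's header; rows in group 0 get NA.
  out[, V2 := c(NA_character_, headers)[grp + 1L]]
  out[!is_header]                      # drop the header rows themselves
}

dt1 <- data.table(V1 = c("16-Oct-16", "Name1", "Name2", "17-Oct-16", "Name1"))
dt2 <- data.table(V1 = c("October 16, 2016", paste0("Name", 1:2),
                         "October 17, 2016", paste0("Name", 1:3)))

res1 <- spread_headers(dt1, "^[0-9]{1,2}-[A-Za-z]+-[0-9]{2}$")
res2 <- spread_headers(dt2, "^[A-Za-z]+ [0-9]{1,2}, [0-9]{4}$")
print(res1)
print(res2)
```

Anchoring the patterns with `^` and `$` keeps a name that merely contains a date-like substring from being mistaken for a header.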
Data used:
dt1 <- fread("16-Oct-16
Name1
Name2
Name3
17-Oct-16
Name1
Name2
Name3
Name4
Name5
19-Oct-16", header = FALSE)
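dt2 <- data.table(c('October 16, 2016', paste0('Name',1:4),
'October 17, 2016', paste0('Name',1:12),
'October 20, 2016', paste0('Name',1:2),
'October 25, 2016', paste0('Name',1:6)))
In the real world, how do you get this data in the first place? This sounds like a perfect job for a tool that tidies the dataset before it is loaded into R.
Could you be more precise in your example data, and say what your expected result is? There is guidance on how to improve your question; an example would help.
The example data you show and the script that generates it should be consistent. Right now, they use two different date formats.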
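Because the two sample datasets use different date formats (`16-Oct-16` and `October 16, 2016`), it may help to parse both into a proper `Date` so the results are comparable. A minimal base R sketch follows; `parse_header_date` is a hypothetical helper, and `%b`/`%B` assume an English `LC_TIME` locale.

```r
# Hypothetical helper: try both header formats and keep whichever one parses.
# Rows that match neither format (the name rows) stay NA.
parse_header_date <- function(x) {
  d   <- as.Date(x, format = "%d-%b-%y")    # e.g. "16-Oct-16"
  alt <- as.Date(x, format = "%B %d, %Y")   # e.g. "October 16, 2016"
  missing <- is.na(d)
  d[missing] <- alt[missing]                # fill the gaps from the second format
  d
}

x <- c("16-Oct-16", "Name1", "October 17, 2016", "Name2")
parsed <- parse_header_date(x)
print(parsed)
```

Index assignment is used instead of `ifelse()` because `ifelse()` would drop the `Date` class.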