R: How to group data when the group headers are interspersed with the data in the same column?


Basically, my data is grouped by day, with an inconsistent number of rows in between:

16-Oct-16
Name1
Name2
Name3
17-Oct-16
Name1
Name2
Name3
Name4
Name5
19-Oct-16
etc.

I need to be able to take the group data and apply it to the child records. The expected result should look like this:

Name1   16-Oct-16
Name2   16-Oct-16
Name3   16-Oct-16
Name1   17-Oct-16
Name2   17-Oct-16
Name3   17-Oct-16
Name4   17-Oct-16
Name5   17-Oct-16

I am using data.table, but so far I cannot think of any way to do this other than a loop.

The following script generates the kind of dataset I am looking at:

data.table(c('October 16, 2016', paste0('Name',1:4),
             'October 17, 2016', paste0('Name',1:12),
             'October 20, 2016', paste0('Name',1:2),
             'October 25, 2016', paste0('Name',1:6)))

I simply want to copy the appropriate date field onto every name row, ending up with a tidy dataset in which each row has a name and a date.

Here is a data.table solution that I have used in a similar situation. (I tested it with data.table version 1.9.7, but it should also work with the CRAN version 1.9.6.)

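Read the data and carry each date down onto the rows below it. (I suspect a faster version could use a rolling join instead of na.locf.)

Remove the group header rows. To remove the group headers, we need to keep a temporary column: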

dt[, tmp := lubridate::dmy(V1)][, day := zoo::na.locf(tmp)]
print(dt)
           V1        tmp        day
 1: 16-Oct-16 2016-10-16 2016-10-16
 2:     Name1       <NA> 2016-10-16
 3:     Name2       <NA> 2016-10-16
 4:     Name3       <NA> 2016-10-16
 5: 17-Oct-16 2016-10-17 2016-10-17
 6:     Name1       <NA> 2016-10-17
 7:     Name2       <NA> 2016-10-17
 8:     Name3       <NA> 2016-10-17
 9:     Name4       <NA> 2016-10-17
10:     Name5       <NA> 2016-10-17
11: 19-Oct-16 2016-10-19 2016-10-19

dt <- dt[is.na(tmp)]
print(dt)
      V1  tmp        day
1: Name1 <NA> 2016-10-16
2: Name2 <NA> 2016-10-16
3: Name3 <NA> 2016-10-16
4: Name1 <NA> 2016-10-17
5: Name2 <NA> 2016-10-17
6: Name3 <NA> 2016-10-17
7: Name4 <NA> 2016-10-17
8: Name5 <NA> 2016-10-17

dt[, tmp := NULL]
print(dt)
      V1        day
1: Name1 2016-10-16
2: Name2 2016-10-16
3: Name3 2016-10-16
4: Name1 2016-10-17
5: Name2 2016-10-17
6: Name3 2016-10-17
7: Name4 2016-10-17
8: Name5 2016-10-17
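
The rolling join mentioned above as a possible alternative to na.locf could look roughly like the sketch below. It is not part of the tested answer: the helper column pos, the objects headers and members, and the regular expression used to detect header rows are assumptions made for illustration, and the dates are kept as character strings rather than parsed.

```r
library(data.table)

# Example data taken from the question (dd-Mon-yy format).
dt <- data.table(V1 = c("16-Oct-16", "Name1", "Name2", "Name3",
                        "17-Oct-16", "Name1", "Name2", "Name3", "Name4", "Name5",
                        "19-Oct-16"))

# Remember the original position of every row; it serves as the join key.
dt[, pos := .I]

# Assumption: a header row is any value shaped like dd-Mon-yy.
is_header <- grepl("^[0-9]{1,2}-[A-Za-z]{3}-[0-9]{2}$", dt$V1)

headers <- dt[is_header, .(pos, day = V1)]
members <- dt[!is_header, .(pos, name = V1)]

# roll = TRUE matches each member row to the last header at or before it.
res <- headers[members, on = "pos", roll = TRUE][, .(name, day)]
print(res)
```

Because the join only looks at row positions, a trailing header with no rows below it (19-Oct-16 here) simply drops out of the result.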

Another option is to use a regular-expression pattern. For the first example dataset:

library(data.table)
library(zoo)
dt1[grep('([0-9]{1,2})-([A-Za-z]+)-(\\d{2})', V1), V2 := V1
    ][, V2 := na.locf(V2)][V1!=V2]
which gives:

      V1        V2
1: Name1 16-Oct-16
2: Name2 16-Oct-16
3: Name3 16-Oct-16
4: Name1 17-Oct-16
5: Name2 17-Oct-16
6: Name3 17-Oct-16
7: Name4 17-Oct-16
8: Name5 17-Oct-16

For the second dataset, you can use:

dt2[grep('([A-Za-z]+ )([0-9]{1,2}[,] )(\\d{4})', V1), V2 := V1
    ][, V2 := na.locf(V2)][V1!=V2]
which gives:

        V1               V2
 1:  Name1 October 16, 2016
 2:  Name2 October 16, 2016
 3:  Name3 October 16, 2016
 4:  Name4 October 16, 2016
 5:  Name1 October 17, 2016
 6:  Name2 October 17, 2016
 7:  Name3 October 17, 2016
 8:  Name4 October 17, 2016
 9:  Name5 October 17, 2016
10:  Name6 October 17, 2016
11:  Name7 October 17, 2016
12:  Name8 October 17, 2016
13:  Name9 October 17, 2016
14: Name10 October 17, 2016
15: Name11 October 17, 2016
16: Name12 October 17, 2016
17:  Name1 October 20, 2016
18:  Name2 October 20, 2016
19:  Name1 October 25, 2016
20:  Name2 October 25, 2016
21:  Name3 October 25, 2016
22:  Name4 October 25, 2016
23:  Name5 October 25, 2016
24:  Name6 October 25, 2016
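
The same forward fill can also be expressed without zoo, by numbering the groups with cumsum() and copying the first value of each group. This is only a sketch under the assumption that every header row matches the "Month d, yyyy" shape and that the first row of the column is a header; the names is_header, grp and tidy are made up for the example.

```r
library(data.table)

# Second example dataset from the question.
dt2 <- data.table(V1 = c("October 16, 2016", paste0("Name", 1:4),
                         "October 17, 2016", paste0("Name", 1:12),
                         "October 20, 2016", paste0("Name", 1:2),
                         "October 25, 2016", paste0("Name", 1:6)))

# Assumption: header rows look like "Month d, yyyy"; everything else is a name.
is_header <- grepl("^[A-Za-z]+ [0-9]{1,2}, [0-9]{4}$", dt2$V1)

# cumsum() increments at every header, so a header and the rows below it share one group id.
dt2[, grp := cumsum(is_header)]

# The first row of each group is its header; copy it onto the whole group.
dt2[, V2 := V1[1L], by = grp]

# Drop the header rows and the helper column.
tidy <- dt2[!is_header, .(V1, V2)]
print(tidy)
```

Grouping by a running counter avoids the extra dependency, at the cost of relying on the header pattern being strict enough not to match any name.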

Data used:

dt1 <- fread("16-Oct-16
             Name1
             Name2
             Name3
             17-Oct-16
             Name1
             Name2
             Name3
             Name4
             Name5
             19-Oct-16", header = FALSE)

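dt2 <- data.table(c('October 16, 2016', paste0('Name',1:4),
                    'October 17, 2016', paste0('Name',1:12),
                    'October 20, 2016', paste0('Name',1:2),
                    'October 25, 2016', paste0('Name',1:6)))

In the real world, how do you get this data in the first place? This sounds like a perfect job for a tool that tidies up the dataset before it is loaded into R.
Please be more precise about your example data. What is your expected result? An example would help.
The example data you show and the script that generates it should be consistent. Right now they use two different date formats.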