R: zero-filling missing levels in a group-by table
I want to create a vector of time bins for a data table of time-stamped events. Each element of the vector represents a measure within a particular time slot. The data table dt looks like this:
dt=structure(list(
hour = c("20", "21", "21", "21", "21", "02", "02", "02", "02", "02"),
timeSlt = structure(c(6L, 6L, 6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L), .Label = c("[0,4)", "[4,8)", "[8,12)", "[12,16)", "[16,20)", "[20,24)"), class = "factor"),
play_length = c(208.67, 188.49, 58.5, 3.469, 17.92, 211.513, 193.045, 225.306, 212.715, 226.873)),
.Names = c("hour", "timeSlt", "length"),
class = c("data.table","data.frame"), row.names = c(NA, -10L))
Here the hour and timeSlt columns give the hour of the day and the corresponding time slot. timeSlt is a factor:
dt[, timeSlt]
# [1] [20,24) [20,24) [20,24) [20,24) [20,24) [0,4) [0,4) [0,4) [0,4) [0,4)
# Levels: [0,4) [4,8) [8,12) [12,16) [16,20) [20,24)
I want to sum length for each time slot:
dt[, sum(length), by=timeSlt]
# timeSlt V1
# 1: [20,24) 477.049
# 2: [0,4) 1069.452
But the desired output is:
y = data.table(timeSlt=levels(dt[, timeSlt]), sumLength=c(1069.452, 0, 0, 0, 0, 477.049))
# timeSlt sumLength
# 1: [0,4) 1069.452
# 2: [4,8) 0.000
# 3: [8,12) 0.000
# 4: [12,16) 0.000
# 5: [16,20) 0.000
# 6: [20,24) 477.049
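That is, timeSlt sorted by level, with the sum of length filled with 0 for any slot in which no event occurred.
Any help would be appreciated.
We can join dt onto a newly created data.table built from the levels of timeSlt, then group by timeSlt and take the sum of length. Slots with no events get an NA length from the join, which na.rm=TRUE turns into a sum of 0: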
dt[setDT(list(timeSlt= levels(dt$timeSlt))), on='timeSlt'
][, list(sumLength=sum(length, na.rm=TRUE)), by = timeSlt]
# timeSlt sumLength
#1: [0,4) 1069.452
#2: [4,8) 0.000
#3: [8,12) 0.000
#4: [12,16) 0.000
#5: [16,20) 0.000
#6: [20,24) 477.049
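The join-then-group idea above can also be written as a single join with by = .EACHI. This is a minimal sketch and not part of the original answer. It rebuilds the example data under the assumed names slots and all_slots, and makes the lookup table a factor with the same levels as dt$timeSlt so that no type coercion is needed in the join.

```r
library(data.table)

# Rebuild the example data; `slots` and `all_slots` are illustrative names
slots <- c("[0,4)", "[4,8)", "[8,12)", "[12,16)", "[16,20)", "[20,24)")
dt <- data.table(
  timeSlt = factor(rep(c("[20,24)", "[0,4)"), each = 5), levels = slots),
  length  = c(208.67, 188.49, 58.5, 3.469, 17.92,
              211.513, 193.045, 225.306, 212.715, 226.873)
)

# One row per factor level, in level order
all_slots <- data.table(timeSlt = factor(slots, levels = slots))

# by = .EACHI evaluates j once per row of all_slots; slots without a
# match see length = NA, so sum(..., na.rm = TRUE) yields 0 for them
res <- dt[all_slots, on = "timeSlt",
          .(sumLength = sum(length, na.rm = TRUE)), by = .EACHI]
print(res)
```

Because the result has one row per row of all_slots, the output follows the level order without a separate sort.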
A base R option would be:
as.data.frame(xtabs(length~timeSlt, dt))
# timeSlt Freq
#1 [0,4) 1069.452
#2 [4,8) 0.000
#3 [8,12) 0.000
#4 [12,16) 0.000
#5 [16,20) 0.000
#6 [20,24) 477.049
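xtabs keeps every level of a factor, including unused ones, which is why the empty slots show up as 0 here. If the column should be called sumLength instead of Freq, the responseName argument of the table method of as.data.frame can be used. The sketch below assumes a plain data.frame named df holding the same example data.

```r
# Rebuild the example data as a plain data.frame (`df` is an illustrative name)
slots <- c("[0,4)", "[4,8)", "[8,12)", "[12,16)", "[16,20)", "[20,24)")
df <- data.frame(
  timeSlt = factor(rep(c("[20,24)", "[0,4)"), each = 5), levels = slots),
  length  = c(208.67, 188.49, 58.5, 3.469, 17.92,
              211.513, 193.045, 225.306, 212.715, 226.873)
)

# xtabs sums `length` within each level of timeSlt, unused levels included;
# responseName renames the default "Freq" column
res <- as.data.frame(xtabs(length ~ timeSlt, data = df),
                     responseName = "sumLength")
print(res)
```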
Here is a dplyr approach:
library(tidyr)
library(dplyr)
library(rex)
time_slot_regex = rex("[",
digits %>% capture,
",",
digits %>% capture,
")")
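# one zero-length placeholder row per 4-hour slot, so empty slots still appear
# (tibble() replaces the deprecated data_frame())
time_slots =
tibble(start = 0:5 * 4,
end = start + 4,
length = 0)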
dt %>%
extract(timeSlt,
c("start", "end"),
time_slot_regex,
convert = TRUE) %>%
bind_rows(time_slots) %>%
group_by(start, end) %>%
summarize(sum_length = sum(length))
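A shorter dplyr variant may be possible when timeSlt is already a factor. This is a sketch under the assumption of a reasonably recent dplyr (the .drop argument of group_by needs 0.8.0 or later, and .groups needs 1.0.0 or later), and it is not part of the original answer. With .drop = FALSE, empty factor levels are kept as groups, and the sum of an empty numeric vector is 0.

```r
library(dplyr)

# Rebuild the example data (`df` is an illustrative name)
slots <- c("[0,4)", "[4,8)", "[8,12)", "[12,16)", "[16,20)", "[20,24)")
df <- data.frame(
  timeSlt = factor(rep(c("[20,24)", "[0,4)"), each = 5), levels = slots),
  length  = c(208.67, 188.49, 58.5, 3.469, 17.92,
              211.513, 193.045, 225.306, 212.715, 226.873)
)

# .drop = FALSE keeps unused factor levels as empty groups;
# sum() over an empty group returns 0
res <- df %>%
  group_by(timeSlt, .drop = FALSE) %>%
  summarise(sumLength = sum(length), .groups = "drop")
print(res)
```

This avoids parsing the slot labels with a regular expression, at the cost of relying on the factor levels being complete.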
tapply can also be used, although it returns NA rather than 0 for the empty slots:
res <- tapply(dt$length, dt$timeSlt, sum)
res
# [0,4) [4,8) [8,12) [12,16) [16,20) [20,24)
# 1069.452 NA NA NA NA 477.049
data.frame(timeSlt=names(res), sumLength=res, row.names=1:length(res))
# timeSlt sumLength
# 1 [0,4) 1069.452
# 2 [4,8) NA
# 3 [8,12) NA
# 4 [12,16) NA
# 5 [16,20) NA
# 6 [20,24) 477.049
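The tapply result above leaves NA in the empty slots, while the question asks for 0. One way to close that gap, assuming R 3.4.0 or later, is the default argument of tapply, which sets the value used for cells with no data. The vector names len and time_slt below are illustrative.

```r
# Rebuild the example columns as plain vectors
slots <- c("[0,4)", "[4,8)", "[8,12)", "[12,16)", "[16,20)", "[20,24)")
time_slt <- factor(rep(c("[20,24)", "[0,4)"), each = 5), levels = slots)
len <- c(208.67, 188.49, 58.5, 3.469, 17.92,
         211.513, 193.045, 225.306, 212.715, 226.873)

# default = 0 fills the cells for factor levels that have no observations
res <- tapply(len, time_slt, sum, default = 0)

out <- data.frame(timeSlt = names(res), sumLength = as.vector(res))
print(out)
```

On older R versions, res[is.na(res)] <- 0 after the original tapply call gives the same result.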
Great answer! Very clean. I learned something.