R: fill in missing dates and times in a quarter-hourly time series by subgroup
I have data of the following kind, only much larger:
DIST TALUK HOBLI CODE DATE REC_TIME RAIN
DK P1 A1 1503 01-06-19 00:00:00 22.5
DK P1 A1 1503 01-06-19 00:15:00 23.0
DK P1 A1 1503 01-06-19 00:30:00 23.0
DK P1 A1 1503 01-06-19 00:45:00 23.0
DK P1 A1 1503 01-06-19 01:00:00 23.0
DK P1 A1 1503 01-06-19 01:15:00 23.0
DK P1 A1 1503 01-06-19 01:30:00 23.0
DK P1 A1 1503 01-06-19 01:45:00 23.0
DK P1 A1 1503 01-06-19 02:00:00 23.0
DK P1 A2 515 01-06-19 22:15:00 23.0
DK P1 A2 515 01-06-19 22:30:00 23.0
DK P1 A2 515 01-06-19 22:45:00 23.0
DK P1 A2 515 01-06-19 23:00:00 23.0
DK P2 A3 633 01-07-19 22:15:00 23.0
DK P2 A3 633 01-07-19 22:30:00 24.0
DK P2 A3 633 01-07-19 22:45:00 24.0
DK P2 A3 633 01-07-19 23:00:00 24.0
DK P2 A3 633 01-07-19 23:15:00 24.0
DK P2 A3 633 01-07-19 23:30:00 29.0
DK P2 A3 633 01-07-19 23:45:00 32.0
DK P2 A3 633 02-07-19 00:00:00 36.0
DK P2 A3 633 02-07-19 00:15:00 36.0
DK P3 B1 845 01-06-19 05:30:00 36.0
DK P3 B1 845 01-06-19 05:45:00 36.0
DK P3 B1 845 01-06-19 06:00:00 36.0
DK P3 B1 845 01-06-19 06:15:00 36.0
DK P3 B1 845 01-06-19 06:30:00 36.0
DK P3 B1 845 01-06-19 06:45:00 36.0
DK P3 B1 845 01-06-19 07:00:00 36.0
DK P3 B1 845 01-06-19 07:15:00 36.0
DK P3 B2 789 01-06-19 07:30:00 36.0
DK P3 B2 789 01-06-19 07:45:00 36.0
DK P3 B2 789 01-06-19 08:00:00 36.0
DK P3 B2 789 01-06-19 08:15:00 36.0
DK P3 B2 789 01-06-19 08:30:00 36.0
DK P3 B2 789 01-06-19 08:45:00 0.0
DK P3 B2 789 01-06-19 09:00:00 0.0
DK P3 B2 789 01-06-19 09:15:00 0.0
DK P3 B2 789 01-06-19 09:30:00 0.0
DK P4 B4 801 22-08-19 00:00:00 0.0
DK P4 B4 801 22-08-19 00:15:00 0.0
DK P4 B4 801 22-08-19 00:30:00 0.5
DK P4 B4 801 22-08-19 00:45:00 0.5
DK P4 B4 801 22-08-19 22:30:00 0.5
DK P4 B4 801 22-08-19 22:45:00 0.5
DK P4 B4 801 30-11-19 21:45:00 0.5
DK P4 B4 801 30-11-19 22:00:00 0.5
DK P4 B4 801 30-11-19 22:15:00 0.5
DK P4 B4 801 30-11-19 22:30:00 2.0
DK P4 B4 801 30-11-19 22:45:00 5.5
DK P4 B4 801 30-11-19 23:00:00 5.5
DK P4 B4 801 30-11-19 23:15:00 5.5
DK P4 B4 801 30-11-19 23:30:00 5.5
DK P4 B4 801 30-11-19 23:45:00 5.5
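The data run from 01-06-19 (01-Jun-19) to 30-11-19 (30-Nov-19), with four readings per hour, but for some stations the observations for certain days and times in this series are missing. I want to fill in those missing dates and recording times so that every station has observations from 1 June 2019 to 30 November 2019. The RAIN variable should be NA for the filled-in dates and recording times.
I tried several options suggested on Stack Overflow but did not get the desired result.
I also tried the following: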
df_1 <- df[, .(RECORDED_DATE = seq(as.Date(min(df$RECORDED_DATE)), as.Date(max(df$RECORDED_DATE)), "day")), by = list(DISTRICT, TALUKNAME, HOBLINAME, TRGCODE, HOUR)]
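I also tried complete from the tidyverse but did not get the expected result, because of errors in the data frame. The dates are stored as character, and the merge fails both with the tidyverse and after converting them to double. I tried converting the character values to numeric, but the date column ended up filled with NA.
Any help would be greatly appreciated.

Using dplyr and tidyr, we can combine the date and time columns with unite, parse the result as a date-time, then use complete to build a sequence at 15-minute intervals from the min to the max DATETIME for every CODE, and finally put the date and time back into separate columns: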
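library(dplyr)
library(tidyr)

df %>%
  # Combine DATE and REC_TIME into one string, keeping the original columns
  unite(DATETIME, DATE, REC_TIME, sep = " ", remove = FALSE) %>%
  # Parse as date-time; the format must match the stored text (dd-mm-yy HH:MM:SS)
  mutate(DATETIME = as.POSIXct(DATETIME, format = "%d-%m-%y %T", tz = "UTC")) %>%
  # One row per station and 15-minute step between the first and last reading
  complete(CODE, DATETIME = seq(min(DATETIME), max(DATETIME), by = "15 min")) %>%
  # Rebuild DATE and REC_TIME from the completed time stamps
  mutate(DATE = as.Date(DATETIME), REC_TIME = format(DATETIME, "%T")) %>%
  select(-DATETIME) %>%
  # Carry the station labels into the newly created rows
  group_by(CODE) %>%
  fill(DIST, TALUK, HOBLI, .direction = "updown")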
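The as.POSIXct() step only works when the format string matches the text actually stored in the data. A likely cause of the "'to' must be a finite number" error reported in the comments is that some strings fail to parse, so min() and max() return NA and seq() has no valid endpoints. The sketch below is a minimal check in base R; the second input string is a made-up example of a mismatching layout, not taken from the asker's data.

```r
# String in the same day-month-year layout as the sample data
good <- as.POSIXct("01-06-19 00:15:00", format = "%d-%m-%y %H:%M:%S", tz = "UTC")

# Hypothetical string in a different layout: the same format no longer matches
bad <- as.POSIXct("2019-06-01 00:15:00", format = "%d-%m-%y %H:%M:%S", tz = "UTC")

is.na(good)  # FALSE
is.na(bad)   # TRUE

# Count unparsed values before calling seq(min(x), max(x), by = "15 min")
x <- c(good, bad)
n_unparsed <- sum(is.na(x))
n_unparsed
```

If the count is above zero on the real data, the format string (or the source column) needs fixing before the sequence can be built.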
Data
df <- structure(list(DIST = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "DK", class = "factor"),
TALUK = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("P1",
"P2", "P3", "P4"), class = "factor"), HOBLI = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("A1", "A2", "A3",
"B1", "B2", "B4"), class = "factor"), CODE = c(1503L, 1503L,
1503L, 1503L, 1503L, 1503L, 1503L, 1503L, 1503L, 515L, 515L,
515L, 515L, 633L, 633L, 633L, 633L, 633L, 633L, 633L, 633L,
633L, 845L, 845L, 845L, 845L, 845L, 845L, 845L, 845L, 789L,
789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 801L, 801L,
801L, 801L, 801L, 801L, 801L, 801L, 801L, 801L, 801L, 801L,
801L, 801L, 801L), DATE = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L), .Label = c("01-06-19", "01-07-19", "02-07-19",
"22-08-19", "30-11-19"), class = "factor"), REC_TIME = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 29L, 30L, 31L, 32L, 29L,
30L, 31L, 32L, 33L, 34L, 35L, 1L, 2L, 10L, 11L, 12L, 13L,
14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L,
26L, 1L, 2L, 3L, 4L, 30L, 31L, 27L, 28L, 29L, 30L, 31L, 32L,
33L, 34L, 35L), .Label = c("00:00:00", "00:15:00", "00:30:00",
"00:45:00", "01:00:00", "01:15:00", "01:30:00", "01:45:00",
"02:00:00", "05:30:00", "05:45:00", "06:00:00", "06:15:00",
"06:30:00", "06:45:00", "07:00:00", "07:15:00", "07:30:00",
"07:45:00", "08:00:00", "08:15:00", "08:30:00", "08:45:00",
"09:00:00", "09:15:00", "09:30:00", "21:45:00", "22:00:00",
"22:15:00", "22:30:00", "22:45:00", "23:00:00", "23:15:00",
"23:30:00", "23:45:00"), class = "factor"), RAIN = c(22.5,
23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24,
24, 24, 29, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36,
36, 36, 36, 36, 0, 0, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 2, 5.5, 5.5, 5.5, 5.5, 5.5)), class = "data.frame", row.names = c(NA, -54L))
If your dataset is large, using data.table may be faster:
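# Cross join of every station CODE with every 15-minute time stamp, followed by
# a rolling join that pulls in the nearest existing row for each grid point
ans <- DT[CJ(CODE, dt = seq(min(dt), max(dt), by = "15 mins"), unique = TRUE),
    on = .(CODE, dt), roll = "nearest"]
# Rows whose matched DateTime differs from the grid time were missing:
# rebuild DATE and REC_TIME from the grid time and set RAIN to NA
ans[DateTime != dt, `:=`(
    DATE = format(dt, format = "%d-%m-%y"),
    REC_TIME = format(dt, format = "%H:%M:%S"),
    RAIN = NA_real_
)][, DateTime := NULL]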
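To see why the rolling join is paired with the DateTime != dt test, here is a small sketch on made-up data. The names toy, grid and res and the RAIN values are illustrative assumptions. With roll = "nearest", every grid time stamp picks up the closest existing row, which carries the station columns along; comparing the copied DateTime with the grid time then flags the rows that were not really observed.

```r
library(data.table)

# Toy station with readings at 00:00 and 00:45 only (made-up values)
toy <- data.table(
  CODE = c(1L, 1L),
  dt = as.POSIXct(c("2019-06-01 00:00:00", "2019-06-01 00:45:00"), tz = "UTC"),
  RAIN = c(1.5, 2.5)
)
toy[, DateTime := dt]  # copy of the observed time, kept for comparison

# Full 15-minute grid for this station
grid <- CJ(
  CODE = 1L,
  dt = seq(as.POSIXct("2019-06-01 00:00:00", tz = "UTC"),
           as.POSIXct("2019-06-01 00:45:00", tz = "UTC"),
           by = "15 min")
)

# Each grid row is matched to the nearest observed row within the same CODE
res <- toy[grid, on = .(CODE, dt), roll = "nearest"]

# Grid times that were not observed have DateTime != dt: set RAIN to NA there
res[DateTime != dt, RAIN := NA_real_]
res
```

A plain (non-rolling) join would also give NA for RAIN, but it would leave the other non-key columns NA as well, which is presumably why the rolling join is used here.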
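It gives the error: Error in seq.int(0, to0 - from, by) : 'to' must be a finite number. Also, can we fill in the corresponding names and station code, instead of NA, for the missing dates?
@Ajay Which column is your station code? If it is DIST, change the complete line to complete(DIST, DATETIME = seq(min(DATETIME), max(DATETIME), by = "15 min")). To fill in the corresponding values, you can add %>% fill(everything()) at the end.
In the dataset above, the station code is the CODE column. I still get Error in seq.int(0, to0 - from, by) : 'to' must be a finite number. Do I need to write seq(as.Date(min(DATETIME)))? DATETIME is not a date object, so I cannot write it that way.
I have updated the data I am using. Can you check whether you get the result with these data?
Based on your dataset, it assigns NA to the RAIN column.
Data and setup used by the data.table code (to be run before the join):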
library(data.table)
DT <- fread("DIST TALUK HOBLI CODE DATE REC_TIME RAIN
DK P1 A1 1503 01-06-19 00:00:00 22.5
DK P1 A1 1503 01-06-19 00:15:00 23.0
DK P1 A1 1503 01-06-19 00:30:00 23.0
DK P1 A1 1503 01-06-19 00:45:00 23.0
DK P1 A1 1503 01-06-19 01:00:00 23.0
DK P1 A1 1503 01-06-19 01:15:00 23.0
DK P1 A1 1503 01-06-19 01:30:00 23.0
DK P1 A1 1503 01-06-19 01:45:00 23.0
DK P1 A1 1503 01-06-19 02:00:00 23.0
DK P1 A2 515 01-06-19 22:15:00 23.0
DK P1 A2 515 01-06-19 22:30:00 23.0
DK P1 A2 515 01-06-19 22:45:00 23.0
DK P1 A2 515 01-06-19 23:00:00 23.0
DK P2 A3 633 01-07-19 22:15:00 23.0
DK P2 A3 633 01-07-19 22:30:00 24.0
DK P2 A3 633 01-07-19 22:45:00 24.0
DK P2 A3 633 01-07-19 23:00:00 24.0
DK P2 A3 633 01-07-19 23:15:00 24.0
DK P2 A3 633 01-07-19 23:30:00 29.0
DK P2 A3 633 01-07-19 23:45:00 32.0
DK P2 A3 633 02-07-19 00:00:00 36.0
DK P2 A3 633 02-07-19 00:15:00 36.0
DK P3 B1 845 01-06-19 05:30:00 36.0
DK P3 B1 845 01-06-19 05:45:00 36.0
DK P3 B1 845 01-06-19 06:00:00 36.0
DK P3 B1 845 01-06-19 06:15:00 36.0
DK P3 B1 845 01-06-19 06:30:00 36.0
DK P3 B1 845 01-06-19 06:45:00 36.0
DK P3 B1 845 01-06-19 07:00:00 36.0
DK P3 B1 845 01-06-19 07:15:00 36.0
DK P3 B2 789 01-06-19 07:30:00 36.0
DK P3 B2 789 01-06-19 07:45:00 36.0
DK P3 B2 789 01-06-19 08:00:00 36.0
DK P3 B2 789 01-06-19 08:15:00 36.0
DK P3 B2 789 01-06-19 08:30:00 36.0
DK P3 B2 789 01-06-19 08:45:00 0.0
DK P3 B2 789 01-06-19 09:00:00 0.0
DK P3 B2 789 01-06-19 09:15:00 0.0
DK P3 B2 789 01-06-19 09:30:00 0.0
DK P4 B4 801 22-08-19 00:00:00 0.0
DK P4 B4 801 22-08-19 00:15:00 0.0
DK P4 B4 801 22-08-19 00:30:00 0.5
DK P4 B4 801 22-08-19 00:45:00 0.5
DK P4 B4 801 22-08-19 22:30:00 0.5
DK P4 B4 801 22-08-19 22:45:00 0.5
DK P4 B4 801 30-11-19 21:45:00 0.5
DK P4 B4 801 30-11-19 22:00:00 0.5
DK P4 B4 801 30-11-19 22:15:00 0.5
DK P4 B4 801 30-11-19 22:30:00 2.0
DK P4 B4 801 30-11-19 22:45:00 5.5
DK P4 B4 801 30-11-19 23:00:00 5.5
DK P4 B4 801 30-11-19 23:15:00 5.5
DK P4 B4 801 30-11-19 23:30:00 5.5
DK P4 B4 801 30-11-19 23:45:00 5.5")
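# Parse DATE and REC_TIME into one date-time column; paste() inserts the space
# that the format string expects, and a fixed time zone avoids DST gaps
DT[, dt := as.POSIXct(paste(DATE, REC_TIME), format = "%d-%m-%y %H:%M:%S", tz = "UTC")]
# Keep a copy of the observed time so filled-in rows can be detected after the join
DT[, DateTime := dt]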
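Both answers take the start and end of the grid from min() and max() of the data, whereas the question asks for every station to cover a fixed window. As a rough, dependency-free sketch in base R, the grid can be built from explicit endpoints and left-joined to the observations. The data frame obs, its values and the one-hour window are assumptions made to keep the example small; for the real data the endpoints would be 2019-06-01 00:00:00 and 2019-11-30 23:45:00.

```r
# Toy observations for two stations with gaps (made-up values)
obs <- data.frame(
  CODE = c(1L, 1L, 2L),
  DATETIME = as.POSIXct(c("2019-06-01 00:00:00", "2019-06-01 00:30:00",
                          "2019-06-01 00:15:00"), tz = "UTC"),
  RAIN = c(1.0, 2.0, 3.0)
)

# Fixed window instead of min()/max() of the data (short window for illustration)
grid_times <- seq(as.POSIXct("2019-06-01 00:00:00", tz = "UTC"),
                  as.POSIXct("2019-06-01 00:45:00", tz = "UTC"),
                  by = "15 min")

# Every station crossed with every 15-minute time stamp
full_grid <- expand.grid(CODE = unique(obs$CODE), DATETIME = grid_times)

# Left join: grid rows without an observation get RAIN = NA
filled <- merge(full_grid, obs, by = c("CODE", "DATETIME"), all.x = TRUE)
filled <- filled[order(filled$CODE, filled$DATETIME), ]
filled
```

Station-level columns such as DIST, TALUK and HOBLI are not part of this toy example; on the real data they could be joined back in from a one-row-per-station lookup table.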