Handling missing values in a time series in R
I am working with time-series data and need the timestamps to be continuous, but a few data points were missed during capture, as shown below.
DF
ID Time_Stamp A B C
1 02/02/2018 07:45:00 123 567 434
2 02/02/2018 07:45:01
..... ...
5 02/02/2018 07:46:00
6 02/02/2018 07:46:10 112 2323 2323
As seen in the sample df above, the timestamps are continuous up to row 5, but 10 seconds of captured data are missing between rows 5 and 6. My data frame has about 60,000 rows, and manually identifying the missing values is tedious, so I am looking for a way to automate the handling of the missing values in R.
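Before filling anything in, the gaps themselves can be located programmatically rather than by eye. A minimal base-R sketch, using a small hand-made timestamp vector rather than the real 60,000-row data:

```r
# Illustration: find where consecutive timestamps differ by more than one second.
ts <- as.POSIXct(c("2018-02-02 07:45:00", "2018-02-02 07:45:01",
                   "2018-02-02 07:45:02", "2018-02-02 07:45:10"),
                 tz = "UTC")

gap_sec <- diff(as.numeric(ts))  # seconds between successive rows: 1 1 8
which(gap_sec > 1)               # row index after which a gap starts: 3
```

This only reports where the gaps are; the `complete()`/`pad()` approaches below still do the actual row insertion.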
My resulting data frame should look like this:
ID Time_Stamp A B C
1 02/02/2018 07:45:00 123 567 434
2 02/02/2018 07:45:01
..... ...
5 02/02/2018 07:46:00 mean(A1:A5)
5.1 02/02/2018 07:46:01 mean(A1:A5) mean(B1:B5) mean(C1:C5)
5.2 02/02/2018 07:46:02 mean(A1:A5) mean(B1:B5) mean(C1:C5)
5.3 02/02/2018 07:46:03 mean(A1:A5) mean(B1:B5) mean(C1:C5)
5.4 02/02/2018 07:46:04 mean(A1:A5) mean(B1:B5) mean(C1:C5)
5.5 02/02/2018 07:46:05 mean(A1:A5) mean(B1:B5) mean(C1:C5)
5.6 02/02/2018 07:46:06 mean(A1:A5) mean(B1:B5) mean(C1:C5)
5.7 02/02/2018 07:46:07 mean(A1:A5) mean(B1:B5) mean(C1:C5)
5.8 02/02/2018 07:46:08 mean(A1:A5) mean(B1:B5) mean(C1:C5)
5.9 02/02/2018 07:46:09 mean(A1:A5) mean(B1:B5) mean(C1:C5)
6 02/02/2018 07:46:10 112 2323 2323
6.1 02/02/2018 07:46:11 mean(A1:A15) mean(B1:B15) mean(C1:C15)
It could even be the mean of the previous few rows in that time window, e.g.
6.1 02/02/2018 07:46:11 mean(A14:A17) mean(B14:B17) mean(C14:C17)
i.e. every missing value is filled, except for the missing Time_Stamp values themselves.
I have already written the following code to fill the gaps with the mean of the entire column:
library(dplyr)
library(tidyr)

df %>%
  complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>%
  mutate_at(vars(A:C), ~replace(., is.na(.), mean(., na.rm = TRUE))) %>%
  mutate(ID = row_number())
It replaces every NA with the overall column mean, as shown in the output below. It works, but I need to modify it. How can I do that?
Please help.

Here is an approach combining the tidyverse and base R. We first create, for each column, a new column holding its cumulative mean. We then complete the missing observations and replace the NAs with the corresponding means from those helper columns.
library(tidyverse)

cols <- c("A", "B", "C")

df1 <- df %>%
  mutate_at(cols, list(mean = ~cummean(.))) %>%
  complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>%
  fill(ends_with("mean")) %>%
  mutate(ID = row_number())

mean_cols <- grep("_mean$", names(df1))
df1[cols] <- Map(function(x, y) ifelse(is.na(x), y, x), df1[cols], df1[mean_cols])

df1[names(df)]
# ID Time_Stamp A B C
# <int> <dttm> <dbl> <dbl> <dbl>
# 1 1 2018-02-02 07:45:00 123 567 434
# 2 2 2018-02-02 07:45:01 234 100 110
# 3 3 2018-02-02 07:45:02 234 100 110
# 4 4 2018-02-02 07:45:03 197 256. 218
# 5 5 2018-02-02 07:45:04 197 256. 218
# 6 6 2018-02-02 07:45:05 197 256. 218
# 7 7 2018-02-02 07:45:06 197 256. 218
# 8 8 2018-02-02 07:45:07 197 256. 218
# 9 9 2018-02-02 07:45:08 197 256. 218
#10 10 2018-02-02 07:45:09 197 256. 218
#11 11 2018-02-02 07:45:10 112 2323 2323
#12 12 2018-02-02 07:45:11 176. 772. 744.
#13 13 2018-02-02 07:45:12 176. 772. 744.
#14 14 2018-02-02 07:45:13 176. 772. 744.
#15 15 2018-02-02 07:45:14 176. 772. 744.
#16 16 2018-02-02 07:45:15 100 23 12
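For the "mean of the previous few rows" variant mentioned in the question, one option is a base-R helper applied after `complete()` has inserted the NA rows. This is only a sketch: `fill_last_k` is a made-up name, and values filled earlier in the loop count as observed for later gaps.

```r
# Replace each NA with the mean of the k most recent non-NA values.
# Values filled earlier in the loop are treated as observed for later gaps.
fill_last_k <- function(x, k = 4) {
  for (i in which(is.na(x))) {
    prev <- x[seq_len(i - 1)]
    prev <- prev[!is.na(prev)]
    if (length(prev) > 0) x[i] <- mean(tail(prev, k))
  }
  x
}

fill_last_k(c(1, 2, NA, 4), k = 2)  # the NA becomes mean(c(1, 2)) = 1.5
```

It could then be applied column-wise, e.g. `df1[cols] <- lapply(df1[cols], fill_last_k)`.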
Data
df <- structure(list(ID = c(1, 2, 3, 4, 5), Time_Stamp = structure(c(1517557500,
1517557501, 1517557502, 1517557510, 1517557515), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), A = c(123, 234, 234, 112, 100), B = c(567,
100, 100, 2323, 23), C = c(434, 110, 110, 2323, 12)), row.names = c(NA,
-5L), class = "data.frame")
There is a very intuitive package designed for exactly this purpose, called "padr". I think you will find that it meets your needs:
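A minimal sketch of that route, assuming the padr package's `pad()` and `fill_by_function()` work as documented; the data frame mirrors the sample data above:

```r
library(padr)   # assumption: padr is installed
library(dplyr)

df <- data.frame(
  Time_Stamp = as.POSIXct(c("2018-02-02 07:45:00", "2018-02-02 07:45:01",
                            "2018-02-02 07:45:02", "2018-02-02 07:45:10",
                            "2018-02-02 07:45:15"), tz = "UTC"),
  A = c(123, 234, 234, 112, 100),
  B = c(567, 100, 100, 2323, 23),
  C = c(434, 110, 110, 2323, 12)
)

padded <- df %>%
  pad(interval = "sec") %>%                # one row per second, NAs in the gaps
  fill_by_function(A, B, C, fun = mean)    # replace NAs with each column's mean
```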
I would suggest looking at tidyr::fill, or possibly the time-filling functionality from the tsibble package. Please provide us with a reproducible dataset to test our proposed solutions on.
@RonakShah This worked for me. Sorry for the late reply.
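A sketch of the tsibble route, assuming the tsibble package: `fill_gaps()` inserts the missing seconds as NA rows, which can then be filled (here with the column mean):

```r
library(tsibble)  # assumption: tsibble is installed
library(dplyr)

df <- data.frame(
  Time_Stamp = as.POSIXct(c("2018-02-02 07:45:00", "2018-02-02 07:45:01",
                            "2018-02-02 07:45:10"), tz = "UTC"),
  A = c(123, 234, 112)
)

out <- df %>%
  as_tsibble(index = Time_Stamp) %>%       # declare Time_Stamp as the time index
  fill_gaps() %>%                          # one NA row per missing second
  mutate(A = ifelse(is.na(A), mean(A, na.rm = TRUE), A))
```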
df
# ID Time_Stamp A B C
#1 1 2018-02-02 07:45:00 123 567 434
#2 2 2018-02-02 07:45:01 234 100 110
#3 3 2018-02-02 07:45:02 234 100 110
#4 4 2018-02-02 07:45:10 112 2323 2323
#5 5 2018-02-02 07:45:15 100 23 12