R: Impute missing values in the middle

Tags: r, data.table, zoo

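I have group-level data. It looks as shown below.

The actual data is in Value and the required data is in Expected_Value.

I tried the following code:

setDT(file_to_share)[,Expected_Value := na.locf(na.locf(Value, na.rm=FALSE), fromLast=TRUE),by = c("Group_A",   "Group_B")]

But with this code, the imputation is applied to all missing values. I want a missing value to be filled only if it lies between two values. The missing value should then be a copy of the previous available value.

If someone could guide me on how to do this, it would be a great help.

Note: I am trying to do this with data.table and zoo, but any other approach is fine.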
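Even though you are looking for a data.table solution, here is one that uses the tidyverse. (If time permits, I may try to translate it to data.table.)

The idea is to create a grouping variable that captures your weeks, and to fill Value within Group_A, GROUP_B and the week group (called grp here). We also create a copy of Value (value1) and fill it fromLast (.direction = 'up' in tidyr terms). We then create another grouping variable (grp1) from the cumulative sum of the NA values, and replace the values in the Value column with NA wherever the size of the new group (Group_A, GROUP_B, grp, grp1) is 1 or value1 is NA. This gives the expected result.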

library(tidyverse)

df2 <- df1 %>% 
  mutate(Date = as.POSIXct(Date, format = '%m/%d/%Y')) %>% 
  mutate(value1 = Value) %>%
  group_by(Group_A, GROUP_B, grp = cumsum(format(Date, '%d')=='01'))%>% 
  fill(Value) %>% 
  fill(value1, .direction = 'up') %>% 
  mutate(grp1 = cumsum(is.na(Value))) %>% 
  group_by(Group_A, GROUP_B, grp, grp1) %>% 
  mutate(new = n(), Value = replace(Value, new == 1 | is.na(value1), NA)) %>%
  ungroup() %>%
  select(-c(value1, grp, grp1, new))

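The OP has asked to fill only those NA values that lie between other values within each group. This means that any sequence of NA values at the beginning or end of a group has to be skipped when applying zoo::na.locf().

With data.table, this can be done by identifying the indices of the rows to skip and using a kind of anti-join (see the setDT(DT)[...] call further below):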

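Explanation
  • For each group, the streaks of NA / non-NA values are numbered.
  • The first and the last streak of each group are picked and their row indices are retrieved from the special symbol .I. (Since DT is updated in place, it does not matter whether the first or last streak contains NA; those rows are not updated anyway.)
  • The indices found are excluded from the update by the leading ! (the anti-join part), and zoo::na.locf() is applied to the remaining rows.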
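
The streak numbering described in the explanation can be illustrated on a single group. This is only a small sketch: the vector v is copied from the first group of the sample data, and the variable names streak and edge_rows are made up for the illustration.

```r
library(data.table)

# Values of the first group (GROUP_1 / Group_1_1) in the sample data
v <- c(NA, NA, 34L, 20L, NA, NA, 38L)

# rleid() gives every run of consecutive identical values its own id,
# so runs of NA and runs of non-NA are numbered 1, 2, 3, ...
streak <- rleid(is.na(v))
print(streak)

# Positions belonging to the first or the last streak; these are the
# rows that the answer excludes from the update
edge_rows <- which(streak %in% c(1L, max(streak)))
print(edge_rows)
```

Inside the grouped data.table call, .I plays the role of which() here, except that it returns row numbers of the whole table rather than positions within the group.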

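Comments:
  • Have you tried the complete function in tidyr?
  • @reuben: Thanks for your comment. No, I have not tried tidyr.
  • Shouldn't rows 21:23 be filled as well?
  • @Sotos: No. In Group 2 x Group 1, nothing gets filled. That is because there is no missing value between the values in Value. I created this example specifically to highlight my problem better. Thanks!
  • Beautifully written answer. It meets all my requirements. Thanks!
  • I usually don't change my accepted answer. But unfortunately, the other answer is very similar to the way I was approaching it, so I had to change my mind. Many thanks for your answer, though. It is equally awesome!! I have also upvoted some of your other answers. They are equally good.
  • No problem. Thanks :)

Output of the tidyverse solution (df2):
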
    # A tibble: 42 × 5
       Group_A   GROUP_B       Date Value Expected_Value
         <chr>     <chr>     <dttm> <int>          <int>
    1  GROUP_1 Group_1_1 2017-01-01    NA             NA
    2  GROUP_1 Group_1_1 2017-01-02    NA             NA
    3  GROUP_1 Group_1_1 2017-01-03    34             34
    4  GROUP_1 Group_1_1 2017-01-04    20             20
    5  GROUP_1 Group_1_1 2017-01-05    20             20
    6  GROUP_1 Group_1_1 2017-01-06    20             20
    7  GROUP_1 Group_1_1 2017-01-07    38             38
    8  GROUP_1 Group_1_2 2017-01-01    35             35
    9  GROUP_1 Group_1_2 2017-01-02    28             28
    10 GROUP_1 Group_1_2 2017-01-03    28             28
    # ... with 32 more rows
    
    #Where,
    
    identical(df2$Value, df2$Expected_Value)
    #[1] TRUE
    
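    library(data.table)
    # Collect the row indices of the first and last NA / non-NA streak of each
    # group, exclude them with `!`, and carry values forward in the other rows.
    setDT(DT)[!DT[, {
      na_grp <- rleid(is.na(Value))        # number the streaks within the group
      .I[na_grp %in% c(1L, max(na_grp))]   # row indices of first and last streak
    }, by = .(Group_A, GROUP_B)]$V1, Value := zoo::na.locf(Value)][]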
    
        Group_A    GROUP_B     Date Value Expected_Value
     1: GROUP_1  Group_1_1 1/1/2017    NA             NA
     2: GROUP_1  Group_1_1 1/2/2017    NA             NA
     3: GROUP_1  Group_1_1 1/3/2017    34             34
     4: GROUP_1  Group_1_1 1/4/2017    20             20
     5: GROUP_1  Group_1_1 1/5/2017    20             20
     6: GROUP_1  Group_1_1 1/6/2017    20             20
     7: GROUP_1  Group_1_1 1/7/2017    38             38
     8: GROUP_1  Group_1_2 1/1/2017    35             35
     9: GROUP_1  Group_1_2 1/2/2017    28             28
    10: GROUP_1  Group_1_2 1/3/2017    20             28
    11: GROUP_1  Group_1_2 1/4/2017    32             32
    12: GROUP_1  Group_1_2 1/5/2017    39             39
    13: GROUP_1  Group_1_2 1/6/2017    28             28
    14: GROUP_1  Group_1_2 1/7/2017    NA             NA
    15: GROUP_2 Group_1_11 1/1/2017    NA             NA
    16: GROUP_2 Group_1_11 1/2/2017    NA             NA
    17: GROUP_2 Group_1_11 1/3/2017    40             40
    18: GROUP_2 Group_1_11 1/4/2017    32             32
    19: GROUP_2 Group_1_11 1/5/2017    20             20
    20: GROUP_2 Group_1_11 1/6/2017    NA             NA
    21: GROUP_2 Group_1_11 1/7/2017    NA             NA
    22: GROUP_2 Group_1_21 1/1/2017    NA             NA
    23: GROUP_2 Group_1_21 1/2/2017    32             32
    24: GROUP_2 Group_1_21 1/3/2017    36             36
    25: GROUP_2 Group_1_21 1/4/2017    36             36
    26: GROUP_2 Group_1_21 1/5/2017    28             28
    27: GROUP_2 Group_1_21 1/6/2017    33             33
    28: GROUP_2 Group_1_21 1/7/2017    40             40
    29: GROUP_3 Group_1_13 1/1/2017    NA             NA
    30: GROUP_3 Group_1_13 1/2/2017    NA             NA
    31: GROUP_3 Group_1_13 1/3/2017    NA             NA
    32: GROUP_3 Group_1_13 1/4/2017    29             29
    33: GROUP_3 Group_1_13 1/5/2017    31             31
    34: GROUP_3 Group_1_13 1/6/2017    31             31
    35: GROUP_3 Group_1_13 1/7/2017    34             34
    36: GROUP_3 Group_1_23 1/1/2017    26             26
    37: GROUP_3 Group_1_23 1/2/2017    33             33
    38: GROUP_3 Group_1_23 1/3/2017    27             27
    39: GROUP_3 Group_1_23 1/4/2017    23             23
    40: GROUP_3 Group_1_23 1/5/2017    25             25
    41: GROUP_3 Group_1_23 1/6/2017    41             41
    42: GROUP_3 Group_1_23 1/7/2017    25             25
        Group_A    GROUP_B     Date Value Expected_Value
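
In the printed result above, row 10 shows Value 20 where Expected_Value is 28. A likely reason is that na.locf() runs once over all non-excluded rows rather than per group. When a group's first streak is non-NA, it is excluded, and the NA that follows it can then pick up a value carried over from the previous group. Below is a minimal sketch of a per-group alternative. The helper fill_interior() and the table toy are made up for illustration and are not part of the original answer.

```r
library(data.table)

# Hypothetical helper: carry the last observation forward only for NAs
# that have an observed value both before and after them.
fill_interior <- function(x) {
  ok <- which(!is.na(x))
  if (length(ok) < 2L) return(x)          # nothing can lie between two values
  inner <- seq(ok[1L], ok[length(ok)])    # span from first to last observed value
  x[inner] <- zoo::na.locf(x[inner])      # the span starts with a non-NA value
  x
}

# Made-up toy data: two groups with leading, interior and trailing NAs
toy <- data.table(
  g = rep(c("a", "b"), each = 5L),
  v = c(NA, 1L, NA, 3L, NA, 5L, NA, NA, 7L, NA)
)

# Apply the helper separately within each group
toy[, filled := fill_interior(v), by = g]
print(toy)
```

On the original sample data (before Value is overwritten) the same helper could be applied with DT[, Filled := fill_interior(Value), by = .(Group_A, GROUP_B)] and compared against Expected_Value with identical().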
    
    # Sample data used in the answers (Value = actual, Expected_Value = desired)
    DT <- structure(list(Group_A = c("GROUP_1", "GROUP_1", "GROUP_1", "GROUP_1", 
    "GROUP_1", "GROUP_1", "GROUP_1", "GROUP_1", "GROUP_1", "GROUP_1", 
    "GROUP_1", "GROUP_1", "GROUP_1", "GROUP_1", "GROUP_2", "GROUP_2", 
    "GROUP_2", "GROUP_2", "GROUP_2", "GROUP_2", "GROUP_2", "GROUP_2", 
    "GROUP_2", "GROUP_2", "GROUP_2", "GROUP_2", "GROUP_2", "GROUP_2", 
    "GROUP_3", "GROUP_3", "GROUP_3", "GROUP_3", "GROUP_3", "GROUP_3", 
    "GROUP_3", "GROUP_3", "GROUP_3", "GROUP_3", "GROUP_3", "GROUP_3", 
    "GROUP_3", "GROUP_3"), GROUP_B = c("Group_1_1", "Group_1_1", 
    "Group_1_1", "Group_1_1", "Group_1_1", "Group_1_1", "Group_1_1", 
    "Group_1_2", "Group_1_2", "Group_1_2", "Group_1_2", "Group_1_2", 
    "Group_1_2", "Group_1_2", "Group_1_11", "Group_1_11", "Group_1_11", 
    "Group_1_11", "Group_1_11", "Group_1_11", "Group_1_11", "Group_1_21", 
    "Group_1_21", "Group_1_21", "Group_1_21", "Group_1_21", "Group_1_21", 
    "Group_1_21", "Group_1_13", "Group_1_13", "Group_1_13", "Group_1_13", 
    "Group_1_13", "Group_1_13", "Group_1_13", "Group_1_23", "Group_1_23", 
    "Group_1_23", "Group_1_23", "Group_1_23", "Group_1_23", "Group_1_23"
    ), Date = c("1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", 
    "1/6/2017", "1/7/2017", "1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", 
    "1/5/2017", "1/6/2017", "1/7/2017", "1/1/2017", "1/2/2017", "1/3/2017", 
    "1/4/2017", "1/5/2017", "1/6/2017", "1/7/2017", "1/1/2017", "1/2/2017", 
    "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017", "1/7/2017", "1/1/2017", 
    "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017", "1/7/2017", 
    "1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017", 
    "1/7/2017"), Value = c(NA, NA, 34L, 20L, NA, NA, 38L, 35L, 28L, 
    NA, 32L, 39L, 28L, NA, NA, NA, 40L, 32L, 20L, NA, NA, NA, 32L, 
    36L, NA, 28L, 33L, 40L, NA, NA, NA, 29L, 31L, NA, 34L, 26L, 33L, 
    27L, 23L, 25L, 41L, 25L), Expected_Value = c(NA, NA, 34L, 20L, 
    20L, 20L, 38L, 35L, 28L, 28L, 32L, 39L, 28L, NA, NA, NA, 40L, 
    32L, 20L, NA, NA, NA, 32L, 36L, 36L, 28L, 33L, 40L, NA, NA, NA, 
    29L, 31L, 31L, 34L, 26L, 33L, 27L, 23L, 25L, 41L, 25L)), .Names = c("Group_A", 
    "GROUP_B", "Date", "Value", "Expected_Value"), row.names = c(NA, 
    -42L), class = "data.frame")