Finding missing months after grouping with dplyr

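I have a data frame with two columns that I group by using dplyr, one column holding the month (as a number, e.g. 1 to 12), and several columns of statistics (the values are not important). For example: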

ID_1   ID_2   month  st1    st2
1      1      1      0.5    0.2
1      1      2      0.7    0.9
1      1      3      1.1    1.7
1      1      4      2.6    0.8
1      1      5      1.8    1.3
1      1      6      2.1    2.2
1      1      7      0.5    0.2
1      1      8      0.7    0.9
1      1      9      1.1    1.7
1      1      10     2.6    0.8
1      1      11     1.8    1.3
1      1      12     2.1    2.2
1      2      1      0.5    0.2
1      2      2      0.7    0.9
1      2      3      1.1    1.7
1      2      4      2.6    0.8
1      2      5      1.8    1.3
1      2      6      2.1    2.2
1      2      7      0.5    0.2
1      2      9      1.1    1.7
1      2      10     2.6    0.8
1      2      11     1.8    1.3
1      2      12     2.1    2.2

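For the second group (ID_1 = 1 and ID_2 = 2), one month is missing from the data (month = 8). Is there a way to find this month and insert a row containing the correct ID_1 and ID_2 values, the missing month value, and NA for the remaining columns? I have been trying to do this with dplyr functions but can't seem to figure it out. A non-dplyr solution would be fine too.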

PS: If it helps, each unique grouping of ID_1 and ID_2 will be missing no more than one month.

Expand a grid to generate every combination of the groups and months, then merge:

# make reference with all needed rows
ref <- data.frame(expand.grid(unique(df1$ID_1),
                              unique(df1$ID_2),
                              1:12))
colnames(ref) <- colnames(df1)[1:3]

# then merge with all = TRUE
res <- merge(df1, ref, all = TRUE)

# to check output, show only month = 8
res[ res$month == 8, ]
#    ID_1 ID_2 month st1 st2
# 8     1    1     8 0.7 0.9
# 20    1    2     8  NA  NA
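
One caveat with expand.grid on the unique ID values: it builds every ID_1 x ID_2 combination, including pairs that never occur together in the data. A minimal sketch of a variant that keeps only the observed pairs is below. It assumes the question's data frame is called df1 and rebuilds it so the snippet is self-contained; full and ids are helper names introduced only for illustration.

```r
# Rebuild the example data: 24 complete rows, then drop
# the row for ID_1 = 1, ID_2 = 2, month = 8 (as in the question).
full <- data.frame(
  ID_1  = 1L,
  ID_2  = rep(1:2, each = 12),
  month = rep(1:12, times = 2),
  st1   = rep(c(0.5, 0.7, 1.1, 2.6, 1.8, 2.1), times = 4),
  st2   = rep(c(0.2, 0.9, 1.7, 0.8, 1.3, 2.2), times = 4)
)
df1 <- full[!(full$ID_2 == 2 & full$month == 8), ]

# Only the (ID_1, ID_2) pairs that actually occur in the data
ids <- unique(df1[, c("ID_1", "ID_2")])

# Cross join the observed pairs with months 1..12
# (by = NULL makes merge() return the Cartesian product)
ref <- merge(ids, data.frame(month = 1:12), by = NULL)

# Outer merge: rows that exist only in ref get NA in st1 and st2
res <- merge(df1, ref, all = TRUE)

# Show only the rows that were inserted
res[is.na(res$st1), ]
```

With the example data both approaches give the same result, because the two ID pairs present are also the only possible combinations. The difference shows up only when some ID_1/ID_2 pairs do not exist at all.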

This can be done with tidyr::complete:
library(dplyr)
library(tidyr)

dat %>% 
    group_by(ID_1, ID_2) %>%
    complete(month = 1:12)

Tail of the dataset:
Source: local data frame [6 x 5]
Groups: ID_1, ID_2 [1]

   ID_1  ID_2 month   st1   st2
  <int> <int> <int> <dbl> <dbl>
1     1     2     7   0.5   0.2
2     1     2     8    NA    NA
3     1     2     9   1.1   1.7
4     1     2    10   2.6   0.8
5     1     2    11   1.8   1.3
6     1     2    12   2.1   2.2
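
If you only want to know which months are missing, without inserting rows, an anti-join against the full set of expected rows is one option. This is a sketch under the assumption that the data frame is called dat as above; full, expected and missing_months are names made up for illustration, and the data is rebuilt so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data (month 8 dropped for ID_1 = 1, ID_2 = 2)
full <- data.frame(
  ID_1  = 1L,
  ID_2  = rep(1:2, each = 12),
  month = rep(1:12, times = 2),
  st1   = rep(c(0.5, 0.7, 1.1, 2.6, 1.8, 2.1), times = 4),
  st2   = rep(c(0.2, 0.9, 1.7, 0.8, 1.3, 2.2), times = 4)
)
dat <- full[!(full$ID_2 == 2 & full$month == 8), ]

# Every (ID_1, ID_2, month) row that should exist,
# limited to ID pairs that occur in the data
expected <- expand(dat, nesting(ID_1, ID_2), month = 1:12)

# Keep the expected rows that have no match in dat
missing_months <- anti_join(expected, dat, by = c("ID_1", "ID_2", "month"))
missing_months
```

For the example data this leaves a single row, the one for ID_1 = 1, ID_2 = 2, month = 8.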

If you use tidyr, there is the complete function. You can nest ID_1 and ID_2 if you want both to be treated as grouping variables:
library(tidyr)
df1 = df %>% complete(nesting(ID_1, ID_2), month)

tail(df1)    
# Source: local data frame [6 x 5]

#    ID_1  ID_2 month   st1   st2
#   <int> <int> <int> <dbl> <dbl>
# 1     1     2     7   0.5   0.2
# 2     1     2     8    NA    NA
# 3     1     2     9   1.1   1.7
# 4     1     2    10   2.6   0.8
# 5     1     2    11   1.8   1.3
# 6     1     2    12   2.1   2.2
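
Note that passing the bare column month makes complete use only the month values that appear somewhere in the data. That works here because month 8 is present in the first group. If a month could be absent from every group, supplying the full range explicitly is safer. The sketch below illustrates the difference on a hypothetical variant of the data (dat2, with month 8 removed from both groups); all object names are chosen for illustration.

```r
library(tidyr)

# Example data with all 24 rows
full <- data.frame(
  ID_1  = 1L,
  ID_2  = rep(1:2, each = 12),
  month = rep(1:12, times = 2),
  st1   = rep(c(0.5, 0.7, 1.1, 2.6, 1.8, 2.1), times = 4),
  st2   = rep(c(0.2, 0.9, 1.7, 0.8, 1.3, 2.2), times = 4)
)

# Hypothetical case: month 8 is missing from BOTH groups
dat2 <- full[full$month != 8, ]

# Bare column: only months seen in the data, so nothing is added
observed_only <- complete(dat2, nesting(ID_1, ID_2), month)

# Explicit range: month 8 is inserted for each group with NA statistics
full_range <- complete(dat2, nesting(ID_1, ID_2), month = 1:12)

nrow(observed_only)  # same number of rows as dat2
nrow(full_range)     # one extra row per group
```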

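It's not clear to me what you are looking for. Do you really want a whole new column showing the value of the missing month? What would the other values of that column be for the remaining months? Would they be NA?

The wording in my post was incorrect, and I have edited it. I want to insert a new row wherever a month is missing, with the new row's columns filled with NA (except for the ID columns).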