Dataframe: building averages over specific time periods

The following problem is giving me a real headache.

I have a large dataset like this:

Name   Date         C1   C2    C3    C4    C5    C6   C7
 A     2008-01-03   100
 A     2008-01-05   NA
 A     2008-01-07   120
 A     2008-02-03   NA
 A     2008-03-10   50
 A     2008-07-14   70
 A     2008-07-15   NA
 A     2009-01-03   40
 A     2009-01-05   NA
 A     2010-01-07   NA
 A     2010-03-03   30
 A     2010-03-10   20
 A     2011-07-14   10
 A     2011-07-15   NA
 B     2008-01-03   NA
 B     2008-01-05   5
 B     2008-01-07   3
 B     2008-02-03   11
 B     2008-03-10   13
 B     2008-07-14   ....
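As you can see, there are a lot of NAs in my observations. The other columns look similar, and the dataset has more than 100,000 rows, so it is huge.

What I want to do is aggregate my data in the following way. Taking C1 as an example: I want to build a monthly average for each name, each year and each month between 2000-01 and 2012-12.

The monthly average should be calculated from the dates available in each month.

After the calculation, my dataset should look like this: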

Name   Date         C1          C2    C3    C4    C5    C6   C7
 A     2008-01  monthly average
 A     2008-02  monthly average
 A     2008-03  monthly average
 A     2008-04  monthly average
 A     2008-05  monthly average
 A     2008-06  monthly average
 A     2008-07  monthly average
 A     2008-08  monthly average
 A     2008-09  monthly average
 A     2008-10  monthly average
 A     2008-11  monthly average
 A     2008-12  monthly average
 A     2009-01  monthly average

 B     2008-01  monthly average
 B     2008-02  monthly average
 B     2008-03  monthly average
 B     2008-04  monthly average
 B     2008-05  monthly average
 B     2008-06   ....
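So my output data should show every month of every year for each name. If a month contains only NA values, the result should be NA; otherwise it should be the average of that month's available values.

For example, given these three cases:

   Name    Date       C1
   A     2008-01-03   100
   A     2008-01-05   NA
   A     2008-01-07   120

   Name    Date       C1
   A     2008-01-03   NA
   A     2008-01-05   NA
   A     2008-01-07   NA

   Name    Date       C1
   A     2008-01-03   100
   A     2008-01-05   50
   A     2008-01-07   120

I would expect, respectively:

   Name    Date       C1
   A     2008-01   (100+120)/2 = 110

   Name    Date       C1
   A     2008-01   NA

   Name    Date       C1
   A     2008-01   (100+50+120)/3 = 90
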
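Since I am fairly new to R, I don't know how to solve this. I hope to find someone who can solve it and show me how to approach similar problems.
I really appreciate your support :)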

You can use dplyr::summarise_all to calculate the mean of all columns (C1, C2, etc.).

First group by Name and YearMon, drop the Date column, and then apply summarise_all.

library(dplyr)
library(lubridate)

#Added C2 to demonstrate calculation for multiple columns in one go.
df %>% mutate(Date = ymd(Date), C2 = C1*2) %>%  
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  select(-Date) %>%
  summarise_all("mean", na.rm=TRUE)


#OR - Use summarise_at and calculate mean for all columns starting with 'C'
df %>% mutate(Date = ymd(Date), C2 = C1*2) %>% 
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  summarise_at(vars(starts_with("C")), mean, na.rm=TRUE)

# A tibble: 12 x 4
# Groups: Name [?]
   Name  YearMon     C1     C2
   <chr> <chr>    <dbl>  <dbl>
 1 A     2008-01 110    220   
 2 A     2008-02 NaN    NaN   
 3 A     2008-03  50.0  100   
 4 A     2008-07  70.0  140   
 5 A     2009-01  40.0   80.0 
 6 A     2010-01 NaN    NaN   
 7 A     2010-03  25.0   50.0 
 8 A     2011-07  10.0   20.0 
 9 B     2008-01   4.00   8.00
10 B     2008-02  11.0   22.0 
11 B     2008-03  13.0   26.0 
12 B     2008-07 NaN    NaN 
Data:

df <- read.table(text = 
"Name   Date         C1  
A     2008-01-03   100
A     2008-01-05   NA
A     2008-01-07   120
A     2008-02-03   NA
A     2008-03-10   50
A     2008-07-14   70
A     2008-07-15   NA
A     2009-01-03   40
A     2009-01-05   NA
A     2010-01-07   NA
A     2010-03-03   30
A     2010-03-10   20
A     2011-07-14   10
A     2011-07-15   NA
B     2008-01-03   NA
B     2008-01-05   5
B     2008-01-07   3
B     2008-02-03   11
B     2008-03-10   13
B     2008-07-14   NA",
header = TRUE, stringsAsFactors = FALSE)
df
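
The dplyr result above lists only the months that occur in the data, and months whose values are all NA come out as NaN. The question asks for every month per name, with NA for empty months. The following is a rough sketch of one way to get there, not part of the original answer. It assumes dplyr 1.0 or later (for across) and tidyr are installed. The names all_months and monthly are illustrative, and the month range and data are shortened to keep the example small.

```r
library(dplyr)
library(tidyr)

# Shortened sample data (a subset of the data above)
df <- read.table(text =
"Name   Date         C1
A     2008-01-03   100
A     2008-01-05   NA
A     2008-01-07   120
A     2008-02-03   NA
A     2008-03-10   50
B     2008-01-05   5
B     2008-01-07   3",
header = TRUE, stringsAsFactors = FALSE)

# Every month the result should contain (assumed range; widen it as needed)
all_months <- format(seq(as.Date("2008-01-01"), as.Date("2008-04-01"), by = "month"), "%Y-%m")

monthly <- df %>%
  mutate(YearMon = format(as.Date(Date), "%Y-%m")) %>%
  group_by(Name, YearMon) %>%
  # Mean of the available values for every column starting with "C"
  summarise(across(starts_with("C"), ~ mean(.x, na.rm = TRUE)), .groups = "drop") %>%
  # A month with only NA values yields NaN; turn that into NA
  mutate(across(starts_with("C"), ~ ifelse(is.nan(.x), NA, .x))) %>%
  # Add a row for each Name/month combination missing from the data
  complete(Name, YearMon = all_months)

monthly
```

Rows added by complete have NA in the value columns, so months without any observation and months with only NA observations end up looking the same.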
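library(dplyr)
# `data` is the sample data (Name, Date of class Date, C1)
data %>%
  group_by(Name, Month = cut(Date, "month")) %>%
  summarise(C1 = mean(C1, na.rm = TRUE)) %>%
  mutate(C1 = ifelse(is.nan(C1), NA, C1))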
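
The two answers derive the month differently, one with format and one with cut. As a small illustration (the variable names are made up for this sketch), format returns plain "YYYY-MM" strings, while cut returns a factor labelled with the first day of each month, which can be converted back to a Date:

```r
dates <- as.Date(c("2008-01-03", "2008-01-07", "2008-03-10"))

# format(): character labels such as "2008-01"
ym_chr <- format(dates, "%Y-%m")

# cut(): a factor whose labels are the first day of each month
ym_cut <- cut(dates, "month")

# The factor labels convert back to real Date values
ym_date <- as.Date(as.character(ym_cut))

ym_chr
ym_date
```

Keeping the month as a Date can be handy when the result has to be sorted or joined with other date columns, whereas the string form is mainly useful for display.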

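It would help if you shared your data. See more information here.

Look at the stats package (i.e., no installation needed). After formatting the date as %Y-%m, you will want to aggregate by (Name, Date) and pass mean as the function. This should get you started: aggregate(. ~ Name + MonthDate, data, FUN = mean, na.rm = TRUE).

Why would you make the OP type C1, C2, C3, etc. multiple times?

@MKR You are right, I would go with summarise_all instead of summarise, but I would stick with "cut" rather than "format" to derive the month from the date. That way we can use it to sort and merge with other date objects.

@bli12blu12 Is there non-numeric data in any of the fields C1 to C7? Are empty values stored as the string "NA" rather than as actual missing values? In that case you will need to convert them with something like data$C1[data$C1 == "NA"]

@bli12blu12 Thanks for letting me know. It can easily be adapted to the additional expectation. It would be good to have a new question for that.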
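
The aggregate suggestion from the comments can be sketched in base R as follows. This is an illustration under a few assumptions, not the commenter's exact code. The MonthDate column is created here, and only C1 is aggregated so that the character Date column is not passed to mean. The argument na.action = na.pass is also added, because the formula interface of aggregate drops rows containing NA by default, which would remove months that have only NA values.

```r
# Shortened sample data (a subset of the data above)
df <- read.table(text =
"Name   Date         C1
A     2008-01-03   100
A     2008-01-05   NA
A     2008-01-07   120
A     2008-02-03   NA
A     2008-03-10   50
B     2008-01-05   5
B     2008-01-07   3",
header = TRUE, stringsAsFactors = FALSE)

# Month label used for grouping
df$MonthDate <- format(as.Date(df$Date), "%Y-%m")

# na.pass keeps NA rows so that all-NA months stay in the result;
# na.rm = TRUE is forwarded to mean()
res <- aggregate(C1 ~ Name + MonthDate, data = df, FUN = mean,
                 na.rm = TRUE, na.action = na.pass)

# mean() of an all-NA month is NaN; replace it with NA
res$C1[is.nan(res$C1)] <- NA

res
```

For several value columns, the left-hand side could be written as cbind(C1, C2) instead of listing each column in a separate call.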
C1、C2、C3
等?@MKR你是对的,我会选择summary\u all而不是summary,但我会坚持使用“cut”而不是“format”从日期开始计算月份。这样我们就可以使用它来排序和合并其他日期objects@bli12blu12在C1到C7的任何字段中是否有非数字数据?空值是否存储为“NA”,而不是实际的空值?在这种情况下,您需要使用类似于data$C1[data$C1==“NA”]@bli12blu12的数据来转换它们,谢谢您的通知。它很容易适应额外的期望。很高兴有一个新的问题。