Dataframe: building averages for specific time periods
The following problem is giving me a real headache. I have a large dataset that looks like this:
Name Date C1 C2 C3 C4 C5 C6 C7
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
A 2008-02-03 NA
A 2008-03-10 50
A 2008-07-14 70
A 2008-07-15 NA
A 2009-01-03 40
A 2009-01-05 NA
A 2010-01-07 NA
A 2010-03-03 30
A 2010-03-10 20
A 2011-07-14 10
A 2011-07-15 NA
B 2008-01-03 NA
B 2008-01-05 5
B 2008-01-07 3
B 2008-02-03 11
B 2008-03-10 13
B 2008-07-14 ....
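As you can see, there are a lot of NAs in my observations.
The other columns look similar, and the dataset has more than 100,000 rows, so it is huge.
What I want to do is aggregate my data in the following way.
For example, for C1:
I want to build a monthly average for each name, for each year and month, over the period 2000-01 to 2012-12.
The monthly average should be calculated from the dates available in each month.
After the calculation, my dataset should look like this: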
Name Date C1 C2 C3 C4 C5 C6 C7
A 2008-01 monthly average
A 2008-02 monthly average
A 2008-03 monthly average
A 2008-04 monthly average
A 2008-05 monthly average
A 2008-06 monthly average
A 2008-07 monthly average
A 2008-08 monthly average
A 2008-09 monthly average
A 2008-10 monthly average
A 2008-11 monthly average
A 2008-12 monthly average
A 2009-01 monthly average
B 2008-01 monthly average
B 2008-02 monthly average
B 2008-03 monthly average
B 2008-04 monthly average
B 2008-05 monthly average
B 2008-06 ....
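So my output data should show every name for every month of every year.
If a month contains only NA values, the result should be NA; otherwise it should be the monthly average of the available values.
For example: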
Name Date C1
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
Name Date C1
A 2008-01-03 NA
A 2008-01-05 NA
A 2008-01-07 NA
Name Date C1
A 2008-01-03 100
A 2008-01-05 50
A 2008-01-07 120
For these three cases, I would expect:
Name Date C1
A 2008-01 (100+120)/2 = 110
Name Date C1
A 2008-01 NA
Name Date C1
A 2008-01 (100+50+120)/3 = 90
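Since I am fairly new to R, I do not know how to solve this. I hope to find someone who can solve it and show me how to approach similar problems.
I would really appreciate your support :)
You can use dplyr::summarise_all to calculate the mean of all columns C1, C2, and so on. First group by Name and YearMon, deselect the Date column, and then use summarise_all.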
library(dplyr)
library(lubridate)
#Added C2 to demonstrate calculation for multiple columns in one go.
df %>% mutate(Date = ymd(Date), C2 = C1*2) %>%
group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
select(-Date) %>%
summarise_all("mean", na.rm=TRUE)
#OR - Use summarise_at and calculate mean for all columns starting with 'C'
df %>% mutate(Date = ymd(Date), C2 = C1*2) %>%
group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
summarise_at(vars(starts_with("C")), mean, na.rm=TRUE)
# A tibble: 12 x 4
# Groups: Name [?]
Name YearMon C1 C2
<chr> <chr> <dbl> <dbl>
1 A 2008-01 110 220
2 A 2008-02 NaN NaN
3 A 2008-03 50.0 100
4 A 2008-07 70.0 140
5 A 2009-01 40.0 80.0
6 A 2010-01 NaN NaN
7 A 2010-03 25.0 50.0
8 A 2011-07 10.0 20.0
9 B 2008-01 4.00 8.00
10 B 2008-02 11.0 22.0
11 B 2008-03 13.0 26.0
12 B 2008-07 NaN NaN
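In dplyr 1.0.0 and later, summarise_all() and summarise_at() are superseded by across(). Below is a minimal sketch of the same grouping written with across(), assuming a recent dplyr is installed. The small data frame is illustrative and is not the asker's full data.

```r
library(dplyr)

# Illustrative data in the same layout as the question.
df <- data.frame(
  Name = c("A", "A", "A", "A", "B", "B"),
  Date = as.Date(c("2008-01-03", "2008-01-05", "2008-01-07",
                   "2008-02-03", "2008-01-05", "2008-01-07")),
  C1   = c(100, NA, 120, NA, 5, 3),
  stringsAsFactors = FALSE
)

monthly <- df %>%
  # C2 is added only to show that several columns are handled in one go.
  mutate(YearMon = format(Date, "%Y-%m"), C2 = C1 * 2) %>%
  group_by(Name, YearMon) %>%
  # Apply the NA-ignoring mean to every column whose name starts with "C".
  summarise(across(starts_with("C"), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")

print(monthly)
```

As with the calls above, a month that holds only NA values comes back as NaN rather than NA.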
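The grouped summaries only return the months that actually occur in the data, while the question asks for a row for every name and every month. One possible way to fill in the missing months, sketched here in base R, is to build the full Name by month grid and left-join the monthly means onto it. The helper mean_or_na, the toy data and the three-month range are assumptions made for illustration; for the range in the question the sequence would run from 2000-01-01 to 2012-12-01.

```r
# Toy data in the layout of the question (values are illustrative).
df <- data.frame(
  Name = c("A", "A", "A", "A", "B"),
  Date = as.Date(c("2008-01-03", "2008-01-05", "2008-01-07",
                   "2008-03-10", "2008-02-03")),
  C1   = c(100, NA, 120, 50, 11),
  stringsAsFactors = FALSE
)
df$YearMon <- format(df$Date, "%Y-%m")

# Hypothetical helper: NA (not NaN) when a month has no usable values.
mean_or_na <- function(x) if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)

# na.pass keeps rows with NA so that mean_or_na sees them.
monthly <- aggregate(C1 ~ Name + YearMon, data = df,
                     FUN = mean_or_na, na.action = na.pass)

# Every Name x month combination in the chosen (illustrative) range.
all_months <- format(seq(as.Date("2008-01-01"), as.Date("2008-03-01"),
                         by = "month"), "%Y-%m")
grid <- expand.grid(Name = unique(df$Name), YearMon = all_months,
                    stringsAsFactors = FALSE)

# Left join: months without any observation get NA.
full <- merge(grid, monthly, by = c("Name", "YearMon"), all.x = TRUE)
full <- full[order(full$Name, full$YearMon), ]

print(full)
```

With more value columns, the left-hand side of the formula could become cbind(C1, C2) ~ Name + YearMon.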
Data:
df <- read.table(text =
"Name Date C1
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
A 2008-02-03 NA
A 2008-03-10 50
A 2008-07-14 70
A 2008-07-15 NA
A 2009-01-03 40
A 2009-01-05 NA
A 2010-01-07 NA
A 2010-03-03 30
A 2010-03-10 20
A 2011-07-14 10
A 2011-07-15 NA
B 2008-01-03 NA
B 2008-01-05 5
B 2008-01-07 3
B 2008-02-03 11
B 2008-03-10 13
B 2008-07-14 NA",
header = TRUE, stringsAsFactors = FALSE)
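df
library(dplyr)
# generate sample data
# (data is assumed to hold Name, a Date-class Date column, and C1)
data %>%
  group_by(Name, Month = cut(Date, "month")) %>%
  summarise(C1 = mean(C1, na.rm = TRUE)) %>%
  mutate(C1 = ifelse(is.nan(C1), NA, C1))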
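The final mutate() step is there because mean(x, na.rm = TRUE) returns NaN, not NA, when every value in the group is missing. A small base R sketch of that behaviour follows; the vector names and the to_na helper are made up for illustration.

```r
all_missing    <- c(NA_real_, NA_real_, NA_real_)
partly_missing <- c(100, NA, 120)

# Removing the NAs leaves an empty vector, and the mean of nothing is NaN.
m1 <- mean(all_missing, na.rm = TRUE)
# Only the two available values are averaged.
m2 <- mean(partly_missing, na.rm = TRUE)

# Hypothetical helper that maps NaN back to NA, as the ifelse() call above does.
to_na <- function(x) ifelse(is.nan(x), NA, x)

print(c(m1, m2))
print(to_na(c(m1, m2)))
```

Note that is.na() is TRUE for both NA and NaN, so is.nan() is the test to use when only the NaN case should be replaced.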
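It might help if you shared your data. See more information here.
Have a look at the stats package (i.e., no installation needed). After formatting the date as %Y-%m, you will want to aggregate by (Name, Date) and pass mean as the function. This should get you started: aggregate(. ~ Name + MonthDate, data, FUN = mean, na.rm = TRUE).
Why would you make the OP type C1, C2, C3 etc. multiple times?
@MKR You are right, I would choose summarise_all over summarise, but I would stick with "cut" rather than "format" to derive the month from the date. That way we can use it to sort and to merge with other date objects.
@bli12blu12 Is there non-numeric data in any of the fields C1 to C7? Are empty values stored as the string "NA" rather than as actual missing values? In that case you will need to convert them with something like data$C1[data$C1 == "NA"]
@bli12blu12 Thanks for letting me know. It can easily be adapted to the additional expectation. A new question would be welcome.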
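The aggregate() suggestion from the comments can be sketched as below. One detail worth knowing: the formula method of aggregate() defaults to na.action = na.omit, which drops rows containing NA before grouping, so a month with only NA values disappears from the result. Passing na.action = na.pass keeps it. The data frame and the MonthDate column are illustrative assumptions, and C1 is named explicitly so that the Date column is not averaged as well.

```r
# Illustrative data; MonthDate is the date formatted as %Y-%m.
data <- data.frame(
  Name = c("A", "A", "A", "A", "B", "B"),
  Date = as.Date(c("2008-01-03", "2008-01-05", "2008-01-07",
                   "2008-02-03", "2008-01-05", "2008-01-07")),
  C1   = c(100, NA, 120, NA, 5, 3),
  stringsAsFactors = FALSE
)
data$MonthDate <- format(data$Date, "%Y-%m")

# Default na.action (na.omit): the all-NA month A 2008-02 is dropped.
dropped <- aggregate(C1 ~ Name + MonthDate, data = data,
                     FUN = mean, na.rm = TRUE)

# na.pass keeps every row; the all-NA month is returned as NaN.
kept <- aggregate(C1 ~ Name + MonthDate, data = data,
                  FUN = mean, na.rm = TRUE, na.action = na.pass)

print(dropped)
print(kept)
```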
等?@MKR你是对的,我会选择summary\u all而不是summary,但我会坚持使用“cut”而不是“format”从日期开始计算月份。这样我们就可以使用它来排序和合并其他日期objects@bli12blu12在C1到C7的任何字段中是否有非数字数据?空值是否存储为“NA”,而不是实际的空值?在这种情况下,您需要使用类似于data$C1[data$C1==“NA”]@bli12blu12的数据来转换它们,谢谢您的通知。它很容易适应额外的期望。很高兴有一个新的问题。