R: Subsetting a data frame by characteristics and a function using lapply and which
I have a data frame of 5-dimensional data that looks like this:
> dim(alldata)
[1] 162 6
> head(alldata)
value layer Kmultiplier Resolution Season Variable
1: 0.01308008 b .01K 1km Baseflow Evapotranspiration
2: 0.03974779 b .01K 1km Peak Flow Evapotranspiration
3: 0.02396524 b .01K 1km Summer Flow Evapotranspiration
4: -0.15670996 b .01K 1km Baseflow Discharge
5: 0.06774948 b .01K 1km Peak Flow Discharge
6: -0.04138313 b .01K 1km Summer Flow Discharge
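What I want to do is get the mean of the value column for certain "characteristics" of the data, based on the other columns. So I use which() to subset the data to the variables I am interested in, for example: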
> subset=alldata[which(alldata$Variable=="Discharge" & alldata$Resolution=="1km" & alldata$Season=="Peak Flow"),]
> subset
value layer Kmultiplier Resolution Season Variable
1: 0.067749478 b .01K 1km Peak Flow Discharge
2: 0.058260448 b .1K 1km Peak Flow Discharge
3: -0.223953725 b 10K 1km Peak Flow Discharge
4: 0.272916114 g .01K 1km Peak Flow Discharge
5: 0.240135025 g .1K 1km Peak Flow Discharge
6: -0.216730348 g 10K 1km Peak Flow Discharge
7: 0.088966500 s .01K 1km Peak Flow Discharge
8: -0.018943754 s .1K 1km Peak Flow Discharge
9: -0.008339365 s 10K 1km Peak Flow Discharge
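This is where I am stuck. Say I want a vector or list of the mean of value for each level of the "layer" column, so I would get 3 numbers: one for "b", one for "g", and one for "s". I need to do a whole series of subsets like this, and I think the apply functions could help, but after many tutorials and Stack questions I cannot get it to work. A simple example would do as well, such as this one: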
> A=data.frame(seq(1,9),rep(c("a","b","c"),3),c(rep("type1",3),rep("type2",3),rep("type3",3)),c(rep("place1",2),rep("place2",2),rep("place3",2),rep("place1",2),rep("place2",1)))
> names(A)=c("value","Letter","Type","Place")
> A
value Letter Type Place
1 1 a type1 place1
2 2 b type1 place1
3 3 c type1 place2
4 4 a type2 place2
5 5 b type2 place3
6 6 c type2 place3
7 7 a type3 place1
8 8 b type3 place1
9 9 c type3 place2
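From this simple example, I want the mean of the "value" column, listed by letter, for "place1". It should return something like "a=mean, b=mean, c=mean", in whatever format.
Is this a job for the apply functions? If so, how? If not, please point me to a better way of subsetting the data.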
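Thanks, everyone!
An alternative solution, implemented on the example data set you gave, that does not use any of the apply family of functions. Using the dplyr package: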
library(dplyr)
A %>%
group_by(Place, Letter) %>%  # group_by_(.dots = ...) is deprecated in current dplyr
summarise(MEAN = mean(value))
# Source: local data frame [6 x 3]
# Groups: Place [?]
# Place Letter MEAN
# <fctr> <fctr> <dbl>
# 1 place1 a 4
# 2 place1 b 5
# 3 place2 a 4
# 4 place2 c 6
# 5 place3 b 5
# 6 place3 c 6
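Consider by, the object-oriented wrapper for tapply, which can subset a data frame across one or more factors (here, Place and Letter). The resulting list of data frames can then be row-bound into one final df: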
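# Split A by Place and Letter, adding each group's mean as a new column
dfList <- by(A, A[,c("Place", "Letter")],
FUN = function(i) transform(i, mean = mean(i$value)))
# Row-bind the per-group data frames back together
finaldf <- do.call(rbind, dfList)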
finaldf
# value Letter Type Place mean
# 1 1 a type1 place1 4
# 7 7 a type3 place1 4
# 4 4 a type2 place2 4
# 2 2 b type1 place1 5
# 8 8 b type3 place1 5
# 5 5 b type2 place3 5
# 3 3 c type1 place2 6
# 9 9 c type3 place2 6
# 6 6 c type2 place3 6
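If one row per group is enough, rather than the mean repeated on every original row, base R's aggregate() may be a simpler fit. A minimal sketch on the same example data, rebuilt here so the snippet runs on its own (the name agg is just for illustration):

```r
# Rebuild the example data frame A from the question.
A <- data.frame(
  value  = 1:9,
  Letter = rep(c("a", "b", "c"), 3),
  Type   = rep(c("type1", "type2", "type3"), each = 3),
  Place  = c("place1", "place1", "place2", "place2", "place3",
             "place3", "place1", "place1", "place2"),
  stringsAsFactors = FALSE
)

# One row per Place/Letter combination that actually occurs in the data,
# with the mean of value for that combination.
agg <- aggregate(value ~ Place + Letter, data = A, FUN = mean)
agg
```

The result is an ordinary data frame, so it can be filtered further, for example agg[agg$Place == "place1", ].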
Thanks for the suggestions. I ended up going with ddply, which, in line with general advice, gets the data into a more useful format.
Here is a simple example:
> A=data.frame(seq(1,9),rep(c("a","b","c"),3),c(rep("type1",3),rep("type2",3),rep("type3",3)),c(rep("place1",2),rep("place2",2),rep("place3",2),rep("place1",2),rep("place2",1)))
> names(A)=c("value","Letter","Type","Place")
> A
value Letter Type Place
1 1 a type1 place1
2 2 b type1 place1
3 3 c type1 place2
4 4 a type2 place2
5 5 b type2 place3
6 6 c type2 place3
7 7 a type3 place1
8 8 b type3 place1
9 9 c type3 place2
Here is my code to find the mean of "value" for each value within place1 and type1:
> library(plyr)
> sub=ddply(A[which(A$Place=="place1" & A$Type=="type1"),],"value",summarize,mean=mean(value,na.rm=TRUE))
> sub
value mean
1 1 1
2 2 2
Since "sub" is already a data frame, it is easy to add columns with the other characteristics and to plot the results.
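Note that grouping by "value" returns each value as its own mean. If the per-letter means from the original question are what is wanted, grouping by "Letter" may be closer. A minimal sketch, assuming the plyr package and the same example data (by_letter is a name made up for this illustration):

```r
library(plyr)

# Rebuild the example data frame A from the question.
A <- data.frame(
  value  = 1:9,
  Letter = rep(c("a", "b", "c"), 3),
  Type   = rep(c("type1", "type2", "type3"), each = 3),
  Place  = c("place1", "place1", "place2", "place2", "place3",
             "place3", "place1", "place1", "place2"),
  stringsAsFactors = FALSE
)

# Subset to place1, then take the mean of value within each Letter.
by_letter <- ddply(A[A$Place == "place1", ], "Letter", summarize,
                   mean = mean(value, na.rm = TRUE))
by_letter
#   Letter mean
# 1      a    4
# 2      b    5
```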
---------------------------------------------------------------------------------
If you are interested, here is the more complex data set that I was actually trying to subset:
> head(alldata)
value layer Kmultiplier Resolution Season Variable
1: 0.00000000 b 1 1km Baseflow Evapotranspiration
2: 0.01308008 b .01 1km Baseflow Evapotranspiration
3: 0.00000000 b 1 1km Peak Flow Evapotranspiration
4: 0.03974779 b .01 1km Peak Flow Evapotranspiration
5: 0.00000000 b 1 1km Summer Flow Evapotranspiration
6: 0.02396524 b .01 1km Summer Flow Evapotranspiration
I wrote a few lines of code to subset it into plottable pieces:
for(j in Season){
  for(i in res){
    ET=ddply(alldata[which(alldata$Variable=="Evapotranspiration" & alldata$Resolution==sprintf("%s",i) & alldata$Season==sprintf("%s",j)),],"Kmultiplier", summarize, mean = mean(value,na.rm=T))
    ET$Variable="Evapotranspiration";ET$Resolution=sprintf("%s",i);ET$Season=sprintf("%s",j)
    S=ddply(alldata[which(alldata$Variable=="Change in Storage" & alldata$Resolution==sprintf("%s",i) & alldata$Season==sprintf("%s",j)),],"Kmultiplier", summarize, mean = mean(value,na.rm=T))
    S$Variable="Change in Storage";S$Resolution=sprintf("%s",i);S$Season=sprintf("%s",j)
    Q=ddply(alldata[which(alldata$Variable=="Discharge" & alldata$Resolution==sprintf("%s",i) & alldata$Season==sprintf("%s",j)),],"Kmultiplier", summarize, mean = mean(value,na.rm=T))
    Q$Variable="Discharge";Q$Resolution=sprintf("%s",i);Q$Season=sprintf("%s",j)
    if(i=="1km"){resbind=rbind(Q,S,ET)}else{resbind2=rbind(resbind,Q,S,ET)}
  }
  if(j=="Baseflow"){sbind=rbind(resbind2,Q,S,ET)}else if(j=="Peak Flow"){sbind2=rbind(resbind2,sbind,Q,S,ET)}else{ETSQ=rbind(resbind2,sbind2,Q,S,ET)}
}
ETSQ$Variable=factor(ETSQ$Variable,levels=c("Change in Storage","Evapotranspiration","Discharge"))
print(ggplot(data=ETSQ,aes(x=Kmultiplier,y=mean, color=Variable,group=Variable))
+geom_point()
+geom_line()
+labs(x="K scaled by",y="Percent change from Baseline case")
+scale_y_continuous(labels=percent)
+facet_grid(Season~Resolution)
+theme_bw()
)
ggsave(sprintf("%s/Plots/SimpleLines/Variable_by_K.png",path),device = NULL,scale=1)
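The nested loops above can probably be collapsed into a single grouped summary, since ddply accepts several grouping columns at once. The sketch below assumes the goal is one mean per Kmultiplier within each Variable/Resolution/Season combination. The alldata built here is made-up stand-in data with the same column names (the "500m" resolution and all values are invented), not the real data set:

```r
library(plyr)

# Hypothetical stand-in for alldata: same column names, invented values.
set.seed(1)
alldata <- expand.grid(
  layer       = c("b", "g", "s"),
  Kmultiplier = c(".01K", ".1K", "10K"),
  Resolution  = c("1km", "500m"),
  Season      = c("Baseflow", "Peak Flow", "Summer Flow"),
  Variable    = c("Evapotranspiration", "Change in Storage", "Discharge"),
  stringsAsFactors = FALSE
)
alldata$value <- rnorm(nrow(alldata), sd = 0.1)

# One call instead of the loops: average value over the layers for every
# combination of the four grouping columns.
ETSQ <- ddply(alldata,
              c("Variable", "Resolution", "Season", "Kmultiplier"),
              summarize,
              mean = mean(value, na.rm = TRUE))
head(ETSQ)
```

Because every combination appears exactly once in the result, no rbind bookkeeping is needed, and ETSQ can go straight into the ggplot call with facet_grid(Season~Resolution).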
And finally, the resulting plot:
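For group-wise or level-wise means over a factor column, use the tapply() function.
Yes, this is a job for the *apply functions. As @SowmyaS.Manian said, the first choice would be tapply if you need only one value per group, or ave if you need as many output values as there are rows in the data frame (within each group the values are equal). Finally, someone suggested an underused function, but you run it on a vector rather than on a data frame.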
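To make the difference between the two concrete, here is a minimal sketch on the example data frame A from the question, rebuilt so the snippet runs on its own. The names p1, place1_means, and group_mean are only for illustration:

```r
# Rebuild the example data frame A from the question.
A <- data.frame(
  value  = 1:9,
  Letter = rep(c("a", "b", "c"), 3),
  Type   = rep(c("type1", "type2", "type3"), each = 3),
  Place  = c("place1", "place1", "place2", "place2", "place3",
             "place3", "place1", "place1", "place2"),
  stringsAsFactors = FALSE
)

# tapply: one value per group. Keep the place1 rows, then average
# value within each Letter.
p1 <- A[A$Place == "place1", ]
place1_means <- tapply(p1$value, p1$Letter, mean)
place1_means
# a b
# 4 5

# ave: one value per row. Each row receives the mean of its own
# Place/Letter group, so the result lines up with the data frame.
A$group_mean <- ave(A$value, A$Place, A$Letter)
```

There is no "c" entry in place1_means because no row in the example combines place1 with the letter c.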