Get the sum of every nth column separately and create a new data frame in R
I am posting this question after searching for similar posts. I have monthly rainfall variables for each station over several years, and I need to calculate the monthly mean rainfall across those years. A simple example data frame is given below. I need to create a new data frame that contains the monthly means (12 of them) for each station.
d<-structure(list(ID = structure(1:4, .Label = c("A", "B", "C",
"D"), class = "factor"), X2000_1 = c(25L, 42L, 74L, 52L), X2000_2 = c(15L,
15L, 51L, 12L), X2000_3 = c(14L, 21L, 25L, 41L), X2000_4 = c(74L,
4L, 23L, 51L), X2000_5 = c(15L, 25L, 65L, 12L), X2000_6 = c(31L,
23L, 15L, 25L), X2001_1 = c(52L, 54L, 18L, 63L), X2001_2 = c(85L,
165L, 12L, 12L), X2001_3 = c(25L, 36L, 20L, 14L), X2001_4 = c(1L,
17L, 23L, 52L), X2001_5 = c(24L, 45L, 12L, 15L), X2001_6 = c(3L,
23L, 45L, 52L)), .Names = c("ID", "X2000_1", "X2000_2", "X2000_3",
"X2000_4", "X2000_5", "X2000_6", "X2001_1", "X2001_2", "X2001_3",
"X2001_4", "X2001_5", "X2001_6"), class = "data.frame", row.names = c(NA,
-4L))
The column names of my actual data frame are:
c("est", "X1990_1", "X1990_2", "X1990_3", "X1990_4", "X1990_5",
"X1990_6", "X1990_7", "X1990_8", "X1990_9", "X1990_10", "X1990_11",
"X1990_12", "X1991_1", "X1991_2", "X1991_3", "X1991_4", "X1991_5",
"X1991_6", "X1991_7", "X1991_8", "X1991_9", "X1991_10", "X1991_11",
"X1991_12", "X1992_1", "X1992_2", "X1992_3", "X1992_4", "X1992_5",
"X1992_6", "X1992_7", "X1992_8", "X1992_9", "X1992_10", "X1992_11",
"X1992_12", "X1993_1", "X1993_2", "X1993_3", "X1993_4", "X1993_5",
"X1993_6", "X1993_7", "X1993_8", "X1993_9", "X1993_10", "X1993_11",
"X1993_12", "X1994_1", "X1994_2", "X1994_3", "X1994_4", "X1994_5",
"X1994_6", "X1994_7", "X1994_8", "X1994_9", "X1994_10", "X1994_11",
"X1994_12", "X1995_1", "X1995_2", "X1995_3", "X1995_4", "X1995_5",
"X1995_6", "X1995_7", "X1995_8", "X1995_9", "X1995_10", "X1995_11",
"X1995_12", "X1996_1", "X1996_2", "X1996_3", "X1996_4", "X1996_5",
"X1996_6", "X1996_7", "X1996_8", "X1996_9", "X1996_10", "X1996_11",
"X1996_12", "X1997_1", "X1997_2", "X1997_3", "X1997_4", "X1997_5",
"X1997_6", "X1997_7", "X1997_8", "X1997_9", "X1997_10", "X1997_11",
"X1997_12", "X1998_1", "X1998_2", "X1998_3", "X1998_4", "X1998_5",
"X1998_6", "X1998_7", "X1998_8", "X1998_9", "X1998_10", "X1998_11",
"X1998_12", "X1999_1", "X1999_2", "X1999_3", "X1999_4", "X1999_5",
"X1999_6", "X1999_7", "X1999_8", "X1999_9", "X1999_10", "X1999_11",
"X1999_12", "X2000_1", "X2000_2", "X2000_3", "X2000_4", "X2000_5",
"X2000_6", "X2000_7", "X2000_8", "X2000_9", "X2000_10", "X2000_11",
"X2000_12")
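You can extract the month number from the column names as a variable, split the data frame by columns into a list using that variable, and then use the rowMeans() function to calculate the row means of each sub data frame: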
# extract the months for each column
mon <- sub(".*_(\\d+)$", "\\1", names(d)[-1])
# split the data frame by columns and calculate the rowMeans
cbind.data.frame(d[1], lapply(split.default(d[-1], mon), rowMeans))
# ID 1 2 3 4 5 6
#1 A 38.5 50.0 19.5 37.5 19.5 17.0
#2 B 48.0 90.0 28.5 10.5 35.0 23.0
#3 C 46.0 31.5 22.5 23.0 38.5 30.0
#4 D 57.5 12.0 27.5 51.5 13.5 38.5
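With the real column names, which run from month 1 to month 12, a character mon is sorted alphabetically when the columns are split, so the result columns would come out as 1, 10, 11, 12, 2, and so on. The sketch below is one possible way to keep calendar order and skip NAs at the same time. The toy data frame dd and its values are made up for illustration; only the column naming pattern is taken from the question.

```r
# Hypothetical toy data: 2 stations, 2 years x 12 months, made-up values.
cols <- paste0("X", rep(c(1990, 1991), each = 12), "_", rep(1:12, times = 2))
vals <- matrix(seq_len(2 * 24), nrow = 2, dimnames = list(NULL, cols))
dd <- data.frame(est = c("A", "B"), vals)
dd[1, "X1990_1"] <- NA   # simulate a missing month

# Extract the month and turn it into a factor whose levels are in calendar order.
mon <- sub(".*_(\\d+)$", "\\1", names(dd)[-1])
mon <- factor(as.integer(mon), levels = 1:12)

# Split the columns by month and average across years, ignoring NAs.
res <- cbind.data.frame(
  dd[1],
  lapply(split.default(dd[-1], mon), rowMeans, na.rm = TRUE)
)
names(res)
```

Because the factor levels are given explicitly, the output columns follow 1 to 12 regardless of how the labels would sort as text.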
You can also use reshape() to convert the data set to long format and then tabulate:
tmp <- reshape(d, idvar="ID", sep="_", direction="long", varying=-1)
xtabs(rowMeans(cbind(X2000,X2001)) ~ ID + time, data=tmp)
# time
#ID 1 2 3 4 5 6
# A 38.5 50.0 19.5 37.5 19.5 17.0
# B 48.0 90.0 28.5 10.5 35.0 23.0
# C 46.0 31.5 22.5 23.0 38.5 30.0
# D 57.5 12.0 27.5 51.5 13.5 38.5
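The xtabs() call above names X2000 and X2001 explicitly, so it would need editing for every additional year. A possible alternative, sketched here under the assumption that every data column is named X<year>_<month>, stacks all values into one long column and lets tapply() average by station and month. The reduced data frame d2 (first two months of the example only) and the column names month and rain are illustrative.

```r
# Reduced copy of the example data: first two months of 2000 and 2001.
d2 <- data.frame(
  ID = c("A", "B", "C", "D"),
  X2000_1 = c(25, 42, 74, 52), X2000_2 = c(15, 15, 51, 12),
  X2001_1 = c(52, 54, 18, 63), X2001_2 = c(85, 165, 12, 12)
)

# Stack the data columns: unlist() concatenates them column by column,
# so the IDs repeat once per column and each month label repeats once per row.
long <- data.frame(
  ID    = rep(d2$ID, times = ncol(d2) - 1),
  month = rep(as.integer(sub(".*_", "", names(d2)[-1])), each = nrow(d2)),
  rain  = unlist(d2[-1], use.names = FALSE)
)

# Mean rainfall per station (rows) and month (columns), ignoring NAs.
m <- with(long, tapply(rain, list(ID, month), mean, na.rm = TRUE))
m
```

The result is a matrix rather than a data frame; as.data.frame(m) is one way to convert it if a data frame is required.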
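Assuming the first column is ID and the remaining columns are evenly split between the two years, we can divide the data frame into two halves and take the mean of the two: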
cbind(d[1],(d[2:ceiling(ncol(d)/2)] + d[(ceiling(ncol(d)/2) + 1):ncol(d)])/2)
# ID X2000_1 X2000_2 X2000_3 X2000_4 X2000_5 X2000_6
#1 A 38.5 50.0 19.5 37.5 19.5 17.0
#2 B 48.0 90.0 28.5 10.5 35.0 23.0
#3 C 46.0 31.5 22.5 23.0 38.5 30.0
#4 D 57.5 12.0 27.5 51.5 13.5 38.5
Of course, we could also do this by hard-coding the column numbers:
cbind(d[1],(d[2:7]+d[8:13])/2)
However, the first approach is more general: it still works when there are more than 13 columns, as long as the data columns form two equally sized halves.
Here is an option using Reduce with +:
cbind(d[1], Reduce(`+`, list(d[2:7], d[8:13]))/2)
# ID X2000_1 X2000_2 X2000_3 X2000_4 X2000_5 X2000_6
#1 A 38.5 50.0 19.5 37.5 19.5 17.0
#2 B 48.0 90.0 28.5 10.5 35.0 23.0
#3 C 46.0 31.5 22.5 23.0 38.5 30.0
#4 D 57.5 12.0 27.5 51.5 13.5 38.5
Or simply:
cbind(d[1], (d[2:7] + d[8:13])/2)
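The column ranges 2:7 and 8:13 are specific to two years of six months each. One way to extend the Reduce idea to any number of years is to build the list of yearly blocks from the column names, as sketched below. This assumes every year has the same months in the same column order; note that + propagates NAs, so a station with a missing month gets NA for that month. The data frame d3 and the month_ prefix are made up for illustration.

```r
# Hypothetical data: 2 stations, 3 years x 2 months, made-up values.
d3 <- data.frame(
  ID = c("A", "B"),
  X2000_1 = c(1, 2),  X2000_2 = c(3, 4),
  X2001_1 = c(5, 6),  X2001_2 = c(7, 8),
  X2002_1 = c(9, 10), X2002_2 = c(11, 12)
)

# Group the data columns by the year part of their names.
yr <- sub("^X(\\d+)_.*$", "\\1", names(d3)[-1])
blocks <- split.default(d3[-1], yr)   # one data frame per year

# Add the yearly blocks position by position, then divide by the number of years.
avg <- Reduce(`+`, blocks) / length(blocks)
names(avg) <- sub("^X\\d+_", "month_", names(avg))
res3 <- cbind(d3[1], avg)
res3
```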
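@Psidom When I use the solution on another data frame similar to the example, I get this error: Error in base::rowMeans(x, na.rm = na.rm, dims = dims, ...) : 'x' must be numeric. I believe it has to do with mon, because when mon is called it does not give 1, ..., 6 but the original column names (e.g. X2000_1).
What are the column names of the real data? Also run lapply(d[-1], class) to check whether all columns except ID are numeric.
In my real data, all columns except ID are numeric; I checked. The only difference from this example is that there are NAs in some months.
You can drop the NAs by passing the na.rm argument through lapply, as in cbind.data.frame(d[1], lapply(split.default(d[-1], mon), rowMeans, na.rm = TRUE)), but that does not match the error message. You also said that mon does not give 1, ..., 6; maybe that is the problem. What are the actual column names? Do they perhaps contain more than one underscore?
Ah, I see where the problem is. I should have explained more clearly. The \\d+ is regex notation that stands for digits, i.e. [0-9]; it has nothing to do with the names in the data frame, so it should be kept as is. Please try this instead: mon ...
How can NAs be removed in the calculation here?
@sriya If there are NAs, the solution by Psidom or by LateMail would be better, since rowMeans has the na.rm = TRUE argument.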
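The two checks suggested in the comments can be run together before calling rowMeans(). The sketch below uses a made-up data frame d4 in which one column was read as character, which is one possible cause of the "'x' must be numeric" error; it is not a diagnosis of the asker's actual data.

```r
# Hypothetical data with one column accidentally read as character.
d4 <- data.frame(
  ID = c("A", "B"),
  X2000_1 = c(25, 42),
  X2001_1 = c("52", "n/a"),
  stringsAsFactors = FALSE
)

# 1. The month labels should be plain digits, not the original column names.
#    If the pattern does not match, sub() returns the name unchanged.
mon <- sub(".*_(\\d+)$", "\\1", names(d4)[-1])
all(grepl("^[0-9]+$", mon))

# 2. List the data columns that are not numeric.
is_num <- vapply(d4[-1], is.numeric, logical(1))
bad_cols <- names(d4)[-1][!is_num]
bad_cols
```

Any column listed in bad_cols would need to be converted (or fixed at import) before the row means can be computed.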