R: Calculating differences in a data frame


I have a data frame like the following:

set.seed(50)
data.frame(distance=c(rep("long", 5), rep("short", 5)),
           year=rep(2002:2006),
           mean.length=rnorm(10))

   distance year mean.length
1      long 2002  0.54966989
2      long 2003 -0.84160374
3      long 2004  0.03299794
4      long 2005  0.52414971
5      long 2006 -1.72760411
6     short 2002 -0.27786453
7     short 2003  0.36082844
8     short 2004 -0.59091244
9     short 2005  0.97559055
10    short 2006 -1.44574995

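For each year, I need to compute the difference between the mean length for long and the mean length for short. What is the fastest way to do this?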

Here is one way using plyr:

set.seed(50)
df <- data.frame(distance=c(rep("long", 5),rep("short", 5)),
                 year=rep(2002:2006),
                 mean.length=rnorm(10))

library(plyr)
aggregation.fn <- function(df) {
  data.frame(year=df$year[1],
             diff=(df$mean.length[df$distance == "long"] -
                   df$mean.length[df$distance == "short"]))}
new.df <- ddply(df, "year", aggregation.fn)
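A second way, in base R:

# Sort so that each year's "long" row comes directly before its "short" row
df <- df[order(df$year, df$distance), ]
n <- nrow(df)
# Flag the first row of each year
df$new.year <- c(TRUE, df$year[2:n] != df$year[1:(n-1)])
# Row i minus row i+1, which is long minus short on each year's first row
df$diff <- c(-diff(df$mean.length), NA)
df$diff[!df$new.year] <- NA
new.df.2 <- df[!is.na(df$diff), c("year", "diff")]

all(new.df.2 == new.df)  # TRUE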

Use tapply and apply as follows:

set.seed(50)
x <- data.frame(distance=c(rep("long", 5), rep("short", 5)),
                year=rep(2002:2006),
                mean.length=rnorm(10))

# Mean per year/distance cell, then diff() across each row
apply(
  with(x, tapply(mean.length, list(year, distance), FUN=mean)),
  1,
  diff
)

      2002       2003       2004       2005       2006 
-0.8275344  1.2024322 -0.6239104  0.4514408  0.2818542 

This works because tapply creates a tabular summary by year and distance, with one row per year and one column each for long and short; apply then runs diff on each row. Note that diff returns short minus long, so negate the result if you want long minus short.
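To inspect that intermediate summary yourself, here is a minimal sketch. It assumes the same data as in the question; the names x, tab and long.minus.short are illustrative only.

```r
set.seed(50)
x <- data.frame(distance = c(rep("long", 5), rep("short", 5)),
                year = rep(2002:2006),
                mean.length = rnorm(10))

# One row per year, one column per distance value (columns sorted: long, short)
tab <- with(x, tapply(mean.length, list(year, distance), FUN = mean))
print(tab)

# Subtracting the columns directly gives long minus short without apply()
long.minus.short <- tab[, "long"] - tab[, "short"]
print(long.minus.short)
```

Indexing the columns by name makes the direction of the subtraction explicit, which diff leaves implicit.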

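Since you appear to have paired values and the data.frame is ordered, you can do the following:

res <- with(df, mean.length[distance=="long"]-mean.length[distance=="short"])
names(res) <- unique(df$year)

#     2002       2003       2004       2005       2006 
#0.8275344 -1.2024322  0.6239104 -0.4514408 -0.2818542 

This should be fast, but it is not as safe as the other answers, because it relies on two assumptions: every year has exactly one long and one short row, and the years appear in the same order within both groups.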
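If the ordering assumption is a concern, one possible alternative is to align the two groups by year explicitly with match(). This is only a sketch: the names long, short and res are illustrative, and the rows are shuffled solely to show that the result no longer depends on row order.

```r
set.seed(50)
df <- data.frame(distance = c(rep("long", 5), rep("short", 5)),
                 year = rep(2002:2006),
                 mean.length = rnorm(10))

# Shuffle the rows to demonstrate independence from row order
df <- df[sample(nrow(df)), ]

long  <- df[df$distance == "long", ]
short <- df[df$distance == "short", ]

# Look up each long row's partner by year; a year with no short row yields NA
res <- long$mean.length - short$mean.length[match(long$year, short$year)]
names(res) <- long$year
res <- res[order(names(res))]
print(res)
```

This still assumes at most one row per year and distance; with repeated rows, match() would silently pick the first one.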

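You have already received some good answers for the specific calculation at hand. It may also make sense to reshape your data into wide format. Here are two options: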

reshape(df, direction = "wide", idvar = "year", timevar = "distance")
#---
  year mean.length.long mean.length.short
1 2002       0.54966989        -0.2778645
2 2003      -0.84160374         0.3608284
3 2004       0.03299794        -0.5909124
4 2005       0.52414971         0.9755906
5 2006      -1.72760411        -1.4457499

#package reshape2 is probably easier to use.
library(reshape2)
dcast(year ~ distance, data = df)
#---
  year        long      short
1 2002  0.54966989 -0.2778645
2 2003 -0.84160374  0.3608284
3 2004  0.03299794 -0.5909124
4 2005  0.52414971  0.9755906
5 2006 -1.72760411 -1.4457499

With one column per distance, you can now easily compute new statistics.
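As a rough illustration of that last step, the per-year difference becomes a plain column subtraction once the data is wide. The sketch below uses only base R reshape and assumes the data from the question; the names wide and diff are illustrative.

```r
set.seed(50)
df <- data.frame(distance = c(rep("long", 5), rep("short", 5)),
                 year = rep(2002:2006),
                 mean.length = rnorm(10))

# One row per year, with columns mean.length.long and mean.length.short
wide <- reshape(df, direction = "wide", idvar = "year", timevar = "distance")

# Long minus short for each year
wide$diff <- wide$mean.length.long - wide$mean.length.short
print(wide[, c("year", "diff")])
```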

You can save some typing with ddply(df, .(year), summarise, val=mean.length[distance=='long']-mean.length[distance=='short']).

Cool, that works too. I didn't know about summarise, thanks :)