R: How to extract rows from a large data set by common IDs, take the means of those rows, and make a column with these IDs
I know this is a very silly question, but I can't solve it, which is why I'm asking. How can I extract rows from a large data set by common IDs, take the means of those rows, and build a column that has these IDs as row names?

Tags: r, data.table, plyr
This kind of operation is easy with the plyr function ddply:
library(plyr)  # provides ddply()
dat = data.frame(ID = rep(LETTERS[1:5], each = 20), value = runif(100))
head(dat)
ID value
1 A 0.45800889
2 A 0.11221072
3 A 0.58833532
4 A 0.70056704
5 A 0.08337996
6 A 0.05195357
ddply(dat, .(ID), summarize, mn = mean(value))
ID mn
1 A 0.4960083
2 B 0.5809681
3 C 0.4512388
4 D 0.5079790
5 E 0.5397708
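The question also asks for the IDs as row names. A minimal base R sketch of that last step, assuming data with the same ID and value columns as above; the names means and res are made up for illustration:

```r
# Toy data in the same shape as above: 5 IDs, 20 values each
set.seed(1)
dat <- data.frame(ID = rep(LETTERS[1:5], each = 20), value = runif(100))

# tapply() returns a named array holding the mean of value for each ID
means <- tapply(dat$value, dat$ID, mean)

# One-column data frame whose row names are the IDs
res <- data.frame(mn = as.vector(means), row.names = names(means))
res
```

Individual means can then be looked up by ID, for example res["A", "mn"].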
If your data set is large and/or the number of unique IDs is large, you could use data.table. See the plyr paper for more details on plyr.

If you have a large data.frame, you could use data.table:
set.seed(001)
dat <- data.frame(ID = rep(LETTERS[1:5], each = 20), value = runif(1e6))
library(data.table)
library(plyr) # needed for ddply() in the benchmark below
DT <- data.table(dat)
DT[, mean(value), by=list(ID)] # data.table approach
aggregate(.~ID, data=dat, mean) # aggregate (R Base function) approach
library(rbenchmark) # comparing performance
benchmark(DT[, mean(value), by=list(ID)], # data.table approach
          aggregate(.~ID, data=dat, mean), # aggregate approach
          ddply(dat, .(ID), summarize, mn = mean(value)), # ddply approach (Paul Hiemstra's answer)
          columns=c("test", "replications", "elapsed", "relative"),
          order='relative',
          replications=1)
                                            test replications elapsed relative
1               DT[, mean(value), by = list(ID)]            1    0.14    1.000
3 ddply(dat, .(ID), summarize, mn = mean(value))            1    0.58    4.143
2            aggregate(. ~ ID, data = dat, mean)            1    3.59   25.643
Some alternatives to ddply are aggregate and data.table.
Venables and Ripley (2000, p. 37) suggest that combining unlist, lapply and split is faster than using sapply alone; in this particular example it is even faster than data.table.
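The split / lapply / unlist combination can be sketched as follows. This is an illustration on made-up data of the same shape as above, not code taken from the book; the sapply line is there only to show that both forms return the same named vector, and the names groups, m1 and m2 are hypothetical.

```r
set.seed(1)
dat <- data.frame(ID = rep(LETTERS[1:5], each = 20), value = runif(1e5))

# split() breaks the value column into a list with one numeric vector per ID
groups <- split(dat$value, dat$ID)

# lapply() computes the mean of each group and returns a list;
# unlist() flattens that list into a named numeric vector
m1 <- unlist(lapply(groups, mean))

# The sapply() form produces the same named vector here
m2 <- sapply(groups, mean)

all.equal(m1, m2)
```

Note that the result is a named numeric vector rather than a data.frame or data.table.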
Reference:
Venables, W. N. and Ripley, B. D. (2000). S Programming. Springer. Statistics and Computing.
ISBN 0-387-98966-8 (alk. paper)
Scaling up (edit from Matthew Dowle)
More groups
dat <- data.frame(ID = as.character(as.hexmode(1:2000)), value = runif(1e6))
DT <- as.data.table(dat)
benchmark(
  DT[, mean(value), by=ID],
  aggregate(.~ID, data=dat, mean),
  ddply(dat, .(ID), summarize, mn = mean(value)),
  unlist(lapply(split(dat$value, dat$ID), mean)),
  columns=c("test", "replications", "elapsed", "relative"),
  order='relative',
  replications=3)
                                            test replications elapsed relative
1                     DT[, mean(value), by = ID]            3    0.33    1.000
4 unlist(lapply(split(dat$value, dat$ID), mean))            3    0.41    1.242
2            aggregate(. ~ ID, data = dat, mean)            3    7.69   23.303
3 ddply(dat, .(ID), summarize, mn = mean(value))            3   17.08   51.758
More rows
dat <- data.frame(ID = as.character(as.hexmode(1:2000)), value = runif(1e7))
DT <- as.data.table(dat)
benchmark(
  DT[, mean(value), by=ID],
  aggregate(.~ID, data=dat, mean),
  ddply(dat, .(ID), summarize, mn = mean(value)),
  unlist(lapply(split(dat$value, dat$ID), mean)),
  columns=c("test", "replications", "elapsed", "relative"),
  order='relative',
  replications=3)
                                            test replications elapsed relative
1                     DT[, mean(value), by = ID]            3    3.18    1.000
4 unlist(lapply(split(dat$value, dat$ID), mean))            3    4.26    1.340
2            aggregate(. ~ ID, data = dat, mean)            3   90.28   28.390
3 ddply(dat, .(ID), summarize, mn = mean(value))            3  268.86   84.547
dat <- data.frame(ID = rep(1:2000,each=50000), value = runif(1e8))
DT <- as.data.table(dat)
system.time(setkey(DT,ID))
   user  system elapsed
   2.10    0.25    2.34
object.size(dat)
1.1 Gb # Comfortable for a 64bit PC with 8GB RAM
object.size(DT)
1.1 Gb
benchmark(
  DT[, mean(value), by=ID],
  unlist(lapply(split(dat$value, dat$ID), mean)),
  columns=c("test", "replications", "elapsed", "relative"),
  order='relative',
  replications=3)
                                            test replications elapsed relative
1                     DT[, mean(value), by = ID]            3    7.30    1.000
2 unlist(lapply(split(dat$value, dat$ID), mean))            3  184.83   25.319
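Comments:
Thank you very much for your help, Paul Hiemstra, and thanks for the paper recommendation. I will also take care to accept answers from now on. Sorry that I did not accept the earlier answers, even though they all worked well.
No problem! Just mentioning it since it came up :)
+1 too. But 0.06 vs 0.10 on a single run is neither significant nor robust. The unlist, lapply and split approach does not scale. It also returns a different result.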
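The remark about single-run timings can be made concrete. A minimal base R sketch, assuming data of the same shape as in the benchmarks above, that repeats each timing several times and reports the median; time_median is a hypothetical helper written for this illustration, not part of rbenchmark:

```r
set.seed(1)
dat <- data.frame(ID = rep(LETTERS[1:5], each = 20), value = runif(1e5))

# Hypothetical helper: call f() `reps` times and return the median
# elapsed wall-clock time in seconds
time_median <- function(f, reps = 5) {
  elapsed <- vapply(seq_len(reps),
                    function(i) system.time(f())[["elapsed"]],
                    numeric(1))
  median(elapsed)
}

t_split <- time_median(function() unlist(lapply(split(dat$value, dat$ID), mean)))
t_aggr  <- time_median(function() aggregate(value ~ ID, data = dat, mean))
c(split = t_split, aggregate = t_aggr)
```

The replications argument of benchmark(), already used in the answers above, serves the same purpose when set higher than 1.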