R - Sum multiple columns separately for each variable
I have a data frame with 52 columns and about 850,000 rows. The first 50 columns are all coded Yes/No. The last 2 columns are numeric. My goal is to sum columns 51 and 52 for each of the 50 variables. In other words, group by column 1 and sum columns 51 and 52, group by column 2 and sum columns 51 and 52, and so on. I'm just wondering what the best way to do this is.

Below is a fake data example. In the data below, val1 and val2 are analogous to columns 51 and 52, while X1 through X5 are analogous to the 50 grouping columns. To get the sums of val1 and val2, we melt the data into long format so that the X1 through X5 columns become "stacked". Then we can easily group the data and produce the sums.
library(dplyr)
library(reshape2)
# Fake data
set.seed(5)
dat = data.frame(replicate(5,sample(c("Yes","No"),20,replace=TRUE)),
val1=rnorm(20), val2=rnorm(20))
Here is an approach using apply and tapply:
set.seed(123)
d <- data.frame(replicate(5, sample(0:1, 100, replace=TRUE)),
replicate(2, rnorm(100)))
names(d) <- c(paste("col", 1:5), "x", "y")
out <- t(apply(d[,1:5], MARGIN=2, function(z) {
c(x=tapply(d$x, z, sum), y=tapply(d$y, z, sum))
}))
out
# x.0 x.1 y.0 y.1
# col 1 2.319715 10.255528 -3.623171 -3.3820568
# col 2 4.385023 8.190221 -9.456567 2.4513395
# col 3 6.576423 5.998820 3.154456 -10.1596830
# col 4 8.063604 4.511640 3.879003 -10.8842309
# col 5 7.140356 5.434888 -6.413942 -0.5912855
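As a possible variation on the apply/tapply idea above, base R's rowsum() can sum several numeric columns at once within the levels of a single grouping vector, so looping over the grouping columns with lapply() gives one small table per grouping column. This is only a sketch: the names group_cols, value_cols and sums_by_col are made up for illustration, and the column names use paste0() (no space) so they can be accessed with `$`.

```r
set.seed(123)
d <- data.frame(replicate(5, sample(0:1, 100, replace = TRUE)),
                replicate(2, rnorm(100)))
names(d) <- c(paste0("col", 1:5), "x", "y")

group_cols <- paste0("col", 1:5)  # stand-ins for the 50 Yes/No columns
value_cols <- c("x", "y")         # stand-ins for columns 51 and 52

# For each grouping column, sum x and y within each of its levels.
# rowsum() returns one row per level, with the level as the row name.
sums_by_col <- lapply(d[group_cols], function(g) {
  rowsum(d[value_cols], group = g)
})

# Sums of x and y for the two levels of col1
sums_by_col$col1
```

For the real data, group_cols would presumably be names(dat)[1:50] and value_cols would be names(dat)[51:52].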
A similar data.table approach:
set.seed(1)
df <- data.frame(replicate(5, sample(c("yes", "no"), 20, replace=TRUE)),
col1 = rnorm(20), col2 = rnorm(20))
library(data.table)
# Convert from wide to long
df1 <- melt(setDT(df), id.vars = c("col1","col2"))
# Calculate the sum for the last 2 columns separately
df2 <- df1[ , lapply(.SD, sum) , by = .(variable, value)]
# Convert back to wide format
dcast(df2, value ~ variable, value.var = c("col1", "col2"))
# value col1_X1 col1_X2 col1_X3 col1_X4 col1_X5 col2_X1 col2_X2 col2_X3 col2_X4 col2_X5
#1: no 2.130194 -0.936481 4.425493 1.322399 2.942901 2.398278 3.385414 -2.1045187 0.5314497 -1.18833735
#2: yes 3.816474 6.883149 1.521175 4.624269 3.003767 -3.602036 -4.589172 0.9007601 -1.7352083 -0.01542122
# Calculate the sum for the last 2 columns together
df2 <- df1[ , sum(unlist(.SD)) , by = .(variable, value)]
dcast(df2, value ~ variable, value.var = "V1")
# value X1 X2 X3 X4 X5
#1: no 4.5284717 2.448933 2.320974 1.853849 1.754564
#2: yes 0.2144379 2.293977 2.421935 2.889061 2.988346
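Can you provide a small sample of your data? If your data frame is called dat, paste in the output of dput(dat[1:10, c(1:5, 51:52)]).
You can use fun.aggregate instead of building the intermediate table df2: dcast(df1, value ~ variable, value.var = c("col1", "col2"), fun = sum). Unfortunately, this puts a _sum_ in all of the column names, though.
@Frank, yes. In the second case I couldn't figure out how to do it, so I chose to do it in two steps. Any ideas?
Oh, I think the only reason it's difficult is that df1 still isn't in a "tidy" format (that is, it needs to be melted again): dcast(melt(df1, id = c("variable", "value")), value ~ variable, value.var = "value.1", fun = sum)
Maybe you've seen it before, but here is a reference on "tidy" data:
Interesting! I would never have thought to melt it again!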
# Sums of val1 and val2, separately, by cols and group (output)
cols group val1 val2
1 X1 No -0.4959896 0.1546875
2 X1 Yes -3.0714078 1.7631670
3 X2 No -0.6323905 1.0422942
4 X2 Yes -2.9350069 0.8755603
5 X3 No 1.7915356 0.9180840
6 X3 Yes -5.3589330 0.9997705
7 X4 No 1.3502926 -1.4184550
8 X4 Yes -4.9176900 3.3363096
9 X5 No 0.7452743 -0.5833465
10 X5 Yes -4.3126717 2.5012010
# Sum of val1 + val2 by group
dat %>%
# Convert to long format
melt(id.var=c("val1","val2"), variable.name="cols", value.name="group") %>%
# Sum val1 and val2 by cols and group
group_by(cols, group) %>%
summarise(sum = sum(val1 + val2))
cols group sum
1 X1 No -0.3413021
2 X1 Yes -1.3082407
3 X2 No 0.4099037
4 X2 Yes -2.0594465
5 X3 No 2.7096196
6 X3 Yes -4.3591625
7 X4 No -0.0681624
8 X4 Yes -1.5813804
9 X5 No 0.1619278
10 X5 Yes -1.8114707
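The melt-then-group idea above can also be sketched in base R, without dplyr or reshape2, which may help if those packages are not available. This is an illustration under assumptions, not the answer's own code: group_cols, long and sums are hypothetical names, and the per-column sums are computed with aggregate(). No output is shown because the random draws depend on the R version.

```r
set.seed(5)
dat <- data.frame(replicate(5, sample(c("Yes", "No"), 20, replace = TRUE)),
                  val1 = rnorm(20), val2 = rnorm(20))

group_cols <- names(dat)[1:5]  # X1..X5; would be columns 1:50 in the real data

# "Melt" by hand: stack the grouping columns so that each original row
# appears once per grouping column, carrying its val1 and val2 along.
long <- data.frame(
  cols  = rep(group_cols, each = nrow(dat)),
  group = unlist(lapply(dat[group_cols], as.character), use.names = FALSE),
  val1  = rep(dat$val1, times = length(group_cols)),
  val2  = rep(dat$val2, times = length(group_cols))
)

# Sum val1 and val2 separately for every (grouping column, Yes/No) pair
sums <- aggregate(cbind(val1, val2) ~ cols + group, data = long, FUN = sum)
sums <- sums[order(sums$cols, sums$group), ]
sums
```

With 50 grouping columns and about 850,000 rows the stacked table has roughly 42.5 million rows, so a per-column loop or the data.table version may be lighter on memory.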
# Result 1
df1 <- melt(setDT(df), id.vars = c("col1","col2"))
dcast(df1, value ~ variable, value.var = c("col1", "col2"), fun = sum)
# Result 2
df1 <- melt(setDT(df), id.vars = c("col1","col2"))
dcast(melt(df1, id = c("variable", "value")), value ~ variable,
value.var = "value.1", fun = sum)