R: Divide groups in a data frame by the median of a separate group
I have a data frame with a group_ID column and a class column, plus several numeric features and some character metadata, i.e.:
group_ID  class  var1  var2  var3  metadata
a         foo    1     324   3     cat
a         bar    1.3   34    53    dog
a         baz    31    34    5     elephant
b         foo    34    34    943   dolphin
b         bar    94    51    23    chipmunk
b         baz    985   595   43    badger
c         foo    43    93    23    tapir
c         bar    43    23    23    monkey
c         baz    40    53    512   duck
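For each group_ID I want to compute the median of class foo, and then divide every row by the median that matches its group_ID.
In this example there is only one foo row per group, so the median equals the original value, but in the real data there are many rows for each class and group_ID.
Is there a simple way to do this? My best attempt so far involved creating a separate data frame of the foo medians, then splitting by group_ID and sweeping inside a horrible loop, but I end up losing the metadata column. This seems like a fairly common thing to do, so I'm sure I'm missing something.
Any help would be much appreciated.
We can use mutate_each from dplyr and divide each column by the median of its values where class is "foo":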
library(dplyr)
# Within each group_ID, divide var1:var3 by the median of the "foo" rows
df %>% group_by(group_ID) %>%
  mutate_each(funs(./median(.[class == "foo"])), var1:var3)
# Source: local data frame [9 x 6]
# Groups: group_ID
#
# group_ID class var1 var2 var3 metadata
# 1 a foo 1.0000000 1.0000000 1.00000000 cat
# 2 a bar 1.3000000 0.1049383 17.66666667 dog
# 3 a baz 31.0000000 0.1049383 1.66666667 elephant
# 4 b foo 1.0000000 1.0000000 1.00000000 dolphin
# 5 b bar 2.7647059 1.5000000 0.02439024 chipmunk
# 6 b baz 28.9705882 17.5000000 0.04559915 badger
# 7 c foo 1.0000000 1.0000000 1.00000000 tapir
# 8 c bar 1.0000000 0.2473118 1.00000000 monkey
# 9 c baz 0.9302326 0.5698925 22.26086957 duck
If the OP wants to add these as new/additional columns and keep the original data unchanged, the approach above can be modified to:
df %>%
  group_by(group_ID) %>%
  mutate_each(funs(./median(.[class == "foo"])), setNames(var1:var3, paste0("varN", 1:3)))
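mutate_each() and funs() are deprecated in later dplyr releases, so here is a minimal sketch of what the same idea might look like with across(), assuming dplyr 1.0.0 or newer. The data frame is rebuilt by hand from the question's sample, and the "_scaled" suffix is only an illustrative naming choice, not something from the original answer.

```r
library(dplyr)

# Sample data from the question, rebuilt by hand.
DF <- data.frame(
  group_ID = rep(c("a", "b", "c"), each = 3),
  class    = rep(c("foo", "bar", "baz"), times = 3),
  var1     = c(1, 1.3, 31, 34, 94, 985, 43, 43, 40),
  var2     = c(324L, 34L, 34L, 34L, 51L, 595L, 93L, 23L, 53L),
  var3     = c(3L, 53L, 5L, 943L, 23L, 43L, 23L, 23L, 512L),
  metadata = c("cat", "dog", "elephant", "dolphin", "chipmunk",
               "badger", "tapir", "monkey", "duck"),
  stringsAsFactors = FALSE
)

# Overwrite var1:var3 with the values divided by the per-group "foo" median.
scaled <- DF %>%
  group_by(group_ID) %>%
  mutate(across(var1:var3, ~ .x / median(.x[class == "foo"]))) %>%
  ungroup()

# Keep the original columns and add the scaled values as new columns
# (the "_scaled" suffix is an assumption made for this sketch).
added <- DF %>%
  group_by(group_ID) %>%
  mutate(across(var1:var3, ~ .x / median(.x[class == "foo"]),
                .names = "{.col}_scaled")) %>%
  ungroup()
```

Because the division happens inside a grouped mutate(), the metadata column is carried along untouched in both variants.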
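Here is a data.table solution. We convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'group_ID', loop over the subset of columns whose names start with 'var' (we subset using grep), and divide each column by the median of the part of that column that corresponds to the "foo" value in 'class'. The result can be assigned (:=) to new columns, or assigned back to the same columns to replace the originals. One catch with replacing the original columns is that the class of the original column has to match the class of the replacement. If the original 'var' columns are numeric, this works as is, because the median calculation and the division produce numeric values. If the original columns are of class integer, one possible option is to change their class to numeric first and then assign.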
library(data.table)
# Add varN1..varN3: each 'var' column divided by its per-group "foo" median
setDT(df)[, paste0("varN", 1:3) := lapply(.SD[,
    grep("^var", names(.SD)), with=FALSE],
    function(x) x/median(x[class=="foo"])), group_ID]
df
df
# group_ID class var1 var2 var3 metadata varN1 varN2 varN3
#1: a foo 1.0 324 3 cat 1.0000000 1.0000000 1.00000000
#2: a bar 1.3 34 53 dog 1.3000000 0.1049383 17.66666667
#3: a baz 31.0 34 5 elephant 31.0000000 0.1049383 1.66666667
#4: b foo 34.0 34 943 dolphin 1.0000000 1.0000000 1.00000000
#5: b bar 94.0 51 23 chipmunk 2.7647059 1.5000000 0.02439024
#6: b baz 985.0 595 43 badger 28.9705882 17.5000000 0.04559915
#7: c foo 43.0 93 23 tapir 1.0000000 1.0000000 1.00000000
#8: c bar 43.0 23 23 monkey 1.0000000 0.2473118 1.00000000
#9: c baz 40.0 53 512 duck 0.9302326 0.5698925 22.26086957
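The point about integer columns can be sketched as follows for the replace-in-place case. This is only an illustration under stated assumptions: the data is rebuilt by hand from the question's sample, dt and vars are hypothetical names, and .SDcols is used here in place of the grep-inside-.SD subsetting from the answer above.

```r
library(data.table)

# Sample data from the question, rebuilt by hand (var2 and var3 are integer).
dt <- data.table(
  group_ID = rep(c("a", "b", "c"), each = 3),
  class    = rep(c("foo", "bar", "baz"), times = 3),
  var1     = c(1, 1.3, 31, 34, 94, 985, 43, 43, 40),
  var2     = c(324L, 34L, 34L, 34L, 51L, 595L, 93L, 23L, 53L),
  var3     = c(3L, 53L, 5L, 943L, 23L, 43L, 23L, 23L, 512L),
  metadata = c("cat", "dog", "elephant", "dolphin", "chipmunk",
               "badger", "tapir", "monkey", "duck")
)

# Columns whose names start with "var".
vars <- grep("^var", names(dt), value = TRUE)

# Step 1: convert the columns to numeric so that the ratios assigned in
# step 2 are not squeezed back into integer columns.
dt[, (vars) := lapply(.SD, as.numeric), .SDcols = vars]

# Step 2: overwrite each column with its value divided by the per-group
# median of the "foo" rows.
dt[, (vars) := lapply(.SD, function(x) x / median(x[class == "foo"])),
   by = group_ID, .SDcols = vars]
```

Doing the type conversion as a separate, ungrouped step keeps the grouped assignment simple; whether that is preferable to adding new columns depends on whether the original values are still needed.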
1) by. Here is a base R solution:
do.call("rbind", by(DF, DF$group_ID, function(d)
data.frame(d, sapply(d[3:5], function(x) x / median(x[d$class == "foo"])))
))
giving:
group_ID class var1 var2 var3 metadata var1.1 var2.1 var3.1
a.1 a foo 1.0 324 3 cat 1.0000000 1.0000000 1.00000000
a.2 a bar 1.3 34 53 dog 1.3000000 0.1049383 17.66666667
a.3 a baz 31.0 34 5 elephant 31.0000000 0.1049383 1.66666667
b.4 b foo 34.0 34 943 dolphin 1.0000000 1.0000000 1.00000000
b.5 b bar 94.0 51 23 chipmunk 2.7647059 1.5000000 0.02439024
b.6 b baz 985.0 595 43 badger 28.9705882 17.5000000 0.04559915
c.7 c foo 43.0 93 23 tapir 1.0000000 1.0000000 1.00000000
c.8 c bar 43.0 23 23 monkey 1.0000000 0.2473118 1.00000000
c.9 c baz 40.0 53 512 duck 0.9302326 0.5698925 22.26086957
2) by/sweep. An alternative that uses sweep, again with base functions only:
do.call("rbind", by(DF, DF$group_ID, function(d) {
med <- apply(subset(d, class == "foo")[3:5], 2, median)
data.frame(d, sweep(as.matrix(d[3:5]), 2, med, "/"))
}))
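The question mentions building a separate data frame of foo medians and then losing the metadata column. A possible way to keep that idea without splitting the data is a lookup with match(), sketched below in base R. This is not one of the original answers; the data is rebuilt by hand from the question's sample, and vars, foo, med, idx and out are names made up for the illustration.

```r
# Sample data from the question, rebuilt by hand.
DF <- data.frame(
  group_ID = rep(c("a", "b", "c"), each = 3),
  class    = rep(c("foo", "bar", "baz"), times = 3),
  var1     = c(1, 1.3, 31, 34, 94, 985, 43, 43, 40),
  var2     = c(324L, 34L, 34L, 34L, 51L, 595L, 93L, 23L, 53L),
  var3     = c(3L, 53L, 5L, 943L, 23L, 43L, 23L, 23L, 512L),
  metadata = c("cat", "dog", "elephant", "dolphin", "chipmunk",
               "badger", "tapir", "monkey", "duck"),
  stringsAsFactors = FALSE
)

vars <- c("var1", "var2", "var3")

# One row of medians per group, computed from the "foo" rows only.
foo <- DF[DF$class == "foo", ]
med <- aggregate(foo[vars], by = list(group_ID = foo$group_ID), FUN = median)

# For every row of DF, find the row of 'med' that belongs to its group.
idx <- match(DF$group_ID, med$group_ID)

# Divide element-wise; all other columns (including metadata) are untouched.
out <- DF
out[vars] <- DF[vars] / med[idx, vars]
```

A group that has no foo rows gets NA from match(), so its scaled values come out as NA rather than raising an error.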
Note: The input DF in reproducible form:
DF <- structure(list(group_ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), class = structure(c(3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L), .Label = c("bar", "baz", "foo"
), class = "factor"), var1 = c(1, 1.3, 31, 34, 94, 985, 43, 43,
40), var2 = c(324L, 34L, 34L, 34L, 51L, 595L, 93L, 23L, 53L),
var3 = c(3L, 53L, 5L, 943L, 23L, 43L, 23L, 23L, 512L), metadata = structure(c(2L,
4L, 7L, 5L, 3L, 1L, 9L, 8L, 6L), .Label = c("badger", "cat",
"chipmunk", "dog", "dolphin", "duck", "elephant", "monkey",
"tapir"), class = "factor")), .Names = c("group_ID", "class",
"var1", "var2", "var3", "metadata"), class = "data.frame", row.names = c(NA,
-9L))
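This works for me, but you should add the expected result for this example to be sure: DF %>% group_by(group_ID) %>% mutate_each(funs(./median(.[class == "bar"])), var1:var3)
Thanks @docendodiscimus
Thank you, that looks like it should do it. Not surprising that there is a dplyr solution.
@PierreLafortune I forgot to use .SD together with grep earlier, and I was a bit busy with a meeting. Think about it statistically: if you get 100 upvotes and one downvote, it is not statistically significant. :) Thanks for the comments.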