R: data.table, weighted means of multiple variables by group, with multiple weight variables per variable
I am new to data.table. My question is similar to some earlier ones. The difference is that I want to compute weighted means of multiple variables by group, but using more than one weight for each mean.
Consider a data.table mydata (the actual one is much larger) with the grouping columns ITNUM, SATS and ASSETS, the value columns AVGVALUE1 and AVGVALUE2, and the weight columns WGT1, WGT2 and WGT3.
I can compute the weighted mean of a single variable with each of the weights by group. I added the first key variable (CLID) to by, even though it is a constant, because I want it as a column in the output. I get:
   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458
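For readers who want something to run, here is a minimal sketch of that single-variable step. The table below is a made-up stand-in: the values, the seed and the row layout are assumptions, so the numbers will not match the output above. Only the column names come from the question.

```r
library(data.table)

set.seed(1)  # arbitrary seed (assumption); results differ from the question's output

# Hypothetical stand-in for mydata: 10 rows forming 5 groups of 2 rows each
mydata <- data.table(
  CLID      = "CNK",
  ITNUM     = rep(c("First", "Second"), times = c(6, 4)),
  SATS      = rep(c("Always", "Never", "Sometimes", "Always", "Never"), each = 2),
  ASSETS    = rep(c("0-10", "101-200", "26-100", "MORE THAN 200", "11-25"), each = 2),
  AVGVALUE1 = rnorm(10, 10, 2),
  AVGVALUE2 = rnorm(10, 10, 2),
  WGT1      = runif(10, 1, 5),
  WGT2      = runif(10, 1, 5),
  WGT3      = runif(10, 1, 5)
)
setkey(mydata, CLID)

split.vars  <- c("ITNUM", "SATS", "ASSETS")
all.weights <- c("WGT1", "WGT2", "WGT3")

# One variable, three weights: Map() recycles the single x against each weight column.
# The unnamed results become V1, V2, V3.
one.var <- mydata[, Map(f = weighted.mean,
                        x = list(AVGVALUE1),
                        w = mget(all.weights),
                        na.rm = TRUE),
                  by = c(key(mydata)[1], split.vars)]
print(one.var)
```

V1, V2 and V3 hold the mean of AVGVALUE1 weighted by WGT1, WGT2 and WGT3 respectively.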
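For the actual data.table, however, I have many more columns to compute weighted means for (and more weights), so doing this one variable at a time would be cumbersome. What I have in mind is a function in which the mean of each variable (AVGVALUE1, AVGVALUE2 and so on) is computed with each weight variable (WGT1, WGT2, WGT3 and so on), and the output for each variable is added to a list. I think a list is the best option, because if all the estimates were in the same output the number of columns could grow without limit. For example: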
[[1]]
   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458

[[2]]
   CLID  ITNUM         SATS        ASSETS        V1        V2        V3
1:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
2:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
3:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
4:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
5:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928
What I have tried so far:
Using lapply:
all.weights <- c("WGT1", "WGT2", "WGT3")
avg.vars <- c("AVGVALUE1", "AVGVALUE2")
split.vars <- c("ITNUM", "SATS", "ASSETS")
lapply(mydata, function(i) {
  mydata[ , Map(f = weighted.mean, x = mget(avg.vars)[i], w = mget(all.weights),
                na.rm = TRUE), by = c(key(mydata)[1], split.vars)]
})
Error in weighted.mean.default(x = dots[[1L]][[1L]], w = dots[[2L]][[1L]], :
'x' and 'w' must have the same length
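One thing worth checking in the attempt above, offered as a reading of the code rather than a confirmed diagnosis: lapply() over a data.table visits its columns, so i is a whole column vector and not a variable name. A small sketch on a made-up two-column table (dt, its columns and its values are hypothetical):

```r
library(data.table)

dt <- data.table(g = c("a", "b"), x = c(1.5, 2.5))  # toy table, not the question's data

# lapply() over a data.table iterates over columns: i is each column in turn
visited <- lapply(dt, function(i) class(i))
print(visited)

# Iterating over column NAMES instead gives one result per variable
by.name <- lapply(c("x"), function(v) dt[, mean(get(v))])
print(by.name)
```

Iterating over the names in avg.vars, rather than over mydata itself, is the pattern the answers below use.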
Using mapply:
myfun <- function(data, spl.v, avg.v, wgts) {
  data[ , Map(f = weighted.mean, x = mget(avg.v), w = mget(all.weights),
              na.rm = TRUE), by = c(key(data)[1], spl.v)]
}
mapply(FUN = myfun, data = mydata, spl.v = split.vars, avg.v = avg.vars,
       wgts = all.weights)
Error: value for ‘AVGVALUE2’ not found
I tried wrapping mget(avg.v) in a list, as .(mget(avg.v)), but then I get another error:
Error in mapply(FUN = f, ..., SIMPLIFY = FALSE) :
could not find function "."
Can anyone help?

I. lapply solution
all.weights <- c("WGT1", "WGT2", "WGT3")
avg.vars <- c("AVGVALUE1", "AVGVALUE2")
split.vars <- c("ITNUM", "SATS", "ASSETS")
myfun <- function(avg.vars) {
  tmp <-
    mydata[ , Map(f = weighted.mean,
                  x = .(get(avg.vars)),
                  w = mget(all.weights),
                  na.rm = TRUE),
            by = c(key(mydata)[1], split.vars)]
  return(tmp) # totally optional, a habit from using C and Java
}
lapply(avg.vars, myfun)
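A possible variation on the lapply approach, sketched on a small made-up table so that the results can be checked by hand. The table, the single grouping column GRP and the helper name wmean_by_group are assumptions for illustration. Two changes are made on purpose: the weights are passed as the first argument to Map(), so the output columns take the weight names instead of V1, V2, ..., and the input vector is named, so the resulting list is named by variable.

```r
library(data.table)

# Toy data (hypothetical), chosen so the weighted means are easy to verify
mydata <- data.table(
  CLID      = "CNK",
  GRP       = rep(c("a", "b"), each = 3),
  AVGVALUE1 = c(1, 2, 3, 4, 5, 6),
  AVGVALUE2 = c(6, 5, 4, 3, 2, 1),
  WGT1      = c(1, 1, 1, 1, 1, 1),
  WGT2      = c(1, 2, 3, 1, 2, 3)
)
setkey(mydata, CLID)

avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
all.weights <- c("WGT1", "WGT2")

# Hypothetical helper: one table per variable, one column per weight.
# Map() names its result after its first list argument, here the weights.
wmean_by_group <- function(v) {
  mydata[, Map(function(w, x) weighted.mean(x, w, na.rm = TRUE),
               mget(all.weights),
               list(get(v))),
         by = c(key(mydata)[1], "GRP")]
}

# A named input vector gives a named list of results
res <- lapply(setNames(avg.vars, avg.vars), wmean_by_group)
print(res)
```

With this layout, res$AVGVALUE1$WGT2 is the mean of AVGVALUE1 weighted by WGT2 in each group.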
II. for loop solution
Using a simple for loop, for example when avg.vars has 2 values:
all.weights <- c("WGT1", "WGT2", "WGT3")
avg.vars <- c("AVGVALUE1", "AVGVALUE2")
split.vars <- c("ITNUM", "SATS", "ASSETS")
# empty 7-column frame: 4 grouping columns plus one column per weight
result <- data.frame(matrix(nrow = 0, ncol = 7))
for (i in avg.vars) {
  tmp <-
    mydata[ , Map(f = weighted.mean,
                  x = .(get(i)),
                  w = mget(all.weights),
                  na.rm = TRUE),
            by = c(key(mydata)[1], split.vars)]
  # stack the per-variable tables by position, not by name
  result <- rbind(result, tmp, use.names = FALSE)
}
colnames(result) <- c("CLID", "ITNUM", "SATS", "ASSETS", "V1", "V2", "V3")
result
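In a stacked result like this, nothing records which variable each row came from. One way to keep that information, sketched under the assumption that the per-variable tables are held in a named list, is rbindlist() with its idcol argument. The pieces list and its values below are made up for illustration.

```r
library(data.table)

# Hypothetical per-variable results, named by the variable they summarise
pieces <- list(
  AVGVALUE1 = data.table(GRP = c("a", "b"), V1 = c(2, 5)),
  AVGVALUE2 = data.table(GRP = c("a", "b"), V1 = c(5, 2))
)

# idcol adds a leading column filled with the list names
stacked <- rbindlist(pieces, idcol = "variable")
print(stacked)
```

The same call should work on the list returned by lapply() over a named vector of variable names.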
Pros:
- Finishes instantly on the example
- Scales to any number of columns with no extra data manipulation or coding
- Saves a lot of time compared with doing the variables one by one
- Returns a nice data.table
- If you really want a list, you can get one by initializing result as a list instead

We can use outer (which applies a function to all combinations of the values in two input vectors) to get this, operating on a vectorized weighted-mean function. By defining the function used by outer inside the data.table scope, we can have get evaluate on the data.table columns:
wmeans = mydata[, {
  f  = function(X, Y) weighted.mean(get(X), get(Y))
  vf = Vectorize(f)
  outer(avg.vars, all.weights, vf)
}, by = split.vars]
This puts all the means into a single column (i.e. "long" format). We can also add more columns to specify which value/weight combination each row refers to:
# rep() makes the recycling of the 6 labels explicit, as recent data.table versions require
wmeans[, mean.v := rep(expand.grid(avg.vars, all.weights)[, 1], length.out = .N)]
wmeans[, mean.w := rep(expand.grid(avg.vars, all.weights)[, 2], length.out = .N)]
head(wmeans)
#    ITNUM   SATS ASSETS        V1    mean.v mean.w
# 1: First Always   0-10 11.668243 AVGVALUE1   WGT1
# 2: First Always   0-10  9.132899 AVGVALUE2   WGT1
# 3: First Always   0-10 11.668192 AVGVALUE1   WGT2
# 4: First Always   0-10  9.060045 AVGVALUE2   WGT2
# 5: First Always   0-10 11.668287 AVGVALUE1   WGT3
# 6: First Always   0-10  9.197005 AVGVALUE2   WGT3
We can use dcast to reshape this into a data.table that is long in avg.vars but wide in all.weights:
wide.wmeans = dcast(wmeans, mean.v + ITNUM + SATS + ASSETS ~ mean.w, value.var = "V1")
#        mean.v  ITNUM         SATS        ASSETS      WGT1      WGT2      WGT3
#  1: AVGVALUE1  First       Always          0-10 11.668243 11.668192 11.668287
#  2: AVGVALUE1  First        Never       101-200 11.373780 12.210083 11.601819
#  3: AVGVALUE1  First    Sometimes        26-100 12.430039 13.134499 12.013299
#  4: AVGVALUE1 Second       Always MORE THAN 200 12.322651 11.816135 12.567860
#  5: AVGVALUE1 Second Amost always         11-25 10.765557 11.346688 10.524583
#  6: AVGVALUE2  First       Always          0-10  9.132899  9.060045  9.197005
#  7: AVGVALUE2  First        Never       101-200 12.896584 13.278680 13.000772
#  8: AVGVALUE2  First    Sometimes        26-100 10.972260 11.215390 10.828431
#  9: AVGVALUE2 Second       Always MORE THAN 200 11.704404 11.611072 11.749586
# 10: AVGVALUE2 Second Amost always         11-25  8.086409  8.225030  8.028928
If you need this as a list rather than a data.table, you can use
lapply(avg.vars, function(x) wide.wmeans[mean.v == x])
# [[1]]
#       mean.v  ITNUM         SATS        ASSETS     WGT1     WGT2     WGT3
# 1: AVGVALUE1  First       Always          0-10 11.66824 11.66819 11.66829
# 2: AVGVALUE1  First        Never       101-200 11.37378 12.21008 11.60182
# 3: AVGVALUE1  First    Sometimes        26-100 12.43004 13.13450 12.01330
# 4: AVGVALUE1 Second       Always MORE THAN 200 12.32265 11.81613 12.56786
# 5: AVGVALUE1 Second Amost always         11-25 10.76556 11.34669 10.52458
#
# [[2]]
#       mean.v  ITNUM         SATS        ASSETS      WGT1      WGT2      WGT3
# 1: AVGVALUE2  First       Always          0-10  9.132899  9.060045  9.197005
# 2: AVGVALUE2  First        Never       101-200 12.896584 13.278680 13.000772
# 3: AVGVALUE2  First    Sometimes        26-100 10.972260 11.215390 10.828431
# 4: AVGVALUE2 Second       Always MORE THAN 200 11.704404 11.611072 11.749586
# 5: AVGVALUE2 Second Amost always         11-25  8.086409  8.225030  8.028928
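To see why Vectorize is needed in the outer approach, it may help to look at the same pattern outside data.table. outer() calls its function once with two long vectors holding every combination, so a function that handles one name pair at a time has to be vectorized first. The lists x and w below are made-up stand-ins for the value and weight columns.

```r
# Toy stand-ins (hypothetical) for the value columns and the weight columns
x <- list(A = c(1, 2, 3), B = c(10, 20, 30))
w <- list(W1 = c(1, 1, 1), W2 = c(1, 2, 3))

# Handles exactly one (value name, weight name) pair
f <- function(xn, wn) weighted.mean(x[[xn]], w[[wn]])

# Vectorize() wraps f in mapply(), so it accepts the full vectors of
# combinations that outer() passes in a single call
vf <- Vectorize(f)

m <- outer(names(x), names(w), vf)
dimnames(m) <- list(names(x), names(w))  # label rows by value, columns by weight
print(m)
```

Each cell of m is one value/weight combination, which is the quantity the long-format V1 column holds per group.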
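
Thanks, but I found a problem with both the lapply (which I prefer) and the for loop solutions. If you add one more column to mydata to compute the mean of (say CRMVAR = rnorm(10, 10, 2)) and then add it to avg.vars (avg.vars ...
@panman That is strange. Could you update the question with the new example and the expected output, so that I can reproduce and fix the problem?
Oh, sorry, this was entirely my mistake. I added the new variable (CRMVAR) at the beginning of my original syntax and, although I used the same seed, the values of the remaining variables changed (I am using R 3.3.1 on Linux), yet I compared them with the values in the example output I had already posted. Everything works fine, sorry for the confusion.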