R: data.table: compute weighted means of multiple variables by group, with multiple weight variables for each variable

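I'm still new to `data.table`. My question is similar to a couple of existing ones. The difference is that I want to compute weighted means of multiple variables by group, using multiple weights for each mean.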

Consider the following `data.table` (the real one is much larger):

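I added the first key variable to `by` (it appears as `key(mydata)[1]` in the code further down), even though it is a constant, because I want it as a column in the output. I get: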

   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458
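However, the actual `data.table` has many more columns to average (and more weights), so doing this one column at a time would be cumbersome. What I have in mind is a function in which the mean of each variable (`AVGVALUE1`, `AVGVALUE2`, etc.) is computed with each weight variable (`WGT1`, `WGT2`, `WGT3`, etc.), and the output for each averaged variable is added to a list. I think a list is the best option, because if all estimates ended up in a single output the number of columns could grow without limit. For example: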

[[1]]
   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458

[[2]]
   CLID  ITNUM         SATS        ASSETS        V1        V2        V3
1:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
2:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
3:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
4:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
5:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928
What I have tried so far:
  • Using `lapply`

    all.weights <- c("WGT1", "WGT2", "WGT3")
    avg.vars <- c("AVGVALUE1", "AVGVALUE2")
    split.vars <- c("ITNUM", "SATS", "ASSETS")
    
    lapply(mydata, function(i) {
    mydata[ , Map(f = weighted.mean, x = mget(avg.vars)[i], w = mget(all.weights),
    na.rm = TRUE), by = c(key(mydata)[1], split.vars)]
    })
    
    Error in weighted.mean.default(x = dots[[1L]][[1L]], w = dots[[2L]][[1L]],  : 
     'x' and 'w' must have the same length
    
    myfun <- function(data, spl.v, avg.v, wgts) {
      data[ , Map(f = weighted.mean, x = mget(avg.v), w = mget(all.weights),
      na.rm = TRUE), by = c(key(data)[1], spl.v)]
    }
    
    mapply(FUN = myfun, data = mydata, spl.v = split.vars, avg.v = avg.vars,
    wgts = all.weights)
    
    Error: value for ‘AVGVALUE2’ not found
    
  • I tried wrapping `mget(avg.v)` in a list, `.(mget(avg.v))`, but then I get another error:

     Error in mapply(FUN = f, ..., SIMPLIFY = FALSE) : 
      could not find function "." 
    

    Can anyone help?

    I. `lapply` solution

    all.weights <- c("WGT1", "WGT2", "WGT3")
    avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
    split.vars  <- c("ITNUM", "SATS", "ASSETS")
    
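    # Weighted means of one variable (named by `v`) under every weight, by group
    myfun <- function(v){
      tmp <-
        mydata[ , Map(f = weighted.mean, 
                    x = .(get(v)),          # the single variable, recycled by Map
                    w = mget(all.weights),  # one weight column per call
                    na.rm = TRUE), 
              by = c(key(mydata)[1], split.vars)]  
    
      return(tmp) # totally optional, a habit from using C and Java
    }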
    
    lapply(avg.vars, myfun)
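
As a variant of the `lapply` idea, here is a minimal self-contained sketch that names both the list elements and the weight columns, so each piece of the output says which variable and which weight it belongs to. The toy data, the helper name `wmean_by_var`, and the use of `.SD` with `.SDcols` are assumptions made for illustration; they are not part of the original answer.

```r
library(data.table)

# Hypothetical toy data: column names follow the question, values are made up.
set.seed(1)
toy <- data.table(
  CLID      = "CNK",
  ITNUM     = rep(c("First", "Second"), each = 4),
  AVGVALUE1 = rnorm(8, 10, 2),
  AVGVALUE2 = rnorm(8, 10, 2),
  WGT1      = runif(8, 1, 3),
  WGT2      = runif(8, 1, 3)
)

avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
all.weights <- c("WGT1", "WGT2")

# Weighted means of variable `v` under every weight, by group.
# sapply(..., simplify = FALSE) returns a list named after the weights,
# so the output columns are WGT1, WGT2 instead of V1, V2.
wmean_by_var <- function(v) {
  toy[, sapply(all.weights,
               function(w) weighted.mean(.SD[[v]], .SD[[w]], na.rm = TRUE),
               simplify = FALSE),
      by = .(CLID, ITNUM), .SDcols = c(v, all.weights)]
}

# Named list: one data.table per averaged variable.
res <- setNames(lapply(avg.vars, wmean_by_var), avg.vars)
res$AVGVALUE1
```

Accessing `res$AVGVALUE2` then gives the table for the second variable, which may be easier to work with than positional `[[2]]` indexing once there are many variables.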
    
    II. `for` loop solution

    Using a simple `for` loop, e.g. when `avg.vars` has 2 values:

    all.weights <- c("WGT1", "WGT2", "WGT3")
    avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
    split.vars  <- c("ITNUM", "SATS", "ASSETS")
    
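    # Empty 7-column frame: 4 grouping columns plus one column per weight
    result <- data.frame(matrix(nrow=0,ncol=7))
    for(i in avg.vars){
      # Weighted means of variable i under every weight, by group
      tmp <- 
        mydata[ , Map(f = weighted.mean, 
                    x = .(get(i)), 
                    w = mget(all.weights),
                    na.rm = TRUE), 
              by = c(key(mydata)[1], split.vars)]  
    
      # Stack the rows for this variable under the previous ones
      result <- rbind(result,tmp,use.names=FALSE)
    }
    colnames(result) <- c("CLID", "ITNUM", "SATS", "ASSETS", "V1", "V2", "V3")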
    result
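
Because the loop stacks the rows for every variable into one table, nothing in `result` records which variable a row belongs to. One possible way to keep that information, sketched here with made-up numbers and hypothetical names (`pieces`, `stacked`, and the `variable` column), is to collect the per-variable tables in a named list and let `rbindlist` add an identifier column through its `idcol` argument:

```r
library(data.table)

# Hypothetical per-variable results standing in for the loop's `tmp` tables;
# the numbers are invented for illustration.
pieces <- list(
  AVGVALUE1 = data.table(ITNUM = c("First", "Second"), V1 = c(11.7, 12.3)),
  AVGVALUE2 = data.table(ITNUM = c("First", "Second"), V1 = c(9.1, 11.7))
)

# idcol adds a column holding the list names, so every row keeps its origin.
stacked <- rbindlist(pieces, idcol = "variable")
stacked
```

In the loop itself this would amount to assigning each `tmp` into a list element named after `i` and calling `rbindlist` once at the end, which also avoids growing the table on every iteration.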
    
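    Pros:

    • Runs instantly on the example
    • Scales to any number of columns with no extra data manipulation or coding
    • Saves a lot of time compared with doing it one by one
    • Returns a nice `data.table`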

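    • If you really do want a list, you can get one by initializing `result` as a list.

      We can use `outer` (which applies a function to all combinations of the values in two input vectors) to compute every variable/weight combination, operating on a vectorized weighted-mean function. By defining the function that `outer` uses inside the data.table scope, we can have `get` evaluate against the data.table columns: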

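      wmeans = mydata[, {
        # f looks up a value column X and a weight column Y by name within the group
        f  = function(X,Y) weighted.mean(get(X), get(Y))
        # outer() passes whole vectors of names, so f has to be vectorized
        vf = Vectorize(f)
        outer(avg.vars, all.weights, vf)},
        by = split.vars]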
      
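      This puts all the means into a single column (i.e. "long" format). We can also add columns that identify which value/weight combination each row refers to: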

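      # expand.grid varies its first argument fastest, like outer()'s matrix;
      # rep() spells out the recycling that recent data.table versions no longer do silently
      combos = expand.grid(avg.vars, all.weights)
      wmeans[, mean.v := rep(combos[,1], length.out = .N)]       
      wmeans[, mean.w := rep(combos[,2], length.out = .N)]
      head(wmeans)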
      #    ITNUM   SATS ASSETS        V1    mean.v mean.w
      # 1: First Always   0-10 11.668243 AVGVALUE1   WGT1
      # 2: First Always   0-10  9.132899 AVGVALUE2   WGT1
      # 3: First Always   0-10 11.668192 AVGVALUE1   WGT2
      # 4: First Always   0-10  9.060045 AVGVALUE2   WGT2
      # 5: First Always   0-10 11.668287 AVGVALUE1   WGT3
      # 6: First Always   0-10  9.197005 AVGVALUE2   WGT3
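
To see why the labels line up with the values, it may help to look at `outer` and `Vectorize` outside of data.table. The sketch below uses two small made-up lists (`x`, `w`) in place of table columns; the names are illustrative only. The matrix that `outer` returns is stored column by column, which is the same order in which `expand.grid` enumerates its rows (first argument varying fastest).

```r
# Made-up inputs standing in for value and weight columns.
x <- list(a = c(1, 2, 3), b = c(10, 20, 30))
w <- list(w1 = c(1, 1, 1), w2 = c(0, 0, 1))

# f works on one pair of names; Vectorize() lets outer() pass whole vectors.
f  <- function(xn, wn) weighted.mean(x[[xn]], w[[wn]])
vf <- Vectorize(f)

m <- outer(names(x), names(w), vf)
dimnames(m) <- list(names(x), names(w))
m
#   w1 w2
# a  2  3
# b 20 30

# Flattening goes down the columns: (a, w1), (b, w1), (a, w2), (b, w2)
as.vector(m)
# 2 20 3 30
```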
      
      We can use `dcast` to reshape this into a data.table that is long in `avg.vars` but wide in `all.weights`:

      wide.wmeans = dcast(wmeans, mean.v+ITNUM+SATS+ASSETS ~ mean.w, value.var = "V1")  
      #       mean.v  ITNUM         SATS        ASSETS      WGT1      WGT2      WGT3
      # 1: AVGVALUE1  First       Always          0-10 11.668243 11.668192 11.668287
      # 2: AVGVALUE1  First        Never       101-200 11.373780 12.210083 11.601819
      # 3: AVGVALUE1  First    Sometimes        26-100 12.430039 13.134499 12.013299
      # 4: AVGVALUE1 Second       Always MORE THAN 200 12.322651 11.816135 12.567860
      # 5: AVGVALUE1 Second Amost always         11-25 10.765557 11.346688 10.524583
      # 6: AVGVALUE2  First       Always          0-10  9.132899  9.060045  9.197005
      # 7: AVGVALUE2  First        Never       101-200 12.896584 13.278680 13.000772
      # 8: AVGVALUE2  First    Sometimes        26-100 10.972260 11.215390 10.828431
      # 9: AVGVALUE2 Second       Always MORE THAN 200 11.704404 11.611072 11.749586
      #10: AVGVALUE2 Second Amost always         11-25  8.086409  8.225030  8.028928
      
      If you need it as a list rather than a data.table, you can use

      lapply(avg.vars, function(x) wide.wmeans[mean.v == x])
      # [[1]]
      #       mean.v  ITNUM         SATS        ASSETS     WGT1     WGT2     WGT3
      # 1: AVGVALUE1  First       Always          0-10 11.66824 11.66819 11.66829
      # 2: AVGVALUE1  First        Never       101-200 11.37378 12.21008 11.60182
      # 3: AVGVALUE1  First    Sometimes        26-100 12.43004 13.13450 12.01330
      # 4: AVGVALUE1 Second       Always MORE THAN 200 12.32265 11.81613 12.56786
      # 5: AVGVALUE1 Second Amost always         11-25 10.76556 11.34669 10.52458
      # 
      # [[2]]
      #       mean.v  ITNUM         SATS        ASSETS      WGT1      WGT2      WGT3
      # 1: AVGVALUE2  First       Always          0-10  9.132899  9.060045  9.197005
      # 2: AVGVALUE2  First        Never       101-200 12.896584 13.278680 13.000772
      # 3: AVGVALUE2  First    Sometimes        26-100 10.972260 11.215390 10.828431
      # 4: AVGVALUE2 Second       Always MORE THAN 200 11.704404 11.611072 11.749586
      # 5: AVGVALUE2 Second Amost always         11-25  8.086409  8.225030  8.028928
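
data.table also provides a `split` method with a `by` argument, which might be a shorter route to the same kind of list and returns it with names. The sketch below uses a small made-up table (`wide`) rather than the `wide.wmeans` object above, so the values are hypothetical:

```r
library(data.table)

# Hypothetical stand-in for a wide table of weighted means; values are invented.
wide <- data.table(
  mean.v = rep(c("AVGVALUE1", "AVGVALUE2"), each = 2),
  ITNUM  = rep(c("First", "Second"), times = 2),
  WGT1   = c(11.7, 12.3, 9.1, 11.7)
)

# One data.table per distinct value of mean.v, named after that value.
parts <- split(wide, by = "mean.v")
names(parts)
```

The elements can then be accessed by name, for example `parts$AVGVALUE1`.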
      

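      Thanks, but I found a problem with both the `lapply` (which I prefer) and the `for` loop solutions. If you add one more column to `mydata` to compute the mean of (say `CRMVAR=rnorm(10,10,2)`) and then add it to `avg.vars` …

      @panman That is strange. Can you update the question with the new example and the expected output, so that I can reproduce and fix the problem?

      Oh, sorry, this was entirely my mistake. I added the new variable (`CRMVAR`) at the beginning and used the original syntax, and although I used the same seed, the values of the remaining variables changed (I am using R 3.3.1 on Linux), yet I compared those values with the ones in the sample output I had already posted. Everything works fine, sorry for the confusion.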