R data.table: weighted percent of group

I have a data.table like this:
library(data.table)
widgets <- data.table(serial_no=1:100, 
                      color=rep_len(c("red","green","blue","black"),length.out=100),
                      style=rep_len(c("round","pointy","flat"),length.out=100),
                      weight=rep_len(1:5,length.out=100) )

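But I can't use that approach to answer questions like "By weight, what percentage of red widgets are round?" I could only come up with a two-step approach: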
# example B
widgets[, list(cs_weight = sum(weight)), by = list(color, style)][
  , list(style, style_pct_of_color_by_weight = cs_weight / sum(cs_weight)), by = color]

#    color  style style_pct_of_color_by_weight
# 1:   red  round                    0.3466667
# 2:   red pointy                    0.3466667
# 3:   red   flat                    0.3066667
# 4: green pointy                    0.3333333
# ...

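I'm looking for a single-step approach to B, improved if possible, with an explanation that deepens my understanding of data.table syntax for by-group operations. Please note that this question differs from related questions, because mine involves subgroups and aims to avoid multiple steps. TYVM.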
This is almost a single step:
# A
widgets[,{
    totwt = .N
    .SD[,.(frac=.N/totwt),by=style]
},by=color]
    # color  style frac
 # 1:   red  round 0.36
 # 2:   red pointy 0.32
 # 3:   red   flat 0.32
 # 4: green pointy 0.36
 # 5: green   flat 0.32
 # 6: green  round 0.32
 # 7:  blue   flat 0.36
 # 8:  blue  round 0.32
 # 9:  blue pointy 0.32
# 10: black  round 0.36
# 11: black pointy 0.32
# 12: black   flat 0.32

# B
widgets[,{
    totwt = sum(weight)
    .SD[,.(frac=sum(weight)/totwt),by=style]
},by=color]
 #    color  style      frac
 # 1:   red  round 0.3466667
 # 2:   red pointy 0.3466667
 # 3:   red   flat 0.3066667
 # 4: green pointy 0.3333333
 # 5: green   flat 0.3200000
 # 6: green  round 0.3466667
 # 7:  blue   flat 0.3866667
 # 8:  blue  round 0.2933333
 # 9:  blue pointy 0.3200000
# 10: black  round 0.3733333
# 11: black pointy 0.3333333
# 12: black   flat 0.2933333

How it works: construct the denominator for the top-level group (color) first, then move to the finer group (color with style) to tabulate.
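The same idea (denominator for the coarse group first, then the finer grouping) can also be sketched with a helper column instead of a nested `.SD` call. This is only an illustration under assumptions: `color_wt` is a hypothetical column name that does not appear in the answer, and the code works on a copy so that `widgets` itself is left unchanged.

```r
library(data.table)

widgets <- data.table(serial_no = 1:100,
                      color  = rep_len(c("red", "green", "blue", "black"), length.out = 100),
                      style  = rep_len(c("round", "pointy", "flat"), length.out = 100),
                      weight = rep_len(1:5, length.out = 100))

w <- copy(widgets)                              # avoid modifying widgets by reference

# Coarse group: total weight per color (hypothetical helper column)
w[, color_wt := sum(weight), by = color]

# Finer group: weight per color/style divided by the color total,
# which is constant within each color/style group
res <- w[, .(frac = sum(weight) / color_wt[1L]), by = .(color, style)]

# Sanity check: the fractions within each color should add up to 1
res[, .(total = sum(frac)), by = color]
```

Whether this counts as "one step" is debatable, since the helper column is a second grouped pass, but it keeps the two levels of grouping explicit.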

Alternatives. If the styles repeat within each color and this is only for display purposes, try a table:
# A
widgets[,
  prop.table(table(color,style),1)
]
#        style
# color   flat pointy round
#   black 0.32   0.32  0.36
#   blue  0.36   0.32  0.32
#   green 0.32   0.36  0.32
#   red   0.32   0.32  0.36

# B
widgets[,rep(1L,sum(weight)),by=.(color,style)][,
  prop.table(table(color,style),1)
]

#        style
# color        flat    pointy     round
#   black 0.2933333 0.3333333 0.3733333
#   blue  0.3866667 0.3200000 0.2933333
#   green 0.3200000 0.3333333 0.3466667
#   red   0.3066667 0.3466667 0.3466667

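For B, this expands the data so that there is one observation per unit of weight. With large data, such an expansion would be a bad idea (since it costs a lot of memory). Also, weight must be an integer; otherwise, the repeat count will be silently truncated to a whole number (e.g., try rep(1, 2.5) # [1] 1 1).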
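To make the truncation caveat concrete, and to show one way around the expansion, here is a small sketch. Note that `xtabs()` is a technique swapped in for illustration and is not part of the answer above: it sums `weight` directly into the color-by-style cells, so no rows are replicated and non-integer weights would also work.

```r
library(data.table)

widgets <- data.table(serial_no = 1:100,
                      color  = rep_len(c("red", "green", "blue", "black"), length.out = 100),
                      style  = rep_len(c("round", "pointy", "flat"), length.out = 100),
                      weight = rep_len(1:5, length.out = 100))

# rep() drops the fractional part of the repeat count without warning
n_rep <- length(rep(1, 2.5))    # 2 copies, not 2.5

# Assumed alternative: cross-tabulate the summed weights, then take row proportions
tab <- prop.table(xtabs(weight ~ color + style, data = widgets), 1)
tab
```

Like the `table` version, this yields a matrix-like object meant for display rather than a data.table for further grouped work.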
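Using dplyr:

library(dplyr)

# A: share of each style within a color, by count
df <- widgets %>% 
  group_by(color, style) %>%
  summarise(count = n()) %>%             # summarise() drops the last grouping level (style)
  mutate(freq = count/sum(count))        # so sum(count) is the total per color

# B: share of each style within a color, by weight
df2 <- widgets %>% 
  group_by(color, style) %>%
  summarise(count_w = sum(weight)) %>%
  mutate(freq = count_w/sum(count_w))  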

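Compute a frequency table of style within each color, then for each row look up the frequency of that row's style in the table, and finally divide by the number of rows in that color: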
widgets[, frac := table(style)[style] / .N, by = color]

giving:
  > widgets
     serial_no color  style weight frac
  1:         1   red  round      1 0.36
  2:         2 green pointy      2 0.36
  3:         3  blue   flat      3 0.36
  4:         4 black  round      4 0.36
  5:         5   red pointy      5 0.32
  6:         6 green   flat      1 0.32
  7:         7  blue  round      2 0.32
  8:         8 black pointy      3 0.32
  9:         9   red   flat      4 0.32
 10:        10 green  round      5 0.32
 ... etc ...
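The lookup above gives the unweighted version (A). A weighted analogue (B) can be sketched along the same lines by replacing the frequency table with per-style weight sums. This is an assumption-laden illustration rather than part of the answer: `tapply()` is swapped in for `table()`, `frac_w` is a hypothetical column name, and the code works on a copy of `widgets`.

```r
library(data.table)

widgets <- data.table(serial_no = 1:100,
                      color  = rep_len(c("red", "green", "blue", "black"), length.out = 100),
                      style  = rep_len(c("round", "pointy", "flat"), length.out = 100),
                      weight = rep_len(1:5, length.out = 100))

w <- copy(widgets)   # leave widgets untouched

# Within each color: sum weight per style, look up each row's style in that
# named result, then divide by the total weight of the color.
# as.numeric() strips the array attributes returned by tapply().
w[, frac_w := as.numeric(tapply(weight, style, sum)[style]) / sum(weight), by = color]

head(w)
```

As with the unweighted version, every row of a given color and style carries the same value, so `unique()` on the three columns would recover a summary table.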

If needed, this can be translated to base R or dplyr:
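# base
# as.numeric() drops the table class so the result is a plain numeric vector
prop <- function(x) as.numeric(table(x)[x]) / length(x)
# ave() keeps the type of its first argument (character here), so convert back to numeric
transform(widgets, frac = as.numeric(ave(style, color, FUN = prop)))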

# dplyr - uses prop function from above
library(dplyr)
widgets %>% group_by(color) %>% mutate(frac = prop(style)) %>% ungroup

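That's how I would do it too, but I'm also interested in finding a better way.
Thanks @Frank. It will take me a while to grok the dot notation and the embedded assignment, but this is a nice approach.
Your first version can be rewritten without the temporary variable, like this: widgets[, .(frac = .SD[, .N, by=style]$N/.N), by=color]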
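@Arun The style column should also be in the result. Looking at @Frank's answer below, I noticed that my attempt was not only clumsy but also incorrect. For example, I checked widgets[, sum(style=="round" & color=="red")/sum(color=="red")] # 0.36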
Thanks @drsh1, I appreciate how intuitive and useful dplyr is here. My question is about how to do this with data.table syntax.