Random sampling by weight using dplyr and sample_n
I want to randomly sample months according to a set of weights given by an index held in a separate data frame, where the index varies with some categorical variables. Here is an example problem:
require(dplyr)
sim.size <- 1000
# Generating the weights for each month and category combination
class_probs <- data_frame(categoryA=rep(letters[1:3],24),
                          categoryB=rep(LETTERS[1:2],each=36),
                          Month=rep(month.name,6),
                          MonthIndex=runif(72))
# Generating some randomly simulated categories
sim.data <- data_frame(categoryA=sample(letters[1:3],size=sim.size,replace=TRUE),
                       categoryB=sample(LETTERS[1:2],size=sim.size,replace=TRUE))
# This is where I need help.
# I would like to add an extra column called Month to the end of sim.data.
# It should be sampled using the class_probs data, taking both categoryA
# and categoryB into account to pick the weights in MonthIndex.
sim.data %>%
  group_by(categoryA,categoryB) %>%
  do(sample_n(class_probs[class_probs$categoryA==categoryA &
                          class_probs$categoryB==categoryB, ],
              size=nrow(sim.data[sim.data$categoryA==categoryA &
                                 sim.data$categoryB==categoryB]),
              replace=TRUE,
              weight=MonthIndex)$Month)
Here is one approach: use a helper function to do the sampling, then a simple mutate call in dplyr to create the new column.
Helper function:
# Draw one Month for the category pair (x, y), weighted by MonthIndex
sampler <- function(x, y, df) {
  tab <- sample_n(df %>% filter(categoryA==x,
                                categoryB==y),
                  size=1,
                  replace=TRUE,
                  weight=MonthIndex)
  return(tab$Month)
}
Result:
sim.data %>%
  rowwise() %>%
  mutate(month = sampler(categoryA, categoryB, class_probs))
Source: local data frame [1,000 x 3]
Groups: <by row>
   categoryA categoryB     month
1          b         B  February
2          b         A  February
3          b         B       May
4          c         B  December
5          c         B      June
6          b         A    August
7          c         A     March
8          c         A September
9          b         A    August
10         c         A  December
..       ...       ...       ...
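The row-wise approach above calls the helper once per row. As an alternative, here is a minimal sketch that draws all months for a category pair in one call, using base R's sample() with its prob argument inside a grouped mutate. The helper name draw_months, the result name sim.result, the seed, and the use of tibble() in place of data_frame() are assumptions made for this illustration and are not part of the original answer.

```r
library(dplyr)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
sim.size <- 1000

# Same example data as in the question (with the missing comma restored)
class_probs <- tibble(
  categoryA  = rep(letters[1:3], 24),
  categoryB  = rep(LETTERS[1:2], each = 36),
  Month      = rep(month.name, 6),
  MonthIndex = runif(72)
)
sim.data <- tibble(
  categoryA = sample(letters[1:3], size = sim.size, replace = TRUE),
  categoryB = sample(LETTERS[1:2], size = sim.size, replace = TRUE)
)

# Hypothetical helper: draw n months in one call for a single category pair
draw_months <- function(a, b, n, weights) {
  w <- weights[weights$categoryA == a & weights$categoryB == b, ]
  sample(w$Month, size = n, replace = TRUE, prob = w$MonthIndex)
}

# One sample() call per (categoryA, categoryB) group instead of one per row
sim.result <- sim.data %>%
  group_by(categoryA, categoryB) %>%
  mutate(month = draw_months(first(categoryA), first(categoryB), n(), class_probs)) %>%
  ungroup()
```

Because sample() normalises the prob vector internally, the MonthIndex values do not need to sum to one within a group.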
The first line of the class_probs definition must end with a comma.
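A further point that may be worth checking in the example data: because categoryA cycles every 3 rows and Month every 12, the rep() calls pair each categoryA value with only four distinct months (which matches the sample output, where b only ever gets February, May, August or November). If a full crossing of categories and months is intended, expand.grid() is one way to build it. The sketch below illustrates both cases; the names class_probs_rep, class_probs_full and months_per_pair are assumptions made for this example.

```r
# Weights table built as in the question (MonthIndex omitted for the check)
class_probs_rep <- data.frame(
  categoryA = rep(letters[1:3], 24),
  categoryB = rep(LETTERS[1:2], each = 36),
  Month     = rep(month.name, 6),
  stringsAsFactors = FALSE
)

# Count the distinct months available to each (categoryA, categoryB) pair
months_per_pair <- aggregate(
  Month ~ categoryA + categoryB,
  data = class_probs_rep,
  FUN  = function(m) length(unique(m))
)
print(months_per_pair)

# Assumption: a full crossing is wanted, so every pair gets all 12 months
class_probs_full <- expand.grid(
  categoryA = letters[1:3],
  categoryB = LETTERS[1:2],
  Month     = month.name,
  stringsAsFactors = FALSE
)
class_probs_full$MonthIndex <- runif(nrow(class_probs_full))
```

Both tables have 72 rows; only the second one gives each category pair a weight for every month.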