How can I generate a random subsample of a population with specific requirements in Python?
Suppose I have a population with a mix of ages and genders (and possibly other attributes), and I want to generate a random subsample (sampling with replacement is fine) that has specific properties, for example:
- Sample size
- 50% of the sample should be in a given age range

Edit:

Here is a sample in which 70% are under 30 and 20% are male:
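    N <- 100000
    orig_u30 <- 0.7
    orig_male <- 0.2

    set.seed(42)
    my_sample <- data.frame(
      age = sample(c("under 30", "30+"), N, replace = TRUE,
                   prob = c(orig_u30, 1 - orig_u30)),
      gender = sample(c("M", "F"), N, replace = TRUE,
                      prob = c(orig_male, 1 - orig_male))
    )
    # Joint shares of age group and gender, with row and column totals
    addmargins(prop.table(table(my_sample$age, my_sample$gender)))

                   F       M     Sum
    30+      0.24292 0.05935 0.30227
    under 30 0.55675 0.14098 0.69773
    Sum      0.79967 0.20033 1.00000

Now we have a weight for each row (stored in my_sample$weight) that pulls the subsample toward the desired shares:

    library(dplyr)
    # Draw 10,000 rows with replacement, with probability proportional to the row weight.
    # Note: sample_n() is superseded by slice_sample() in dplyr >= 1.0.0.
    my_subsample <- sample_n(my_sample, 10000, replace = TRUE, weight = my_sample$weight)
    addmargins(prop.table(table(my_subsample$age, my_subsample$gender)))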
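The code that fills my_sample$weight is not shown here. As a rough sketch in Python (the language the question asks about), one common way to build such weights is to give each row the ratio of target share to observed share for each attribute and multiply those ratios together. The target shares of 40% below are an assumption chosen to resemble the subsample output in this answer, and attribute_weights is a hypothetical helper, not part of any library. The product form only reproduces the targets exactly when the attributes are independent in the population.

```python
import random
from collections import Counter


def attribute_weights(rows, targets):
    """Return one sampling weight per row: the product, over attributes,
    of (target share / observed share) for the row's value of that attribute."""
    n = len(rows)
    weights = [1.0] * n
    for attr, target_shares in targets.items():
        observed = Counter(row[attr] for row in rows)
        for i, row in enumerate(rows):
            value = row[attr]
            weights[i] *= target_shares[value] / (observed[value] / n)
    return weights


rng = random.Random(42)  # arbitrary seed, for repeatability only
N = 100_000

# Population: roughly 70% under 30 and 20% male, drawn independently.
population = [
    {
        "age": rng.choices(["under 30", "30+"], weights=[0.7, 0.3])[0],
        "gender": rng.choices(["M", "F"], weights=[0.2, 0.8])[0],
    }
    for _ in range(N)
]

# Assumed target shares for the subsample (not stated in the original text).
targets = {
    "age": {"under 30": 0.4, "30+": 0.6},
    "gender": {"M": 0.4, "F": 0.6},
}

weights = attribute_weights(population, targets)

# Weighted draw with replacement, analogous to the weighted sample_n() call.
subsample = rng.choices(population, weights=weights, k=10_000)

share_u30 = sum(r["age"] == "under 30" for r in subsample) / len(subsample)
share_male = sum(r["gender"] == "M" for r in subsample) / len(subsample)
print(f"under 30: {share_u30:.3f}, male: {share_male:.3f}")
```

Both printed shares should land close to the 40% targets, up to sampling noise. If the attributes are strongly correlated, an iterative scheme such as raking may fit the margins better than a single product of ratios.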
The R subsample (my_subsample) ends up with about 40% under 30 and about 39% male:

                  F      M    Sum
    30+      0.3683 0.2348 0.6031
    under 30 0.2375 0.1594 0.3969
    Sum      0.6058 0.3942 1.0000

Original answer (this generates a weighted sample, not a weighted subsample):

    N <- 1000
    median_age <- 30
    male <- 0.2

    my_sample <- data.frame(
      age = rpois(N, median_age),
      gender = sample(c("M", "F"), N, replace = TRUE, prob = c(male, 1 - male))
    )
    median(my_sample$age)    # will be 30 on most runs
    table(my_sample$gender)  # "M" will be around 200 of 1000

The reweight package sounds like it might help.
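For readers working in Python, the original answer can be approximated with the standard library alone. This is only a sketch: poisson is a hypothetical helper that uses Knuth's multiplication method, because the random module has no Poisson generator, and the seed is arbitrary.

```python
import math
import random
import statistics


def poisson(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    # Multiply uniforms until the running product drops to exp(-lam) or below.
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


rng = random.Random(42)  # arbitrary seed
N = 1000
median_age = 30
male = 0.2

# Ages from a Poisson distribution, genders with a 20% chance of "M".
ages = [poisson(rng, median_age) for _ in range(N)]
genders = rng.choices(["M", "F"], weights=[male, 1 - male], k=N)

sample_median = statistics.median(ages)  # close to 30 on most runs
male_count = genders.count("M")          # around 200 of 1000
print(sample_median, male_count)
```

As in the R version, this draws a fresh sample with the desired properties rather than a subsample of an existing population.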