Python: How do I generate a random subsample of a population with specific requirements?

Tags: python, r, sampling

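Suppose I have a population with a mix of ages and genders (and possibly other attributes), and I want to generate a random subsample (sampling with replacement is fine) that has specific properties, for example:

  • Sample size
  • 50% of the sample should be of age …

Edit:

Here is a sample in which 70% are under 30 and 20% are male: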

    # Build a population of N rows: 70% under 30, 20% male
    N <- 100000
    orig_u30 <- 0.7
    orig_male <- 0.2
    set.seed(42)
    my_sample <- data.frame(age = sample(c("under 30", "30+"), N, replace = TRUE,
                                         prob = c(orig_u30, 1 - orig_u30)),
                            gender = sample(c("M", "F"), N, replace = TRUE,
                                            prob = c(orig_male, 1 - orig_male)))
    # Joint and marginal shares of age group by gender
    addmargins(prop.table(table(my_sample$age, my_sample$gender)))
                     F       M     Sum
      30+      0.24292 0.05935 0.30227
      under 30 0.55675 0.14098 0.69773
      Sum      0.79967 0.20033 1.00000
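
Since the question is also tagged Python, the same population can be sketched with the standard library alone. This is an illustrative translation of the R snippet above, not the answerer's code. The names `population`, `joint`, `share_u30`, and `share_male` are made up for the example, and the exact figures will differ from the R output because the random number generators differ.

```python
import random
from collections import Counter

random.seed(42)

N = 100_000
ORIG_U30 = 0.7   # share of the population under 30
ORIG_MALE = 0.2  # share of the population that is male

# Draw age group and gender independently, as the R code does.
ages = random.choices(["under 30", "30+"],
                      weights=[ORIG_U30, 1 - ORIG_U30], k=N)
genders = random.choices(["M", "F"],
                         weights=[ORIG_MALE, 1 - ORIG_MALE], k=N)
population = list(zip(ages, genders))

# Joint shares, comparable to prop.table(table(age, gender)).
joint = {cell: n / N for cell, n in Counter(population).items()}

# Marginal shares, comparable to the "Sum" row and column of addmargins().
share_u30 = sum(p for (age, _), p in joint.items() if age == "under 30")
share_male = sum(p for (_, gender), p in joint.items() if gender == "M")

for cell in sorted(joint):
    print(cell, round(joint[cell], 4))
print("under 30:", round(share_u30, 4), "male:", round(share_male, 4))
```

With 100,000 rows, the realized shares should land close to the nominal 70% and 20%, though not exactly on them.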
    
Now each row carries a weight that pulls the subsample toward the desired shares:

    library(dplyr)
    # Draw 10,000 rows with replacement, with probability proportional to each row's weight
    my_subsample <- sample_n(my_sample, 10000, replace = TRUE, weight = my_sample$weight)
    
    addmargins(prop.table(table(my_subsample$age, my_subsample$gender)))
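
The step that computes `my_sample$weight` does not appear in the text above. One common way to build such weights, sketched here in Python with the standard library, is to divide each category's target share by its share in the population and multiply the ratios across attributes. The target shares of 40% under 30 and 40% male are assumptions chosen for illustration, as are all variable names. The product of ratios only reproduces both targets when the attributes are independent in the population, as they are in this simulated data.

```python
import random
from collections import Counter

random.seed(42)

# Population shares from the example; the target shares are ASSUMED here.
ORIG_U30, ORIG_MALE = 0.7, 0.2
NEW_U30, NEW_MALE = 0.4, 0.4

# Simulated population with independent age group and gender.
N = 100_000
ages = random.choices(["under 30", "30+"],
                      weights=[ORIG_U30, 1 - ORIG_U30], k=N)
genders = random.choices(["M", "F"],
                         weights=[ORIG_MALE, 1 - ORIG_MALE], k=N)
population = list(zip(ages, genders))

# Weight per category = target share / original share.
age_weight = {"under 30": NEW_U30 / ORIG_U30,
              "30+": (1 - NEW_U30) / (1 - ORIG_U30)}
gender_weight = {"M": NEW_MALE / ORIG_MALE,
                 "F": (1 - NEW_MALE) / (1 - ORIG_MALE)}

# Row weight = product of the per-attribute ratios.
weights = [age_weight[a] * gender_weight[g] for a, g in population]

# Weighted draw with replacement, analogous to sample_n(..., weight = ...).
subsample = random.choices(population, weights=weights, k=10_000)

counts = Counter(subsample)
sub_u30 = sum(n for (a, _), n in counts.items() if a == "under 30") / len(subsample)
sub_male = sum(n for (_, g), n in counts.items() if g == "M") / len(subsample)
print("under 30:", round(sub_u30, 4), "male:", round(sub_male, 4))
```

When attributes are correlated, per-attribute ratios no longer hit every margin at once, and an iterative scheme such as raking (iterative proportional fitting) is the usual alternative.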
    

Original answer: this generates a weighted sample, but not a weighted subsample.

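    N <- 1000
    median_age <- 30   # used as the Poisson mean; the median lands close to it
    male <- 0.2

    # Ages drawn from a Poisson distribution, gender drawn with P(M) = male
    my_sample <- data.frame(age = rpois(N, median_age),
               gender = sample(c("M", "F"), N, replace = TRUE, prob = c(male, 1 - male)))

    median(my_sample$age)   # 30 on most runs
    table(my_sample$gender) # roughly 200 "M" out of 1000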
    

The reweight package sounds like it might help.

Shares in the weighted subsample (the cross-tabulation of my_subsample):

                    F      M    Sum
      30+      0.3683 0.2348 0.6031
      under 30 0.2375 0.1594 0.3969
      Sum      0.6058 0.3942 1.0000
    