Python: simulating pulling marbles from a bag without replacement (efficiently)

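I need to simulate the hypergeometric distribution in Python (a fancy term for sampling elements without replacement).

Setup: there is a bag filled with population many marbles. There are two types of marbles, red and green (in the implementations below, marbles are represented as True and False). The number of marbles to pull from the bag is sample.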
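
For reference, the distribution being simulated can be written down explicitly. With N marbles in the bag (population), K of them red (population / 2 in this setup) and n marbles pulled (sample), the probability of pulling exactly k red marbles is given below. The symbols N, K, n and k are notation introduced here, not taken from the question.

```latex
P(X = k) = \frac{\binom{K}{k}\,\binom{N-K}{n-k}}{\binom{N}{n}},
\qquad \max\bigl(0,\; n - (N - K)\bigr) \le k \le \min(n, K)
```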

Below are the two implementations I have come up with for this problem, but both of them start to slow down when population > 10^8.

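import random

def pull_marbles(sample, population=100):
    assert population % 2 == 0
    # True = red, False = green; the first half of the bag is red
    marbles = [x < population / 2 for x in range(0, population)]
    chosen = []
    for i in range(0, sample):
        # pick an index among the marbles still in the bag
        choice = random.randint(0, population - i - 1)
        chosen.append(marbles[choice])
        del marbles[choice]
    return marbles  # original (mistaken) return value; see the EDIT below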

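This implementation is very readable and clearly follows the setup of the problem. However, it has to create a list of length population, which seems to be the bottleneck.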

def pull_marbles2(sample, population=100):
    assert population % 2 == 0
    return random.sample([x < population / 2 for x in range(0, population)], sample)

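This implementation uses the random.sample function, in the hope of speeding things up. Unfortunately, it doesn't address the underlying bottleneck of generating a list of length population.

EDIT: By mistake, the first code example returned marbles, which made this question ambiguous. To be clear, I want the code to return the number of red marbles and the number of green marbles that were "pulled". Sorry for the confusion. I'm leaving the original, erroneous version of pull_marbles in place so that the existing answers don't look invalid.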
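
One way around the list-building bottleneck, as a sketch: in Python 3, random.sample accepts a range object, so no list of length population is ever materialized. The function name pull_marbles3 and the convention that indices below population // 2 stand for red marbles are assumptions made here for illustration.

```python
import random

def pull_marbles3(sample, population=100):
    """Return (red_pulled, green_pulled) without building the bag."""
    assert population % 2 == 0
    assert 0 <= sample <= population
    half = population // 2
    # range() is a lazy sequence, so memory use grows with `sample`,
    # not with `population`
    indices = random.sample(range(population), sample)
    # assumption: indices 0 .. half-1 are red, the rest are green
    red = sum(1 for i in indices if i < half)
    return red, sample - red

if __name__ == "__main__":
    print(pull_marbles3(1000, 10**8))
```

The two returned counts always add up to sample, which matches what the EDIT asks for.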

The chosen list seems unnecessary. Try this instead:
import random

def pull_marbles(sample, population=100):
    assert population % 2 == 0
    marbles = [x < population / 2 for x in range(0, population)]
    total_chosen = 0  # number of marbles pulled; always ends up == sample, included for clarity
    true_chosen = 0   # number of pulled marbles that were True
    for i in range(0, sample):
        choice = random.randint(0, population - i - 1)
        if marbles[choice]:
            true_chosen += 1
        total_chosen += 1
        del marbles[choice]
    return true_chosen, total_chosen

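This returns two integers, whose ratio is the fraction of pulled marbles that were True.

This takes time proportional to sample (rather than to population). Although you didn't say so, your code seems to assume that the bag contains an equal number of marbles of each color. The code here does the same, but could easily be adapted to some other assumption: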
def pull_marbles(sample, population=100):
    from random import random
    assert population % 2 == 0
    chosen = []
    nTrue = population / 2.0    # True marbles still in the bag
    nTotal = float(population)  # all marbles still in the bag
    for _ in range(sample):
        # a True marble is pulled with probability nTrue / nTotal
        if random() < nTrue / nTotal:
            chosen.append(True)
            nTrue -= 1.0
        else:
            chosen.append(False)
        nTotal -= 1.0
    return chosen
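
As a sketch of how this could be adapted to other assumptions, the variant below takes the number of red and green marbles as separate arguments and returns counts instead of a list. The name pull_counts and its parameters are hypothetical, and integer arithmetic replaces the floats.

```python
import random

def pull_counts(sample, red, green):
    """Return (red_pulled, green_pulled) for a bag with `red` + `green` marbles."""
    assert red >= 0 and green >= 0
    assert 0 <= sample <= red + green
    red_pulled = 0
    for _ in range(sample):
        # randrange(red + green) is uniform over the marbles left in the bag;
        # the first `red` positions are treated as red marbles
        if random.randrange(red + green) < red:
            red -= 1
            red_pulled += 1
        else:
            green -= 1
    return red_pulled, sample - red_pulled

if __name__ == "__main__":
    print(pull_counts(1000, 30_000_000, 70_000_000))
```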

You don't need a whole list...
Just pick at random from your variables, with a probability that matches the theoretical list.
As an aside:
marbles = [x < population / 2 for x in range(population)]  # SLOW
# takes 69 us with a population of 1k
# raises MemoryError with a population of 10^8 (2.5 seconds for 1/8th of the 10^8 population)
marbles = [False] * (population // 2) + [True] * (population // 2)  # much FASTER!!!
# takes 8.6 us with a population of 1k
# takes 272 ms for half the list, so about 544 ms total
marbles = [True, False] * (population // 2)  # fastest ...
# 2.19 us with a population of 1k
# 329 ms with a population of 10^8
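
Timings like these depend on the machine and the Python version. A small harness along the following lines, using the standard timeit module, could be used to repeat the comparison. The three function names and the population size are illustrative assumptions, and no particular numbers are implied.

```python
import timeit

def build_comprehension(population):
    # one comparison per element
    return [x < population / 2 for x in range(population)]

def build_concatenated(population):
    # two list repetitions joined together
    return [False] * (population // 2) + [True] * (population // 2)

def build_alternating(population):
    # a two-element pattern repeated
    return [True, False] * (population // 2)

if __name__ == "__main__":
    population = 1000
    runs = 1000
    for build in (build_comprehension, build_concatenated, build_alternating):
        total = timeit.timeit(lambda: build(population), number=runs)
        print(build.__name__, total / runs, "seconds per call")
```

All three produce a list with the same number of True and False entries; only the order of the entries differs, which does not matter when the index is chosen uniformly at random.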

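Instead of using a list to represent your bag, just use two integers that count the red and green marbles. Each pull is then done by checking whether a random number in the range (0..red+green) is less than red. If it is, a red marble was pulled, so decrement red; otherwise a green one was pulled, so decrement green.

This way you will have to perform all the pulls iteratively, but I guess that shouldn't be a problem. There may be some optimization I can't think of right now that would get a large number of pulls without iterating.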
import random

def pull_marbles(sample, population=100):
    red = population // 2
    green = (population + 1) // 2  # round up just to ensure red+green == population
    for i in range(sample):
        choice = random.randint(1, red + green)
        if choice <= red:  # red pulled
            red -= 1
        else:  # green pulled
            green -= 1
    return (red, green)  # marbles remaining in the bag

My two bits, similar to the others. Calculate the probability of choosing each color, then compare it to a random number, accumulating the choices.
import random
from operator import itemgetter

# accessors: for a (color, probability) pair, and for the sorted two-item list
least_probable = color = itemgetter(0)
most_probable = probability = itemgetter(1)

def select(pop, samp):
    assert pop % 2 == 0 and samp < pop
    choices = (random.random() for _ in range(samp))
##    choices = (random.uniform(0.0, 1.0) for _ in range(samp))
##    choices = (random.triangular() for _ in range(samp))
    num_red = num_green = 0
    total_red = total_green = pop / 2.0
    for choice in choices:
        p_red = total_red / pop
        p_green = total_green / pop
        marbles = [('RED', p_red), ('GREEN', p_green)]
        marbles.sort(key=probability)
        # the less likely color is picked with exactly its own probability
        if choice <= probability(least_probable(marbles)):
            marble = color(least_probable(marbles))
        else:
            marble = color(most_probable(marbles))
        if marble == 'RED':  # compare by value, not identity
            num_red += 1
            total_red -= 1
        else:
            num_green += 1
            total_green -= 1
        pop -= 1
##        print(marbles, choice, marble)
    return ('RED', num_red), ('GREEN', num_green)

for thing in (select(100000000, 1000) for _ in range(20)):
    print(thing)

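This question appears to be off-topic because it is better suited to …
@senshin Uh, I'm not too familiar with the statistics behind this problem (I was asked to do this for a friend who wants to teach kids statistics). Could you elaborate on what you mean?
@JaneDoe Ah, sorry. On second thought, the idea I had in mind isn't useful here. Rejection sampling on discrete distributions is apparently tricky, so yeah, never mind me.
If the bottleneck really is …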