Python random sample generator (comfortable with huge population sizes)


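As you might know, random.sample(population, sample_size) quickly returns a random sample, but what if you don't know the size of the sample in advance? You end up sampling the entire population, or shuffling it, which amounts to the same thing. That can be wasteful (if most sample sizes are small compared with the population size) or even infeasible (if the population is huge and you run out of memory). Also, what if your code needs to jump from here to there before picking the next element of the sample?

P.S. I ran into the need to optimize random sampling while working for a client. In my code, sampling is restarted hundreds of thousands of times, and each time I don't know whether I will need to pick 1 element or 100% of the elements of the population.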

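First, I would split the population into blocks. The function that samples a block can easily be written as a generator, which lets it handle samples of arbitrary size.

Imagine an infinite population, a population block of 512 and a sample size of 8. You can gather as many samples as you need and, for a later reduction, sample the already sampled space again (1024 blocks give you 8192 samples from which you can sample again).

At the same time, this allows for parallel processing, which may be worthwhile for very large samples.

Example considering an in-memory population: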

import random

population = [random.randint(0, 1000) for _ in range(150000)]

def sample_block(population, block_size, sample_size):
    """Yield a random sample of sample_size elements from each consecutive block."""
    block_number = 0
    while True:
        block = population[block_number * block_size:(block_number + 1) * block_size]
        if len(block) < sample_size:
            # the population is exhausted (or the tail is too short to sample)
            break
        yield random.sample(block, sample_size)
        block_number += 1

sampler = sample_block(population, 512, 8)
samples = []

# a for loop stops by itself when the generator is exhausted
for block_sample in sampler:
    samples.extend(block_sample)

print(random.sample(samples, 200))

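If the population lives outside the script (file, block), the only modification is that you have to load the appropriate chunk into memory. Proof of concept of how sampling an infinite population could look: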

import random
import time

def population():
    """Infinite stream of random integers."""
    while True:
        yield random.randint(0, 10000)

def reduced_population(samples):
    for sample in samples:
        yield sample

def sample_block(generator, block_size, sample_size):
    """Collect block_size items at a time and yield a random sample of each full block."""
    block_number = 0
    block = []
    for item in generator:  # ends cleanly when the source is exhausted
        block.append(item)
        if len(block) == block_size:
            s = random.sample(block, sample_size)
            block_number += 1
            block = []
            print('Sampled block {} with result {}.'.format(block_number, s))
            yield s

samples = []
result = []
reducer = sample_block(population(), 512, 12)

# The population is infinite, so this loop runs until it is interrupted.
for block_sample in reducer:
    samples.append(block_sample)
    if len(samples) == 1000:
        # second-level reduction: sample 15 of the 1000 collected block samples
        sampler = sample_block(reduced_population(samples), 1000, 15)
        result.append(list(sampler))
        samples = []  # start collecting the next batch
        time.sleep(5)

Ideally, you would also gather the samples and sample them again.
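The remark above about a population that lives outside the script can be sketched as follows. This is only an illustration under assumptions: the name sample_stream_blocks and the fake records are hypothetical, and any iterable (an open file, a database cursor) could take the place of the generator used here. Only one block is held in memory at a time.

```python
import random
from itertools import islice

def sample_stream_blocks(records, block_size, sample_size):
    """Yield a random sample from each consecutive block of an iterable.

    Only one block is held in memory at a time, so `records` can be a
    file object or any other (possibly huge) iterable.
    """
    iterator = iter(records)
    while True:
        block = list(islice(iterator, block_size))  # load the next chunk only
        if not block or len(block) < sample_size:
            return  # source exhausted (or the tail is too short to sample)
        yield random.sample(block, sample_size)

# Usage sketch: a generator stands in for an external file (hypothetical data).
records = ("record-%d" % i for i in range(2000))
samples = []
for block_sample in sample_stream_blocks(records, 512, 8):
    samples.extend(block_sample)
print(len(samples))  # 3 full blocks + 1 partial block of 464 records -> 32
```

As in the in-memory example, a final partial block is still sampled as long as it holds at least sample_size records.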

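I wrote (in Python 2.7.9) a random sampler generator (of indexes) whose speed depends only on the sample size (it should be O(ns log(ns)), where ns is the sample size). So it is fast when the sample size is small compared with the population size, because it does NOT depend on the population size at all. It doesn't build any population collection; it just picks random indexes and uses a kind of bisect method on the sampled indexes to avoid duplicates and keep them sorted. Given an iterable population, here is how to use the itersample generator: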

import random
sampler = itersample(len(population))
next_pick = next(sampler)  # pick the next random (index of) element

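or, alternatively, by iterating over the sampler and taking each index in turn.

If you need the actual elements and not just the indexes, apply the population iterable to the index when needed (population[next(sampler)] and population[index], respectively, for the first and second example).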

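Results of some tests show that the speed does NOT depend on the population size, so if you need to randomly pick only 10 elements from a population of 100 billion, you pay only for 10 (remember, we don't know in advance how many elements we'll pick; otherwise you'd be better off using random.sample).

Other tests confirm that the running time grows slightly faster than linearly with the sample size: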

Sampling 100 from 1000000000
Using itersample 0.0018 s

Sampling 1000 from 1000000000
Using itersample 0.0294 s

Sampling 10000 from 1000000000
Using itersample 0.4438 s

Sampling 100000 from 1000000000
Using itersample 8.8739 s

Finally, here is the generator function itersample:

import random

def itersample(c):  # c: population size
    sampled = []  # indexes picked so far, kept sorted

    def fsb(a, b):  # free spaces before middle of interval a,b
        fsb.idx = a + (b + 1 - a) // 2  # integer division, so idx stays an int
        fsb.last = sampled[fsb.idx] - fsb.idx if len(sampled) > 0 else 0
        return fsb.last

    while len(sampled) < c:
        # rank of the new pick among the indexes that are still free
        sample_index = random.randrange(c - len(sampled))
        a, b = 0, len(sampled) - 1
        if fsb(a, a) > sample_index:  # falls before the first sampled index
            yielding = sample_index
            sampled.insert(0, yielding)
            yield yielding
        elif fsb(b, b) < sample_index + 1:  # falls after the last sampled index
            yielding = len(sampled) + sample_index
            sampled.insert(len(sampled), yielding)
            yield yielding
        else:  # sample_index falls inside sampled list
            while a + 1 < b:
                if fsb(a, b) < sample_index + 1:
                    a = fsb.idx
                else:
                    b = fsb.idx
            yielding = a + 1 + sample_index
            sampled.insert(a + 1, yielding)
            yield yielding
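For readers who find the three-branch search above hard to follow, the same idea can be restated with a single textbook binary search. This is a sketch and not the author's code: the name itersample_bisect is hypothetical, and it relies on the observation that sampled[m] - m counts the free indexes below the m-th sampled index.

```python
import random
from itertools import islice

def itersample_bisect(c):
    """Lazily yield distinct random indexes in range(c) (illustrative variant)."""
    sampled = []  # indexes picked so far, kept sorted
    while len(sampled) < c:
        # rank of the new pick among the indexes that are still free
        rank = random.randrange(c - len(sampled))
        # binary search for how many sampled indexes precede that free slot:
        # sampled[m] - m is the number of free indexes below sampled[m]
        lo, hi = 0, len(sampled)
        while lo < hi:
            mid = (lo + hi) // 2
            if sampled[mid] - mid <= rank:
                lo = mid + 1
            else:
                hi = mid
        index = rank + lo
        sampled.insert(lo, index)
        yield index

# Usage sketch: pick lazily from a huge index space and stop whenever we like.
picks = list(islice(itersample_bisect(10**11), 10))
print(picks)
```

To get elements rather than indexes, index the population with each yielded value, for example population[index] inside a for loop that breaks once enough elements have been collected.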

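I believe this is what generators are for. Here is an example of Fisher-Yates-Knuth sampling via a generator/yield: you can get the events one by one and stop whenever you want.
Code updated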
import random
import array

class populationFYK(object):
    """
    Implementation of the Fisher-Yates-Knuth shuffle
    """
    def __init__(self, population):
        self._population = population      # reference to the population
        self._length     = len(population) # lengths of the sequence
        self._index      = len(population)-1 # last unsampled index
        self._popidx     = array.array('i', range(0,self._length))

        # alternative using numpy instead of the array module (requires: import numpy)
        #self._popidx     = numpy.empty(self._length, dtype=numpy.int32)
        #for k in range(0,self._length):
        #    self._popidx[k] = k


    def swap(self, idx_a, idx_b):
        """
        Swap two elements in population
        """
        temp = self._popidx[idx_a]
        self._popidx[idx_a] = self._popidx[idx_b]
        self._popidx[idx_b] = temp

    def sample(self):
        """
        Yield one sampled case from population
        """
        while self._index >= 0:
            idx = random.randint(0, self._index) # index of the sampled event

            if idx != self._index:
                self.swap(idx, self._index)

            sampled = self._population[self._popidx[self._index]] # yielding it

            self._index -= 1 # one less to be sampled

            yield sampled

    def index(self):
        return self._index

    def restart(self):
        self._index = self._length - 1
        for k in range(0,self._length):
            self._popidx[k] = k

if __name__=="__main__":
    population = [1,3,6,8,9,3,2]

    gen = populationFYK(population)

    for k in gen.sample():
        print(k)
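The class above allocates an index array as large as the population. For a huge population, a sparse variant of the same shuffle can keep only the positions that have actually been moved. The sketch below is a swapped-in illustration, not part of the answer's code: the function name sparse_fyk_indexes is hypothetical, and a dict replaces the index array so that memory grows with the number of picks rather than with the population size.

```python
import random
from itertools import islice

def sparse_fyk_indexes(n):
    """Lazily yield the indexes 0..n-1 in random order (sparse Fisher-Yates)."""
    moved = {}  # position -> index currently stored there, if not the identity
    for last in range(n - 1, -1, -1):
        j = random.randint(0, last)  # position of the sampled index
        picked = moved.get(j, j)
        # move the index stored at `last` into the freed slot j,
        # then drop slot `last`, which is now outside the unsampled range
        moved[j] = moved.get(last, last)
        moved.pop(last, None)
        yield picked

# Usage sketch: take three elements from a small population, then stop.
population = [1, 3, 6, 8, 9, 3, 2]
for i in islice(sparse_fyk_indexes(len(population)), 3):
    print(population[i])
```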

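You can get a sample of size K out of a population of size N by picking K non-repeating random numbers in the range [0, N) and treating them as indexes.
Option a)
You could generate such an index sample using the well-known sample method:
random.sample(xrange(N), K)  # Python 2; use range(N) on Python 3

From the Python documentation for random.sample:
To choose a sample from a range of integers, use an xrange() object as an argument. This is especially fast and space efficient for sampling from a large population.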
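A minimal Python 3 illustration of option a), with hypothetical values for N and K. It shows that the index space is never materialized, because range(N) is a lazy sequence.

```python
import random

N = 10**9   # population size (hypothetical)
K = 5       # sample size, which must be known in advance

# range(N) is lazy in Python 3, so no list of N integers is ever built
indexes = random.sample(range(N), K)
print(indexes)
```

The returned list holds K distinct indexes, which is exactly the limitation that option b) tries to remove: K has to be fixed before sampling starts.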

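Option b)
If you don't like that random.sample already returns a list instead of a lazy generator of non-repeating random numbers, you can try encrypting a counter.
This way, you get a real generator for random indexes: you can pick as many as you want and stop at any time without getting any duplicates, which gives you dynamically sized sample sets.
The idea is to construct an encryption scheme that encrypts the numbers from 0 to N. Each time you want to draw a sample from your population, you pick a random key for the encryption and start encrypting the numbers 0, 1, 2, ... onwards (this is the counter). Since every good encryption creates a random-looking 1:1 mapping, you end up with non-repeating random integers that you can use as indexes.
The storage requirements during this lazy generation are just the initial key plus the current value of the counter.
This idea has already been discussed elsewhere, and a Python snippet implementing it is linked from that discussion.
Sample code using this snippet could look like this: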
import random
import string
# FPEInteger comes from the linked snippet; it is not part of the standard library

def itersample(population):
    # Get the size of the population
    N = len(population)
    # Get the number of bits needed to represent this number
    bits = (N-1).bit_length()
    # Generate some random key
    key = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(32))
    # Create a new crypto instance that encrypts binary blocks of width <bits>
    # Thus, being able to encrypt all numbers up to the nearest power of two
    crypter = FPEInteger(key=key, radix=2, width=bits)

    # Count up (range is lazy on Python 3; use xrange on Python 2)
    for i in range(1<<bits):
        # Encrypt the current counter value
        x = crypter.encrypt(i)
        # If it is bigger than our population size, just skip it
        # Since we generate numbers up to the nearest power of 2,
        # we have to skip up to half of them, and on average up to one at a time
        if x < N:
            # Return the randomly chosen element
            yield population[x]
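Since the linked snippet is not reproduced here, the encrypted-counter idea can be tried out with a self-contained stand-in. The sketch below is an assumption-laden illustration, not the FPEInteger implementation: it builds a small balanced Feistel network from SHA-256 (names such as encrypted_counter_indexes and round_value are hypothetical), which gives a keyed 1:1 mapping on a power-of-two domain, and it skips values that fall outside the population. It has not been vetted as a cipher and is only meant to show the mechanics.

```python
import hashlib
import os
from itertools import islice

def encrypted_counter_indexes(n, rounds=4):
    """Lazily yield every index in range(n) exactly once, in a keyed pseudo-random order."""
    key = os.urandom(16)  # fresh random key for each sampling run
    half = max(1, ((n - 1).bit_length() + 1) // 2)  # bits per Feistel half
    mask = (1 << half) - 1
    nbytes = (half + 7) // 8

    def round_value(value, round_no):
        # keyed hash of one half-block, truncated to `half` bits
        data = key + bytes([round_no]) + value.to_bytes(nbytes, "big")
        return int.from_bytes(hashlib.sha256(data).digest(), "big") & mask

    def encrypt(x):
        # balanced Feistel network: a bijection on the 2*half-bit integers
        left, right = x >> half, x & mask
        for round_no in range(rounds):
            left, right = right, left ^ round_value(right, round_no)
        return (left << half) | right

    # the only state kept between picks is the key and the counter
    for counter in range(1 << (2 * half)):
        x = encrypt(counter)
        if x < n:  # values outside the population are skipped
            yield x

# Usage sketch: draw three distinct elements lazily, then stop.
population = list("abcdefghij")
print([population[i] for i in islice(encrypted_counter_indexes(len(population)), 3)])
```

Because the Feistel domain is rounded up to an even number of bits, up to roughly three out of four counter values may be skipped in the worst case, slightly more than with the bit-exact width used in the code above.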

Here is another idea. For a huge population, we would like to keep some information about the records already selected. In your ca