
Here is a simple solution for picking one item, which you can easily extend to k items (Java style):

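// Assumes the weights returned by getValue() are normalized to sum to 1.
double random = Math.random();
double sum = 0;
for (int i = 0; i < items.length; i++) {
    val = items[i];
    sum += val.getValue();
    if (sum > random) {
        selected = val;
        break;
    }
}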
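
The same cumulative-sum idea can be sketched in Python. This is only an illustration, not the answer's own code: the name `pick_one` and the `(element, weight)` input format are assumptions, and the random threshold is scaled by the total weight so the weights do not have to sum to 1.

```python
import random

def pick_one(items, rng=random):
    """Pick one element from (element, weight) pairs (hypothetical helper).

    Weights must be nonnegative and at least one must be positive.
    """
    total = sum(weight for _, weight in items)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    threshold = rng.random() * total  # uniform in [0, total)
    running = 0.0
    for element, weight in items:
        running += weight
        if running > threshold:
            return element
    # Floating-point rounding can leave `running` marginally below the
    # threshold; fall back to the last element with a positive weight.
    for element, weight in reversed(items):
        if weight > 0:
            return element
```

To draw k items without replacement with this approach, one would remove the chosen pair from `items` and repeat, which costs O(n) per draw.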

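If you want to pick x elements from a weighted set without replacement, such that elements are chosen with a probability proportional to their weights: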

import random

def weighted_choose_subset(weighted_set, count):
    """Return a random sample of count elements from a weighted set.

    weighted_set should be a sequence of tuples of the form 
    (item, weight), for example:  [('a', 1), ('b', 2), ('c', 3)]

    Each element from weighted_set shows up at most once in the
    result, and the relative likelihood of two particular elements
    showing up is equal to the ratio of their weights.

    This works as follows:

    1.) Line up the items along the number line from [0, the sum
    of all weights) such that each item occupies a segment of
    length equal to its weight.

    2.) Randomly pick a number "start" in the range [0, total
    weight / count).

    3.) Find all the points "start + n * (total weight / count)" (for
    all integers n such that the point is within our segments) and
    yield the set containing the items marked by those points.

    Note that this implementation may not return each possible
    subset.  For example, with the input ([('a', 1), ('b', 1),
    ('c', 1), ('d', 1)], 2), it may only produce the sets ['a',
    'c'] and ['b', 'd'], but it will do so such that the weights
    are respected.

    This implementation only works for nonnegative integral
    weights.  The highest weight in the input set must not exceed
    the total weight divided by the count; otherwise it would
    be impossible to respect the weights while never returning
    that element more than once per invocation.
    """
    if count == 0:
        return []

    total_weight = 0
    max_weight = 0
    borders = []
    for item, weight in weighted_set:
        if weight < 0:
            raise RuntimeError("All weights must be nonnegative integers")
        # Scale up weights so dividing total_weight / count doesn't truncate:
        weight *= count
        total_weight += weight
        borders.append(total_weight)
        max_weight = max(max_weight, weight)

    # Integer division is exact here because every weight was scaled by count.
    step = total_weight // count

    if max_weight > step:
        raise RuntimeError(
            "Each weight must not exceed total weight / count")

    next_stop = random.randint(0, step - 1)

    results = []
    current = 0
    for i in range(count):
        while borders[current] <= next_stop:
            current += 1
        results.append(weighted_set[current][0])
        next_stop += step

    return results
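I know this is a very old question, but I think there is a neat trick to do this in O(n) time if you apply a little math.

The exponential distribution has two very useful properties:

  • Given n samples from different exponential distributions with different rate parameters, the probability that a given sample is the minimum is equal to its rate parameter divided by the sum of all rate parameters.

  • It is "memoryless". So, if you already know the minimum, the probability that any of the remaining elements is the second smallest is the same as the probability that it would be the new minimum if the true minimum were removed (and had never been generated). This seems obvious, but I think that, because of some conditional probability issues, it might not hold for other distributions.

Using fact 1, we know that a single element can be chosen by generating, for each element, an exponential sample whose rate parameter equals the element's weight, and then picking the element with the smallest sample.

Using fact 2, we know that we don't have to regenerate the exponential samples. Instead, generate just one per element and take the k elements with the smallest samples.

Finding the lowest k can be done in O(n). Use a selection algorithm such as quickselect (expected linear time) to find the k-th smallest element, then simply make another pass over all elements and output those that are lower than the k-th.

A useful tip: if you don't have immediate access to a library that generates exponential samples, it can easily be done with:
-ln(rand())/weight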
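
The trick described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `weighted_sample_exponential` and the `(element, weight)` input format are made up for the example, all weights are assumed to be strictly positive, and `heapq.nsmallest` is used in place of a linear-time selection, so this version runs in O(n log k) rather than O(n).

```python
import heapq
import math
import random

def weighted_sample_exponential(items, k, rng=random):
    """Sample k distinct elements from (element, weight) pairs.

    Each element gets one exponential key with rate equal to its weight;
    the k elements with the smallest keys form the sample.
    """
    keyed = []
    for element, weight in items:
        # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
        key = -math.log(1.0 - rng.random()) / weight
        keyed.append((key, element))
    smallest = heapq.nsmallest(k, keyed, key=lambda pair: pair[0])
    return [element for _, element in smallest]
```

The returned list is ordered by key, so its first entry is distributed like a single weighted pick and each later entry like a weighted pick among the elements not yet chosen.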

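I have implemented, in Rust, an algorithm similar to Jason Orendorff's idea. My version also supports bulk operations: insertion and deletion (for when you want to remove a set of items, given by their IDs, from the data structure rather than via the weighted selection path) in O(m + log n) time, where m is the number of items to remove and n is the number of items stored.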

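Sampling without replacement with recursion: an elegant and very short solution in C#

    // The number of ways we can choose 4 out of 60 students, so that each choice is a different group of 4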

    class Program
    {
        static void Main(string[] args)
        {
            int group = 60;
            int studentsToChoose = 4;
    
            Console.WriteLine(FindNumberOfStudents(studentsToChoose, group));
        }
    
        private static int FindNumberOfStudents(int studentsToChoose, int group)
        {
            if (studentsToChoose == group || studentsToChoose == 0)
                return 1;
    
            return FindNumberOfStudents(studentsToChoose, group - 1) + FindNumberOfStudents(studentsToChoose - 1, group - 1);
    
        }
    }
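
Note that the recursion in the C# snippet is Pascal's rule, so it counts the possible groups rather than drawing one. A memoized Python sketch of the same recursion (the name `count_choices` is only illustrative) can be cross-checked against the standard library's `math.comb`:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_choices(k, n):
    """Number of ways to choose k items out of n, via Pascal's rule."""
    if k < 0 or k > n:
        return 0  # guard against arguments the C# version does not handle
    if k == 0 or k == n:
        return 1
    return count_choices(k, n - 1) + count_choices(k - 1, n - 1)

# Same inputs as the C# example: 4 students out of a group of 60.
assert count_choices(4, 60) == comb(60, 4)
```

Memoization keeps the number of distinct recursive calls at roughly k * n, whereas the plain recursion performs a number of calls proportional to the result itself.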
    

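I just spent a few hours trying to understand the algorithms behind sampling without replacement, and this topic is more complex than I initially thought. That's exciting! For the benefit of future readers (have a nice day!) I document my insights here, including a ready-to-use function, further below, that respects given inclusion probabilities. A nice and quick mathematical overview of the various methods can be found in the Eustat document on unequal probability sampling cited in the docstring of sample_no_replacement_exact below. For example, Jason's method can be found on page 46. The caveat with his method is that the weights are not proportional to the inclusion probabilities, as also noted in the document. In fact, the i-th inclusion probability can be computed recursively as follows: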

    def inclusion_probability(i, weights, k):
        """
            Computes the inclusion probability of the i-th element
            in a randomly sampled k-tuple using Jason's algorithm
            (see https://stackoverflow.com/a/2149533/7729124)
        """
        if k <= 0: return 0
        cum_p = 0
        for j, weight in enumerate(weights):
            # compute the probability of j being selected considering the weights
            p = weight / sum(weights)
    
            if i == j:
                # if this is the target element, we don't have to go deeper,
                # since we know that i is included
                cum_p += p
            else:
                # if this is not the target element, then we compute the conditional
                # inclusion probability of i under the constraint that j is included
                cond_i = i if i < j else i-1
                cond_weights = weights[:j] + weights[j+1:]
                cond_p = inclusion_probability(cond_i, cond_weights, k-1)
                cum_p += p * cond_p
        return cum_p
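
As a cross-check of the recursion above, the inclusion probabilities can also be computed by brute force. The sketch below assumes that the sampler draws one element at a time with probability proportional to the remaining weights, and that the weights are integers (so that exact fractions can be used); the function name is hypothetical. It enumerates every ordered draw of k elements and sums the probabilities of the draws that contain each element.

```python
from fractions import Fraction
from itertools import permutations

def inclusion_probabilities_by_enumeration(weights, k):
    """Exact inclusion probabilities for successive weighted draws.

    Assumes integer weights; the cost grows like n! / (n - k)!, so this
    is only meant for checking small examples.
    """
    n = len(weights)
    result = [Fraction(0)] * n
    for order in permutations(range(n), k):
        prob = Fraction(1)
        remaining = sum(weights)
        for idx in order:
            # Probability of drawing idx next among the elements still left.
            prob *= Fraction(weights[idx], remaining)
            remaining -= weights[idx]
        for idx in order:
            result[idx] += prob
    return result
```

For weights [1, 2, 3] and k = 2 this gives 5/12, 11/15 and 17/20, which sum to k = 2 and are clearly not in the ratio 1 : 2 : 3.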
    

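One way to specify the inclusion probabilities (also suggested in the document above) is to compute the weights from them. The whole complexity of the problem at hand stems from the fact that one cannot do this directly, since one basically has to invert the recursive formula; I claim that this is impossible to do symbolically.
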
    import random
    
    class Node:
        # Each node in the heap has a weight, value, and total weight.
        # The total weight, self.tw, is self.w plus the weight of any children.
        __slots__ = ['w', 'v', 'tw']
        def __init__(self, w, v, tw):
            self.w, self.v, self.tw = w, v, tw
    
    def rws_heap(items):
        # h is the heap. It's like a binary tree that lives in an array.
        # It has a Node for each pair in `items`. h[1] is the root. Each
        # other Node h[i] has a parent at h[i>>1]. Each node has up to 2
        # children, h[i<<1] and h[(i<<1)+1].  To get this nice simple
        # arithmetic, we have to leave h[0] vacant.
        h = [None]                          # leave h[0] vacant
        for w, v in items:
            h.append(Node(w, v, w))
        for i in range(len(h) - 1, 1, -1):  # total up the tws
            h[i>>1].tw += h[i].tw           # add h[i]'s total to its parent
        return h
    
    def rws_heap_pop(h):
        gas = h[1].tw * random.random()     # start with a random amount of gas
    
        i = 1                     # start driving at the root
        while gas >= h[i].w:      # while we have enough gas to get past node i:
            gas -= h[i].w         #   drive past node i
            i <<= 1               #   move to first child
            if gas >= h[i].tw:    #   if we have enough gas:
                gas -= h[i].tw    #     drive past first child and descendants
                i += 1            #     move to second child
        w = h[i].w                # out of gas! h[i] is the selected node.
        v = h[i].v
    
        h[i].w = 0                # make sure this node isn't chosen again
        while i:                  # fix up total weights
            h[i].tw -= w
            i >>= 1
        return v
    
    def random_weighted_sample_no_replacement(items, n):
        heap = rws_heap(items)              # just make a heap...
        for i in range(n):
            yield rws_heap_pop(heap)        # and pop n items off it.
    
    import numpy
    import scipy.interpolate
    
    def weighted_randint(weights, size=None):
        """Given an n-element vector of weights, randomly sample
        integers up to n with probabilities proportional to weights"""
        n = weights.size
        # normalize so that the weights sum to unity
        weights = weights / numpy.linalg.norm(weights, 1)
        # cumulative sum of weights
        cumulative_weights = weights.cumsum()
        # piecewise-linear interpolating function whose domain is
        # the unit interval and whose range is the integers up to n
        f = scipy.interpolate.interp1d(
                    # breakpoints must be the cumulative weights, starting at 0
                    numpy.hstack((0.0, cumulative_weights)),
                numpy.arange(n + 1), kind='linear')
        return f(numpy.random.random(size=size)).astype(int)
    
    require 'pickup'
    pond = {
      "selmon"  => 1,
      "carp" => 4,
      "crucian"  => 3,
      "herring" => 6,
      "sturgeon" => 8,
      "gudgeon" => 10,
      "minnow" => 20
    }
    pickup = Pickup.new(pond, uniq: true)
    pickup.pick(3)
    #=> [ "gudgeon", "herring", "minnow" ]
    pickup.pick
    #=> "herring"
    pickup.pick
    #=> "gudgeon"
    pickup.pick
    #=> "sturgeon"
    
    package foo
    
    import (
        "log"
        "math/rand"
    )
    
    type server struct {
        Weight int
        data   interface{}
    }
    
    func foo(servers []server) []server {
        // servers list is already sorted by the Weight attribute
    
        // number of items to pick
        max := 4
    
        result := make([]server, max)
    
        sum := 0
        for _, r := range servers {
            sum += r.Weight
        }
    
        for si := 0; si < max; si++ {
            n := rand.Intn(sum + 1)
            s := 0
    
            for i := range servers {
                s += int(servers[i].Weight)
                if s >= n {
                    log.Println("Picked record", i, servers[i])
                    sum -= servers[i].Weight
                    result[si] = servers[i]
    
                    // remove the server from the list
                    servers = append(servers[:i], servers[i+1:]...)
                    break
                }
            }
        }
    
        return result
    }
    
    function getNrandomGuysWithWeight($numitems){
      $q = db_query('SELECT id, weight FROM theTableWithTheData');
      $q = $q->fetchAll();
    
      $accum = 0;
      foreach($q as $r){
        $accum += $r->weight;
        $r->weight = $accum;
      }
    
      $out = array();
    
      while(count($out) < $numitems && count($q)){
        $n = rand(0,$accum);
        $lessaccum = NULL;
        $prevaccum = 0;
        $idxrm = 0;
        foreach($q as $i=>$r){
          if(($lessaccum == NULL) && ($n <= $r->weight)){
            $out[] = $r->id;
            $lessaccum = $r->weight- $prevaccum;
            $accum -= $lessaccum;
            $idxrm = $i;
          }else if($lessaccum){
            $r->weight -= $lessaccum;
          }
          $prevaccum = $r->weight;
        }
        unset($q[$idxrm]);
      }
      return $out;
    }
    
    double random = Math.random();
    double sum = 0;
    for (int i = 0; i < items.length; i++) {
        val = items[i];
        sum += val.getValue();
        if (sum > random) {
            selected = val;
            break;
        }
    }
    
    In : for i in range(3): print(i, inclusion_probability(i, [1,2,3], 2))
    0 0.41666666666666663
    1 0.7333333333333333
    2 0.85
    
    In : import collections, itertools
    In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
    In : sample_tester(lambda: random_weighted_sample_no_replacement([(1,'a'),(2,'b'),(3,'c')],2))
    Out: Counter({'a': 4198, 'b': 7268, 'c': 8534})
    
    def sample_no_replacement_exact(items, k, best_effort=False, random_=None, ε=1e-9):
        """
            Returns a random sample of k elements from items, where items is a list of
            tuples (weight, element). The inclusion probability of an element in the
            final sample is given by
               k * weight / sum(weights).
    
            Note that the function raises if an inclusion probability cannot be
            satisfied, e.g. the following call is obviously illegal:
               sample_no_replacement_exact([(1,'a'),(2,'b')],2)
            Since selecting two elements means selecting both all the time,
            'b' cannot be selected twice as often as 'a'. In general it can be hard to
            spot if the weights are illegal and the function does *not* always raise
            an exception in that case. To remedy the situation you can pass
            best_effort=True which redistributes the inclusion probability mass
            if necessary. Note that the inclusion probabilities will change
            if deemed necessary.
    
            The algorithm is based on the splitting procedure on page 75/76 in:
            http://www.eustat.eus/productosServicios/52.1_Unequal_prob_sampling.pdf
            Additional information can be found here:
            https://stackoverflow.com/questions/2140787/
    
            :param items: list of tuples of type weight,element
            :param k: length of resulting sample
            :param best_effort: fix inclusion probabilities if necessary,
                                (optional, defaults to False)
            :param random_: random module to use (optional, defaults to the
                            standard random module)
            :param ε: fuzziness parameter when testing for zero in the context
                      of floating point arithmetic (optional, defaults to 1e-9)
            :return: random sample set of size k
            :exception: throws ValueError in case of bad parameters,
                        throws AssertionError in case of algorithmic impossibilities
        """
        # random_ defaults to the random submodule
        if not random_:
            random_ = random
    
        # special case empty return set
        if k <= 0:
            return set()
    
        if k > len(items):
            raise ValueError("resulting tuple length exceeds number of elements (k > n)")
    
        # sort items by weight
        items = sorted(items, key=lambda item: item[0])
    
        # extract the weights and elements
        weights, elements = list(zip(*items))
    
        # compute the inclusion probabilities (short: π) of the elements
        scaling_factor = k / sum(weights)
        π = [scaling_factor * weight for weight in weights]
    
        # in case of best_effort: if an inclusion probability exceeds 1,
        # try to rebalance the probabilities such that:
        # a) no probability exceeds 1,
        # b) the probabilities still sum to k, and
        # c) the probability masses flow from top to bottom:
        #    [0.2, 0.3, 1.5] -> [0.2, 0.8, 1]
        # (remember that π is sorted)
        if best_effort and π[-1] > 1 + ε:
            # probability mass we still have to distribute
            debt = 0.
            for i in reversed(range(len(π))):
                if π[i] > 1.:
                    # an 'offender', take away excess
                    debt += π[i] - 1.
                    π[i] = 1.
                else:
                    # case π[i] <= 1, i.e. a 'safe' element
                    # maximum we can transfer from debt to π[i] and still not
                    # exceed 1 is computed by the minimum of:
                    # a) 1 - π[i], and
                    # b) debt
                    max_transfer = min(debt, 1. - π[i])
                    debt -= max_transfer
                    π[i] += max_transfer
            assert debt < ε, "best effort rebalancing failed (impossible)"
    
        # make sure we are talking about probabilities
        if any(not (0 - ε <= π_i <= 1 + ε) for π_i in π):
            raise ValueError("inclusion probabilities not satisfiable: {}" \
                             .format(list(zip(π, elements))))
    
        # special case equal probabilities
        # (up to fuzziness parameter, remember that π is sorted)
        if π[-1] < π[0] + ε:
            return set(random_.sample(elements, k))
    
        # compute the two possible lambda values, see formula 7 on page 75
        # (remember that π is sorted)
        λ1 = π[0] * len(π) / k
        λ2 = (1 - π[-1]) * len(π) / (len(π) - k)
        λ = min(λ1, λ2)
    
        # there are two cases now, see also page 69
        # CASE 1
        # with probability λ we are in the equal probability case
        # where all elements have the same inclusion probability
        if random_.random() < λ:
            return set(random_.sample(elements, k))
    
        # CASE 2:
        # with probability 1-λ we are in the case of a new sample without
        # replacement problem which is strictly simpler,
        # it has the following new probabilities (see page 75, π^{(2)}):
        new_π = [
            (π_i - λ * k / len(π))
            /
            (1 - λ)
            for π_i in π
        ]
        new_items = list(zip(new_π, elements))
    
        # the first few probabilities might be 0, remove them
        # NOTE: we make sure that floating point issues do not arise
        #       by using the fuzziness parameter
        while new_items and new_items[0][0] < ε:
            new_items = new_items[1:]
    
        # the last few probabilities might be 1, remove them and mark them as selected
        # NOTE: we make sure that floating point issues do not arise
        #       by using the fuzziness parameter
        selected_elements = set()
        while new_items and new_items[-1][0] > 1 - ε:
            selected_elements.add(new_items[-1][1])
            new_items = new_items[:-1]
    
        # the algorithm reduces the length of the sample problem,
        # it is guaranteed that:
        # if λ = λ1: the first item has probability 0
        # if λ = λ2: the last item has probability 1
        assert len(new_items) < len(items), "problem was not simplified (impossible)"
    
        # recursive call with the simpler sample problem
        # NOTE: we have to make sure that the selected elements are included
        return sample_no_replacement_exact(
            new_items,
            k - len(selected_elements),
            best_effort=best_effort,
            random_=random_,
            ε=ε
        ) | selected_elements
    
    In : sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c')],2)
    Out: {'b', 'c'}
    
    In : import collections, itertools
    In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
    In : sample_tester(lambda: sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c'),(4,'d')],2))
    Out: Counter({'a': 2048, 'b': 4051, 'c': 5979, 'd': 7922})
    
    In: sample_no_replacement_exact([(1,'a'),(2,'b')],2)
    ValueError: inclusion probabilities not satisfiable: [(0.6666666666666666, 'a'), (1.3333333333333333, 'b')]