C++: Why do people say there is modulo bias when using a random number generator?


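I've seen this question asked a lot but have never seen a truly concrete answer to it. So I am going to post one here, which will hopefully help people understand why exactly there is "modulo bias" when using a random number generator such as rand() in C++.

rand() is a pseudo-random number generator that chooses a natural number between 0 and RAND_MAX, a constant defined in cstdlib (see the rand() documentation for a general overview).

Now what happens if you want to generate a random number between, say, 0 and 2? For the sake of explanation, let's say RAND_MAX is 10 and I decide to generate a random number between 0 and 2 by calling rand() % 3. However, rand() % 3 does not produce the numbers between 0 and 2 with equal probability.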
When rand() returns 0, 3, 6, or 9, rand() % 3 == 0. Therefore, P(0) = 4/11.
When rand() returns 1, 4, 7, or 10, rand() % 3 == 1. Therefore, P(1) = 4/11.
When rand() returns 2, 5, or 8, rand() % 3 == 2. Therefore, P(2) = 3/11.
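To make the counting concrete, here is a minimal sketch that tallies how many generator outputs land on each residue. The function name residue_counts and the rand_max parameter are assumptions made for illustration; the real RAND_MAX is a fixed library constant, so it is passed in here only so that the toy value 10 can be used.

```cpp
#include <cassert>
#include <vector>

// Count how many values in [0, rand_max] map to each residue modulo n.
// rand_max is a stand-in for RAND_MAX (hypothetical parameter).
std::vector<int> residue_counts(int rand_max, int n) {
    assert(n > 0 && rand_max >= 0);
    std::vector<int> counts(n, 0);
    for (int v = 0; v <= rand_max; ++v) {
        ++counts[v % n];
    }
    return counts;
}
```

Calling residue_counts(10, 3) yields the counts 4, 4 and 3 out of 11 equally likely outputs, which matches the probabilities 4/11, 4/11 and 3/11.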
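So the numbers between 0 and 2 are not generated with equal probability. For small ranges this might not be the biggest issue, but for larger ranges it can skew the distribution, biasing it toward the smaller numbers.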
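So when does rand() % n return each number from 0 to n-1 with equal probability? When RAND_MAX % n == n - 1, that is, when the RAND_MAX + 1 possible outputs split evenly into n groups. In that case, given our earlier assumption that rand() returns every number between 0 and RAND_MAX with equal probability, the residue classes modulo n are also equally distributed.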
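The condition can be written down as a one-line check. This is only a sketch: modulo_is_unbiased is a hypothetical helper, and rand_max is a parameter standing in for RAND_MAX so different values can be tried.

```cpp
#include <cassert>

// True when every residue 0..n-1 is hit by the same number of values in
// [0, rand_max], i.e. when (rand_max + 1) is a multiple of n.
// rand_max is a stand-in for RAND_MAX (hypothetical parameter).
bool modulo_is_unbiased(long long rand_max, long long n) {
    assert(n > 0 && rand_max >= 0);
    return rand_max % n == n - 1;
}
```

For example, with rand_max = 32767 the check passes for n = 2 (32768 is a multiple of 2) and fails for n = 3.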
So how do we solve this problem? A crude way is to keep generating random numbers until you get one in your desired range:
int x; 
do {
    x = rand();
} while (x >= n);

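However, that is inefficient for low values of n, since each call has only an n/(RAND_MAX + 1) chance of landing in your range, so you need (RAND_MAX + 1)/n calls to rand() on average.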

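A more efficient approach is to take some large range whose length is divisible by n, such as RAND_MAX - RAND_MAX % n, keep generating random numbers until you get one that lies in that range, and then take the modulus: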
int x;

do {
    x = rand();
} while (x >= (RAND_MAX - RAND_MAX % n));

x %= n;

For small values of n, this will rarely require more than one call to rand().
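How rarely can be estimated with a short calculation. The sketch below assumes rand() is uniform on [0, rand_max]; acceptance_probability and expected_calls are hypothetical names, and rand_max again stands in for RAND_MAX.

```cpp
#include <cassert>

// Probability that a single rand() call is accepted by the loop
// `while (x >= rand_max - rand_max % n)`, assuming a uniform rand().
// rand_max is a stand-in for RAND_MAX (hypothetical parameter).
double acceptance_probability(long long rand_max, long long n) {
    assert(n > 0 && n <= rand_max);
    long long accepted = rand_max - rand_max % n;  // accepted values are 0 .. accepted-1
    return static_cast<double>(accepted) / static_cast<double>(rand_max + 1);
}

// Expected number of rand() calls: the mean of a geometric distribution.
double expected_calls(long long rand_max, long long n) {
    return 1.0 / acceptance_probability(rand_max, n);
}
```

With the toy value rand_max = 10 and n = 3, 9 of the 11 outputs are accepted, so about 1.22 calls are needed on average. With rand_max = 32767 and n = 6, 32766 of the 32768 outputs are accepted and the average is barely above 1.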


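Repeatedly drawing a new random number until one fits is a good way to remove the bias.
Update
We can make the code fast if we search for an x in a range divisible by n.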
// Assumptions
// rand() in [0, RAND_MAX]
// n in (0, RAND_MAX]

int x; 

// Keep searching for an x in a range divisible by n 
do {
    x = rand();
} while (x >= RAND_MAX - (RAND_MAX % n));

x %= n;

The above loop should be very fast, about 1 iteration on average when n is much smaller than RAND_MAX.
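There are two common complaints about the use of modulo.

The first applies to all generators. It is easier to see in a limiting case. If your generator has a RAND_MAX of 2 (which is not compliant with the C standard) and you only want 0 or 1 as a value, using modulo will generate 0 (when the generator produces 0 or 2) twice as often as it generates 1 (when the generator produces 1). Note that this holds as long as you do not drop values: whatever mapping you use from the generator's values to the wanted ones, one of them will occur twice as often as the other.

The second is that some kinds of generator have low-order bits that are less random than the others, at least for some parameter choices, and sadly those parameters have other interesting properties (such as allowing RAND_MAX to be one less than a power of 2). The problem is well known, and library implementations have probably avoided it for a long time (for instance, the sample rand() implementation in the C standard uses this kind of generator but drops the 16 least significant bits), but some people like to complain about it and you may have bad luck.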
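As an illustration of the second complaint, consider a linear congruential generator with a power-of-two modulus. The constants below are commonly quoted textbook values and the names lcg_next and low_bit_alternates are hypothetical; this sketch is not a claim about how any particular library implements rand(). When the multiplier and the increment are both odd, the lowest bit of the raw state simply flips on every step, so taking the raw state modulo 2 is not random at all.

```cpp
#include <cassert>
#include <cstdint>

// One step of a toy linear congruential generator modulo 2^31.
// Multiplier and increment are both odd, which forces the lowest
// bit of the state to flip on every step.
std::uint32_t lcg_next(std::uint32_t state) {
    return (state * 1103515245u + 12345u) & 0x7fffffffu;
}

// Returns true if the lowest bit alternated on each of `steps` steps.
bool low_bit_alternates(std::uint32_t seed, int steps) {
    std::uint32_t s = seed & 0x7fffffffu;
    for (int i = 0; i < steps; ++i) {
        std::uint32_t next = lcg_next(s);
        if ((next & 1u) == (s & 1u)) return false;
        s = next;
    }
    return true;
}
```

This is why implementations built on such generators typically return only the higher-order bits of the state.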

Using something like
int alea(int n){ 
 assert (0 < n && n <= RAND_MAX); 
 int partSize = 
      n == RAND_MAX ? 1 : 1 + (RAND_MAX-n)/(n+1); 
 int maxUsefull = partSize * n + (partSize-1); 
 int draw; 
 do { 
   draw = rand(); 
 } while (draw > maxUsefull); 
 return draw/partSize; 
}


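to generate a random number between 0 and n avoids both problems (and it avoids overflow when RAND_MAX == INT_MAX).
By the way, C++11 introduced standard ways to do the reduction, along with generators other than rand().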
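A minimal sketch of the C++11 route, using std::mt19937 as the engine and std::uniform_int_distribution for the range reduction. The wrapper name uniform_below is hypothetical, and the choice of engine is an assumption; any standard engine could be substituted.

```cpp
#include <cassert>
#include <random>

// Draw an integer uniformly from [0, n - 1] using the C++11 <random> facilities.
// The distribution performs the range reduction, so no manual % is needed.
int uniform_below(std::mt19937& engine, int n) {
    assert(n > 0);
    std::uniform_int_distribution<int> dist(0, n - 1);
    return dist(engine);
}
```

In real code the engine would usually be seeded once, for example from std::random_device, and then reused for every draw.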
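As the accepted answer indicates, "modulo bias" has its roots in a low value of RAND_MAX. It uses an extremely small RAND_MAX (10) to make the point. Along the same lines, suppose rand() could return only the ten values 0 through 9 and you tried to generate a number between 0 and 2 using %. The outcomes would be:
rand() % 3   // if rand() returned only 0 through 9, gives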
output of rand()   |   rand()%3
0                  |   0
1                  |   1
2                  |   2
3                  |   0
4                  |   1
5                  |   2
6                  |   0
7                  |   1
8                  |   2
9                  |   0

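So there are 4 outputs of 0 (a 4/10 chance) and only 3 outputs each of 1 and 2 (a 3/10 chance each).
So it is biased. The lower numbers have a better chance of coming out.
But this only shows up so obviously when RAND_MAX is small. Or, more specifically, when the number you are modding by is large compared to RAND_MAX.
A much better solution than looping (which is insanely inefficient and shouldn't even be suggested) is to use a PRNG with a much larger output range. The Mersenne Twister algorithm has a maximum output of 4,294,967,295. As such, MersenneTwister::genrand_int32() % 10 will, for all intents and purposes, be equally distributed, and the modulo bias effect all but disappears.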
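The size of the remaining bias can be worked out directly. The sketch below assumes a generator that is uniform on [0, 2^32 - 1]; ResidueSpread and residue_spread_32bit are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// How many of the 2^32 raw outputs map to the most and the least
// frequent residue modulo n.
struct ResidueSpread {
    std::uint64_t most;
    std::uint64_t least;
};

ResidueSpread residue_spread_32bit(std::uint32_t n) {
    assert(n > 0);
    const std::uint64_t range = 1ull << 32;  // number of possible outputs
    const std::uint64_t base = range / n;    // every residue gets at least this many
    const std::uint64_t extra = range % n;   // the first `extra` residues get one more
    return ResidueSpread{ extra ? base + 1 : base, base };
}
```

For n = 10, residues 0 through 5 each receive 429,496,730 raw outputs and residues 6 through 9 each receive 429,496,729. The relative difference is roughly 2 parts in a billion: very small, but not zero.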
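@user1413793 is correct about the problem. I'm not going to discuss it further, except to make one point: yes, for small values of n and large values of RAND_MAX, the modulo bias can be very small. But using a bias-inducing pattern means that you must consider the bias every time you calculate a random number, and choose different patterns for different cases.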
/*
 * Calculate a uniformly distributed random number less than upper_bound
 * avoiding "modulo bias".
 *
 * Uniformity is achieved by generating new random numbers until the one
 * returned is outside the range [0, 2**32 % upper_bound).  This
 * guarantees the selected random number will be inside
 * [2**32 % upper_bound, 2**32) which maps back to [0, upper_bound)
 * after reduction modulo upper_bound.
 */
u_int32_t
arc4random_uniform(u_int32_t upper_bound)
{
    u_int32_t r, min;

    if (upper_bound < 2)
        return 0;

    /* 2**32 % x == (2**32 - x) % x */
    min = -upper_bound % upper_bound;

    /*
     * This could theoretically loop forever but each retry has
     * p > 0.5 (worst case, usually far better) of selecting a
     * number inside the range we need, so it should rarely need
     * to re-roll.
     */
    for (;;) {
        r = arc4random();
        if (r >= min)
            break;
    }

    return r % upper_bound;
}
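The line that computes min relies on unsigned wraparound: negating a 32-bit unsigned value x gives 2^32 - x, which has the same remainder modulo x as 2^32 itself. A small sketch to check that identity, assuming unsigned int is 32 bits wide; pow32_mod_wide and pow32_mod_wrap are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// 2^32 mod n computed directly in 64-bit arithmetic.
std::uint32_t pow32_mod_wide(std::uint32_t n) {
    assert(n > 0);
    return static_cast<std::uint32_t>((1ull << 32) % n);
}

// The same value computed with the 32-bit wraparound trick:
// 0u - n wraps to 2^32 - n, and (2^32 - n) % n == 2^32 % n.
std::uint32_t pow32_mod_wrap(std::uint32_t n) {
    assert(n > 0);
    return (0u - n) % n;
}
```

Both functions return 6 for n = 10, which is the number of low values the loop rejects so that the remaining outputs divide evenly by 10.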

public int nextInt(int n) {
   if (n <= 0)
     throw new IllegalArgumentException("n must be positive");

   if ((n & -n) == n)  // i.e., n is a power of 2
     return (int)((n * (long)next(31)) >> 31);

   int bits, val;
   do {
       bits = next(31);
       val = bits % n;
   } while (bits - val + (n-1) < 0);
   return val;
 }

int unbiased_random_bit() {    
    int x1, x2, prev;
    prev = 2;
    x1 = rand() % 2;
    x2 = rand() % 2;

    for (;; x1 = rand() % 2, x2 = rand() % 2)
    {
        if (x1 ^ x2)      // 01 -> 1, or 10 -> 0.
        {
            return x2;        
        }
        else if (x1 & x2)
        {
            if (!prev)    // 0011
                return 1;
            else
                prev = 1; // 1111 -> continue, bias unresolved
        }
        else
        {
            if (prev == 1)// 1100
                return 0;
            else          // 0000 -> continue, bias unresolved
                prev = 0;
        }
    }
}

000 = 0, 001 = 1, 010 = 2, 011 = 3
100 = 4, 101 = 5, 110 = 6, 111 = 7

4 discarded results / 16 possibilities = 25%

32 % 6 = 2 discarded results; and
2 discarded results / 32 possibilities = 6.25%

[2^x mod 6] / 2^x == [2^(x+1) mod 6] / 2^(x+1)
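The figures above appear to track how many bit patterns must be thrown away when a fixed number of random bits is mapped onto 6 outcomes. A minimal sketch of that calculation, with discard_fraction as a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Fraction of the 2^bits equally likely bit patterns that must be discarded
// so that the remaining patterns split evenly into n outcomes.
double discard_fraction(int bits, std::uint64_t n) {
    assert(bits > 0 && bits < 63 && n > 0);
    const std::uint64_t patterns = 1ull << bits;
    return static_cast<double>(patterns % n) / static_cast<double>(patterns);
}
```

For n = 6 this gives 25% with 3 bits (2 of 8) and with 4 bits (4 of 16), then 6.25% with 5 bits (2 of 32) and with 6 bits (4 of 64), so going from an odd bit count to the next even one does not reduce the waste.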

#include <iostream>
#include <assert.h>
#include <cstdint>
#include <limits>
#include <openssl/rand.h>

volatile uint32_t dummy;
uint64_t discardCount;

uint32_t uniformRandomUint32(uint32_t upperBound)
{
    assert(RAND_status() == 1);
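    // Number of values at the top of the 64-bit range to reject: 2^64 mod upperBound.
    uint64_t discard = (std::numeric_limits<uint64_t>::max() % upperBound + 1) % upperBound;
    // Fill randomPool with random bytes; do not assign RAND_bytes' status code to it.
    uint64_t randomPool;
    RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));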

    while(randomPool > (std::numeric_limits<uint64_t>::max() - discard)) {
        RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));
        ++discardCount;
    }

    return randomPool % upperBound;
}

int main() {
    discardCount = 0;

    const uint32_t MODULUS = (1ul << 31)-1;
    const uint32_t ROLLS = 10000000;

    for(uint32_t i = 0; i < ROLLS; ++i) {
        dummy = uniformRandomUint32(MODULUS);
    }
    std::cout << "Discard count = " << discardCount << std::endl;
}

next: n

    | bitSize r from to |
    n < 0 ifTrue: [^0 - (self next: 0 - n)].
    n = 0 ifTrue: [^nil].
    n = 1 ifTrue: [^0].
    cache isNil ifTrue: [cache := OrderedCollection new].
    cache size < (self randmax highBit) ifTrue: [
        Security.DSSRandom default next asByteArray do: [ :byte |
            (1 to: 8) do: [ :i |    cache add: (byte bitAt: i)]
        ]
    ].
    r := 0.
    bitSize := n highBit.
    to := cache size.
    from := to - bitSize + 1.
    (from to: to) do: [ :i |
        r := r bitAt: i - from + 1 put: (cache at: i)
    ].
    cache removeFrom: from to: to.
    r >= n ifTrue: [^self next: n].
    ^r

int x;

do {
    x = rand();
} while (x >= (RAND_MAX - RAND_MAX % n));

x %= n;

EG: 

Ran Max Value (RM) = 255
Valid Outcome (N) = 4

When X >= 252, discarded values for X are: 252, 253, 254, 255

So, if Random Value Selected (X) = {252, 253, 254, 255}

Number of discarded Values (I) = RM % N + 1 == N

 IE:

 I = RM % N + 1
 I = 255 % 4 + 1
 I = 3 + 1
 I = 4

   X >= ( RM - RM % N )
 255 >= (255 - 255 % 4) 
 255 >= (255 - 3)
 255 >= (252)

 Discard Returns $True

D = (RM - N)

RM=255 , N=2 Then: D = 253, Lost percentage = 0.78125%

RM=255 , N=4 Then: D = 251, Lost percentage = 1.5625%
RM=255 , N=8 Then: D = 247, Lost percentage = 3.125%
RM=255 , N=16 Then: D = 239, Lost percentage = 6.25%
RM=255 , N=32 Then: D = 223, Lost percentage = 12.5%
RM=255 , N=64 Then: D = 191, Lost percentage = 25%
RM=255 , N=128 Then: D = 127, Lost percentage = 50%

 int x;
 
 do {
     x = rand();
 } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
 
 x %= n;

RAND_MAX = 3, n = 2, Values in RAND_MAX = 0,1,2,3, Valid Sets = 0,1 and 2,3.
When X >= (RAND_MAX - ( RAND_MAX % n ) )
When X >= 2 the value will be discarded, even though the set is valid.

RAND_MAX = 3, n = 2, Values in RAND_MAX = 0,1,2,3, Valid Sets = 0,1 and 2,3.
When X > (RAND_MAX - ( ( RAND_MAX % n  ) + 1 ) % n )
When X > 3 the value would be discarded, but this is not a value in the set RAND_MAX so there will be no discard.
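The difference between the two loop conditions can be checked by brute force. In this sketch rand_max stands in for RAND_MAX, and rejected_original and rejected_adjusted are hypothetical helpers that count how many of the values 0..rand_max each condition would throw away.

```cpp
#include <cassert>

// Values rejected by `x >= RAND_MAX - RAND_MAX % n`.
// rand_max is a stand-in for RAND_MAX (hypothetical parameter).
int rejected_original(int rand_max, int n) {
    assert(n > 0 && n <= rand_max);
    int count = 0;
    for (int x = 0; x <= rand_max; ++x)
        if (x >= rand_max - rand_max % n) ++count;
    return count;
}

// Values rejected by `x > RAND_MAX - ((RAND_MAX % n) + 1) % n`.
int rejected_adjusted(int rand_max, int n) {
    assert(n > 0 && n <= rand_max);
    int count = 0;
    for (int x = 0; x <= rand_max; ++x)
        if (x > rand_max - ((rand_max % n) + 1) % n) ++count;
    return count;
}
```

For rand_max = 3, n = 2 the first condition rejects 2 values and the second rejects none; for rand_max = 255, n = 4 the counts are 4 and 0. When n does not divide rand_max + 1, as with rand_max = 10 and n = 3, both reject the same 2 values.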

int x;

if (n != 0) {
    do {
        x = rand();
    } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );

    x %= n;
} else {
    x = rand();
}

// Assumes:
//  RAND_MAX is a globally defined constant, returned from the environment.
//  int n; // User input, or externally defined, number of valid choices.

 int x;
 
 do {
     x = rand();
 } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
 
 x %= n;

// Assumes:
//  RAND_MAX is a globally defined constant, returned from the environment.
//  int n; // User input, or externally defined, number of valid choices.

int x;

if (n != 0) {
    do {
        x = rand();
    } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );

    x %= n;
} else {
    x = rand();
}

// Assumes:
//  RAND_MAX is a globally defined constant, returned from the environment.
//  int n; // User input, or externally defined, number of valid choices.

int x; // Resulting random number
int y; // One-time calculation of the compare value for x

if (n != 0) {
    // Compute the threshold only once n is known to be non-zero (avoids % 0).
    y = RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n);
    do {
        x = rand();
    } while (x > y);

    x %= n;
} else {
    x = rand();
}

function randomInt(minInclusive, maxExclusive) {
 var maxInclusive = (maxExclusive - minInclusive) - 1
 var x = 1
 var y = 0
 while(true) {
    x = x * 2
    var randomBit = (Math.random() < 0.5 ? 0 : 1)
    y = y * 2 + randomBit
    if(x > maxInclusive) {
      if (y <= maxInclusive) { return y + minInclusive }
      // Rejection
      x = x - maxInclusive - 1
      y = y - maxInclusive - 1
    }
 }
}