How do I shuffle a very large list stored in a file in Python?

python list numpy random

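I need to deterministically generate a randomized list containing the numbers from 0 to 2^32-1.

This would be the naive (and totally nonfunctional) way of doing it, just so it is clear what I want:

import random
numbers = range(2**32)
random.seed(0)
random.shuffle(numbers)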

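I have tried creating the list with numpy.arange() and shuffling it with pycrypto's random.shuffle(). The list consumed about 8 GB of RAM, and the shuffle raised that to around 25 GB. I only have 32 GB available. But that does not matter, because time is the real problem: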

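I have tried cutting the list into 1024 slices and applying the approach above, but even one of those slices takes far too long. I cut one of the slices into 128 even smaller slices, and each of those took about 620 ms. If the time grew linearly, the whole process would take about 22.5 hours. That sounds acceptable, but it does not grow linearly.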
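The estimate in the previous paragraph can be reproduced with a few lines of arithmetic. This is only a sketch under the assumption stated in the question (perfectly linear scaling); the variable names are illustrative.

```python
# Extrapolate the measured time, assuming (optimistically) linear scaling.
slices = 1024                  # top-level slices of the full list
sub_slices = 128               # smaller slices per top-level slice
seconds_per_sub_slice = 0.620  # measured: about 620 ms each

total_seconds = slices * sub_slices * seconds_per_sub_slice
total_hours = total_seconds / 3600
print(f"{total_hours:.1f} hours")  # about 22.6 hours
```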

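Another thing I have tried is generating a random number for every entry and using it as the index of the entry's new position. I then walk down the list and try to place each number at its new index. If that index is already in use, it is incremented until a free one is found. This works in theory and gets about halfway through, but near the end it has to keep searching for free spots, wrapping around the list several times.

Is there any way to pull this off? Is this a feasible goal at all?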

One way to do this is to keep track of the numbers you have already given out and hand out new random numbers one at a time. Consider:

import random
random.seed(0)

class RandomDeck:
    def __init__(self):
        self.usedNumbers = set()

    def draw(self):
        # randrange excludes the upper bound, so this yields 0 .. 2**32 - 1
        number = random.randrange(2**32)
        while number in self.usedNumbers:
            number = random.randrange(2**32)
        self.usedNumbers.add(number)  # sets use add(), not append()
        return number

    def shuffle(self):
        # Forget every number handed out so far
        self.usedNumbers = set()

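As you can see, we essentially have a deck of random numbers from 0 to 2^32-1, but we only store the numbers we have already given out, to ensure there are no repeats. You can then reshuffle the deck by forgetting all the numbers you have given out.

This should work for most use cases, as long as you do not draw around a million numbers without reshuffling.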
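To see the idea in action without touching the module-level random state, here is a minimal sketch of the same rejection approach on a small range. The class name `SmallDeck`, the `size` parameter, and the private `random.Random` instance are assumptions added for illustration; they are not part of the answer.

```python
import random


class SmallDeck:
    """Rejection-sampling deck over range(size); illustrative only."""

    def __init__(self, size, seed=0):
        self.size = size
        self.rng = random.Random(seed)  # private generator, deterministic per seed
        self.used = set()

    def draw(self):
        if len(self.used) == self.size:
            raise IndexError("deck is exhausted")
        number = self.rng.randrange(self.size)
        # Retry until we hit a number that has not been handed out yet.
        while number in self.used:
            number = self.rng.randrange(self.size)
        self.used.add(number)
        return number


deck = SmallDeck(size=10, seed=0)
order = [deck.draw() for _ in range(10)]
print(sorted(order) == list(range(10)))  # True: every number appears exactly once
```

The retry loop is the weak spot: once k of N numbers have been used, a draw needs N / (N - k) attempts on average, so the last draws become slow, and the set ends up holding every number anyway.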

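If you have a contiguous range of numbers, you do not need to store them at all. It is easy to devise a bidirectional mapping between a value in the shuffled list and its position in that list. The idea is to use a pseudo-random permutation, which is exactly what a block cipher provides.

The trick is to find a block cipher that exactly fits your requirement of 32-bit integers. There are very few such block ciphers, but the Simon and Speck ciphers (released by the NSA) are parameterisable and support a block size of 32 bits (block sizes are usually much larger).

There appears to be a Python library that provides an implementation. With it, we can devise the following functions: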

def get_value_from_index(key, i):
    cipher = SpeckCipher(key, mode='ECB', key_size=64, block_size=32)
    return cipher.encrypt(i)

def get_index_from_value(key, val):
    cipher = SpeckCipher(key, mode='ECB', key_size=64, block_size=32)
    return cipher.decrypt(val)

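The library uses Python's big integers, so you may not even need to encode them.

A 64-bit key (for example 0x123456789ABCDEF0) is not much. You can use a construction similar to the one that extended the key size of DES into Triple DES. Keep in mind that keys should be chosen randomly, and they have to stay constant if you want determinism.

If you would rather not use an algorithm designed by the NSA, I would understand. There are others, but I cannot find them right now. The Hasty Pudding cipher is even more flexible, but I do not know whether there is a Python implementation of it.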
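If no suitable cipher library is at hand, the index-to-value mapping can still be illustrated with a toy Feistel network, which is a bijection on 32-bit integers by construction. This is a sketch only: it is not Speck, it is not a vetted cipher, and the names `feistel_encrypt`, `feistel_decrypt`, and `_round`, the SHA-256 round function, and the round count are all assumptions made for illustration.

```python
import hashlib


def _round(key: bytes, r: int, half: int) -> int:
    """Keyed round function: maps a 16-bit half to 16 pseudo-random bits."""
    data = key + bytes([r]) + half.to_bytes(2, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")


def feistel_encrypt(key: bytes, i: int, rounds: int = 4) -> int:
    """Map an index in [0, 2**32) to its value in the permuted sequence."""
    left, right = i >> 16, i & 0xFFFF
    for r in range(rounds):
        left, right = right, left ^ _round(key, r, right)
    return (left << 16) | right


def feistel_decrypt(key: bytes, v: int, rounds: int = 4) -> int:
    """Inverse mapping: recover the index from a value."""
    left, right = v >> 16, v & 0xFFFF
    for r in reversed(range(rounds)):
        left, right = right ^ _round(key, r, left), left
    return (left << 16) | right


key = b"constant key => deterministic order"
first_values = [feistel_encrypt(key, i) for i in range(5)]
print(all(feistel_decrypt(key, v) == i for i, v in enumerate(first_values)))  # True
```

Because each round is invertible regardless of what the round function returns, no two indices can map to the same value, and nothing has to be stored beyond the key.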

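I created a class that uses a bit array to keep track of which numbers have already been used. With the comments, I think the code is fairly self-explanatory.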

import bitarray
import random


class UniqueRandom:
    def __init__(self):
        """ Init boolean array of used numbers and set all to False
        """
        self.used = bitarray.bitarray(2**32)
        self.used.setall(False)

    def draw(self):
        """ Draw a previously unused number
         Return False if no free numbers are left
        """

        # Check if there are numbers left to use; return False if none are left
        if self._free() == 0:
            return False

        # Draw a random index
        i = random.randint(0, 2**32-1)

        # Skip ahead from the random index to an undrawn number
        while self.used[i]:
            i = (i+1) % 2**32

        # Update used array
        self.used[i] = True

        # return the selected number
        return i

    def _free(self):
        """ Check how many places are unused
        """
        return self.used.count(False)


def main():
    r = UniqueRandom()
    for _ in range(20):
        print(r.draw())


if __name__ == '__main__':
    main()

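Design considerations

While Garrigan Stafford's answer is perfectly fine, this solution has a much smaller memory footprint (the bit array holds 2^32 bits, about 512 MiB). Another difference between our answers is that Garrigan's algorithm needs more time to produce a random number as the count of generated numbers grows, because it keeps retrying until it finds an unused number. This algorithm simply looks for the next unused number when the drawn one is already taken, which makes the time needed to draw a number practically the same regardless of how far the pool of free numbers has been depleted.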
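Counting the free slots on every draw scans the whole array, so one possible refinement is to keep a running counter instead. Below is a minimal sketch on a small range, using a plain `bytearray` (one byte per flag, not one bit) so that it runs without third-party packages; `SmallUniqueRandom`, `size`, and `remaining` are hypothetical names, and the skip-ahead rule is the one described in this answer.

```python
import random


class SmallUniqueRandom:
    def __init__(self, size, seed=0):
        self.size = size
        self.used = bytearray(size)   # 0 = free, 1 = used
        self.remaining = size         # running count of free numbers
        self.rng = random.Random(seed)

    def draw(self):
        """Return an unused number, or None when none are left."""
        if self.remaining == 0:
            return None
        i = self.rng.randrange(self.size)
        # Skip ahead (wrapping around) to the next free slot.
        while self.used[i]:
            i = (i + 1) % self.size
        self.used[i] = 1
        self.remaining -= 1
        return i


r = SmallUniqueRandom(size=8)
drawn = [r.draw() for _ in range(8)]
print(sorted(drawn) == list(range(8)))  # True
print(r.draw())                         # None: the pool is exhausted
```

Returning None rather than False avoids confusion with the valid number 0. Note also that skipping ahead favors numbers that sit right after a long run of used slots, so the resulting order is not a uniformly random permutation.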

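Another option is a class that generates a random permutation of the integers from 0 up to a given size on demand, without storing the list: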

from math import isqrt  # Python 3.8+


def _is_prime(n):
    if n == 2:
        return True
    if n == 1 or n % 2 == 0:
        return False

    # Trial division by odd numbers up to the integer square root
    for d in range(3, isqrt(n) + 1, 2):
        if n % d == 0:
            return False

    return True


class Permutation:  # the original derives from a Range base class that is not shown here
    """
    Generates a random permutation of integers from 0 up to size.
    Inspired by https://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/
    """

    size: int
    prime: int
    seed: int

    def __init__(self, size: int, seed: int):
        self.size = size
        self.prime = self._get_prime(size)
        self.seed = seed % self.prime

    def __getitem__(self, index):
        x = self._map(index)

        while x >= self.size:
            # If we map to a number greater than size, then the cycle of successive mappings must eventually result
            # in a number less than size. Proof: The cycle of successive mappings traces a path
            # that either always stays in the set n>=size or it enters and leaves it,
            # else the 1:1 mapping would be violated (two numbers would map to the same number).
            # Moreover, `set(range(size)) - set(map(n) for n in range(size) if map(n) < size)`
            # equals the `set(map(n) for n in range(size, prime) if map(n) < size)`
            # because the total mapping is exhaustive.
            # Which means we'll arrive at a number that wasn't mapped to by any other valid index.
            # This will take at most `prime-size` steps, and `prime-size` is on the order of log(size), so fast.
            # But usually we just need to remap once.
            x = self._map(x)

        return x

    @staticmethod
    def _get_prime(size):
        """
        Returns the prime number >= size which has the form (4n-1)
        """
        n = size + (3 - size % 4)
        while not _is_prime(n):
            # We expect to find a prime after O(log(size)) iterations
            # Using a brute-force primehood test, total complexity is O(log(size)*sqrt(size)), which is pretty good.
            n = n + 4
        return n

    def _map(self, index):
        a = self._permute_qpr(index)
        b = (a + self.seed) % self.prime
        c = self._permute_qpr(b)
        return c

    def _permute_qpr(self, x):
        residue = pow(x, 2, self.prime)

        if x * 2 < self.prime:
            return residue
        else:
            return self.prime - residue
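
The building block of this class can be checked in isolation: for a prime p of the form 4n-1, squaring modulo p and mirroring the upper half yields a permutation of 0..p-1. The following sketch restates `_permute_qpr` as a free function; the name `permute_qpr` and the small prime 11 are chosen here purely for illustration.

```python
def permute_qpr(x, prime):
    """Quadratic-residue permutation of range(prime); prime % 4 must be 3."""
    residue = pow(x, 2, prime)
    return residue if 2 * x < prime else prime - residue


values = [permute_qpr(x, 11) for x in range(11)]
print(values)                             # [0, 1, 4, 9, 5, 3, 8, 6, 2, 7, 10]
print(sorted(values) == list(range(11)))  # True: no value repeats
```

On its own this mapping is quite regular (0 and 1 stay in place, for example), which is presumably why `_map` applies it twice with the seed added in between.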