How can I fill a 2D Python numpy array with values from a generator?

python arrays numpy

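Judging by the linked answers, there seems to be no easy way to fill a 2D numpy array with data from a generator.

However, I would be grateful if someone could come up with a way to vectorize or otherwise speed up the function below.

The difference here is that I want to process the values from the generator in batches rather than build the whole array in memory. The only way I could think of was a for loop: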

import numpy as np
from itertools import permutations

permutations_of_values = permutations(range(1,20), 7)

def array_from_generator(generator, arr):
    """Fills the numpy array provided with values from
    the generator provided. Number of columns in arr
    must match the number of values yielded by the 
    generator."""
    count = 0
    for row in arr:
        try:
            item = next(generator)
        except StopIteration:
            break
        row[:] = item
        count += 1
    return arr[:count,:]

batch_size = 100000

empty_array = np.empty((batch_size, 7), dtype=int)
batch_of_values = array_from_generator(permutations_of_values, empty_array)

print(batch_of_values[0:5])

Output:

[[ 1  2  3  4  5  6  7]
 [ 1  2  3  4  5  6  8]
 [ 1  2  3  4  5  6  9]
 [ 1  2  3  4  5  6 10]
 [ 1  2  3  4  5  6 11]]

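Additional:

As suggested by @COLDSPEED (thanks), here is a version that uses a list to collect the data from the generator. It is noticeably faster than the code above (85.6 ms versus 137 ms per call in the timings below, about 1.6x). Can anyone improve on this?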

permutations_of_values = permutations(range(1,20), 7)

def array_from_generator2(generator, rows=batch_size):
    """Creates a numpy array from a specified number 
    of values from the generator provided."""
    data = []
    for row in range(rows):
        try:
            data.append(next(generator))
        except StopIteration:
            break
    return np.array(data)

batch_size = 100000

batch_of_values = array_from_generator2(permutations_of_values, rows=100000)

print(batch_of_values[0:5])

Output:

[[ 1  2  3  4  5  6  7]
 [ 1  2  3  4  5  6  8]
 [ 1  2  3  4  5  6  9]
 [ 1  2  3  4  5  6 10]
 [ 1  2  3  4  5  6 11]]

Speed tests:

%timeit array_from_generator(permutations_of_values, empty_array)
10 loops, best of 3: 137 ms per loop

%timeit array_from_generator2(permutations_of_values, rows=100000)
10 loops, best of 3: 85.6 ms per loop
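
The list-collecting loop above can also be expressed with itertools.islice, which takes care of the "stop after N items or when the generator runs out" logic. This is only a sketch: the name array_from_generator_islice is made up for illustration, and no timing was measured for it here.

```python
from itertools import islice, permutations

import numpy as np


def array_from_generator_islice(generator, rows):
    """Return up to `rows` items from `generator` as a 2D array."""
    # islice stops after `rows` items or when the generator is exhausted,
    # so no explicit StopIteration handling is needed.
    return np.array(list(islice(generator, rows)))


perms = permutations(range(1, 20), 7)
batch = array_from_generator_islice(perms, rows=5)
print(batch.shape)  # (5, 7)
```

Calling the function again on the same generator continues where the previous batch stopped. Once the generator is exhausted, the result is an empty array of shape (0,), so a caller would need to check for that.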

You can calculate the size ahead of time in essentially constant time. Just do that, and then use numpy.fromiter:

In [1]: import math
    ...: import numpy as np
    ...: from itertools import permutations, chain
    ...:

In [2]: def n_chose_k(n, k, fac=math.factorial):
    ...:     # number of k-permutations of n items: n!/(n-k)!
    ...:     return fac(n)//fac(n-k)
    ...:

In [3]: def permutations_to_array(r, k):
    ...:     n = len(r)
    ...:     size = int(n_chose_k(n, k))
    ...:     it = permutations(r, k)
    ...:     # NOTE: count is the number of scalars read, so count=size
    ...:     # fills only size//k rows; use count=size*k and shape (size, k)
    ...:     # to get every permutation.
    ...:     arr = np.fromiter(chain.from_iterable(it),
    ...:                       count=size,  dtype=int)
    ...:     arr.shape = size//k, k
    ...:     return arr
    ...:

In [4]: arr = permutations_to_array(range(1,20), 7)

In [5]: arr.shape
Out[5]: (36279360, 7)

In [6]: arr[0:5]
Out[6]:
array([[ 1,  2,  3,  4,  5,  6,  7],
       [ 1,  2,  3,  4,  5,  6,  8],
       [ 1,  2,  3,  4,  5,  6,  9],
       [ 1,  2,  3,  4,  5,  6, 10],
       [ 1,  2,  3,  4,  5,  6, 11]])

This will work as long as r is limited to sequences that have a len.
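
Since np.fromiter's count argument counts scalar items rather than rows, reading every permutation takes count = rows * k. A minimal sketch on a small input, where the function name all_permutations_array is an assumption made up for illustration:

```python
import math
from itertools import chain, permutations

import numpy as np


def all_permutations_array(r, k, dtype=int):
    """Return every k-permutation of the sequence `r` as an (nPk, k) array."""
    n = len(r)
    rows = math.factorial(n) // math.factorial(n - k)  # nPk as an exact integer
    flat = chain.from_iterable(permutations(r, k))     # flat stream of scalars
    # `count` is the number of scalars to read, hence rows * k.
    arr = np.fromiter(flat, dtype=dtype, count=rows * k)
    return arr.reshape(rows, k)


arr = all_permutations_array(range(1, 5), 2)
print(arr.shape)  # (12, 2)
```

For range(1, 20) and k = 7 this would mean 253,955,520 rows, roughly 14 GB with 8-byte integers, which is why the batched approach matters for the original problem.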

Edited to add an implementation I wrote for a generator of batchsize*k chunks, with a trim option:

import math
from itertools import chain, permutations, repeat

import numpy as np

def n_chose_k(n, k, fac=math.factorial):
    # number of k-permutations of n items: n!/(n-k)!
    return fac(n)//fac(n-k)

def permutations_in_batches(r, k, batchsize=None, fill=0, dtype=int, trim=False):
    n = len(r)
    size = int(n_chose_k(n, k))  # total number of permutations (rows)
    if batchsize is None or batchsize > size:
        batchsize = size
    # flat stream of scalars, k per permutation
    perms = chain.from_iterable(permutations(r, k))
    count = batchsize*k  # scalars in one full batch
    full, leftover = divmod(size, batchsize)
    for _ in range(full):
        current = np.fromiter(perms, count=count, dtype=dtype)
        current.shape = batchsize, k
        yield current
    if leftover:  # final, partial batch
        if trim:
            # return only the rows that actually exist
            current = np.fromiter(perms, count=leftover*k, dtype=dtype)
            current.shape = leftover, k
        else:
            # pad the last batch with `fill` up to the full batch shape
            padding = repeat(fill, (batchsize - leftover)*k)
            current = np.fromiter(chain(perms, padding), count=count, dtype=dtype)
            current.shape = batchsize, k
        yield current
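
When the source is an arbitrary generator with no len, the total size cannot be computed up front. One possible variation, sketched here under that assumption, slices the generator with itertools.islice and lets np.fromiter size each batch by itself. The name batches_from_iterable is hypothetical, and omitting count means NumPy grows its buffer as it reads, which may cost some speed.

```python
from itertools import chain, islice, permutations

import numpy as np


def batches_from_iterable(rows_iter, k, batchsize, dtype=int):
    """Yield (m, k) arrays with m <= batchsize until `rows_iter` is exhausted."""
    rows_iter = iter(rows_iter)
    while True:
        # Take at most `batchsize` rows and flatten them into scalars.
        flat = chain.from_iterable(islice(rows_iter, batchsize))
        batch = np.fromiter(flat, dtype=dtype).reshape(-1, k)
        if batch.shape[0] == 0:
            return  # nothing left in the source
        yield batch


shapes = [b.shape for b in batches_from_iterable(permutations(range(4), 2), 2, 5)]
print(shapes)  # [(5, 2), (5, 2), (2, 2)]
```

The last batch is simply shorter, which corresponds to a trimmed final batch rather than a padded one.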

Filling a list and then calling np.array on the result should be simpler.

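fromiter, as discussed in the two linked answers, is the only way to create an array directly from the output of a generator. Otherwise you need to create a list and build or fill an array from it. Generators can save memory during intermediate processing (cf. the equivalent list), but they are not any faster.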

fromiter would be better, but it only works for 1D arrays. Can you know the size ahead of time? Then you can still use fromiter.

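If you read the documentation, it says that fromiter creates "a new 1-dimensional array from an iterable object". What I am trying to do here is 2-dimensional, because each item from the generator is a tuple of 7 values. Maybe it is time to extend fromiter to handle multidimensional iterators...

Where did you get it from?

@Bill Sorry, I forgot to put it in the answer.

Very nice, thanks. However, I don't think the dimensions of the resulting array are quite right. Shouldn't it be count=size*k and arr.resize((size, k))?

@Bill, in my experience fromiter's count does not make much difference to the run time. And arr.reshape(-1, k) does not need the size either.

@Bill I ended up writing one for fun, if you want to take a look.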
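
On the wish for a multidimensional fromiter: as far as I know, NumPy 1.23 and later accept a subarray dtype in np.fromiter, so each yielded tuple can become one row without flattening. The sketch below assumes such a NumPy version; row_dtype is just an illustrative name.

```python
from itertools import islice, permutations

import numpy as np

perms = permutations(range(1, 20), 7)
row_dtype = np.dtype((int, 7))  # each yielded item is one row of 7 ints

# Assumes NumPy >= 1.23; older versions reject subarray dtypes here.
batch = np.fromiter(islice(perms, 100000), dtype=row_dtype)
print(batch.shape)  # (100000, 7)
```

Repeating the fromiter call on the same perms iterator returns the next batch of up to 100000 rows.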