Efficient way in Python to append one 2D vector to another 2D vector multiple times


I am trying to append one 2D vector to another 2D vector multiple times. So I have a matrix that gets filled with matrix2 many times, but the more the matrix grows, the longer each append takes.

Here is my actual code:

import numpy as np


# dummy function just for testing
def get_max_subtree_length(groups):
    return 20

def pad_groups(dataset, groups):
    dataset = np.array(dataset)
    max_subtree_length = get_max_subtree_length(groups)
    padded_dataset = np.array([[]])
    start_range = 0
    dataset_row_length = len(dataset[0]) - 1
    zeros_pad = np.zeros(dataset_row_length)
    for group in groups:
        pad = np.array([group[0]])
        pad = np.append(pad, zeros_pad)
        end_range = start_range + group[1]
        subtree = dataset[start_range:end_range, :]
        if len(padded_dataset[0]) == 0:
            padded_dataset = subtree
        else:
            padded_dataset = np.vstack([padded_dataset, subtree])
        subtree_length = group[1]
        subtree_to_pad = max_subtree_length - subtree_length
        # Append subtree_to_pad (number of pad to append) times the same pad array to the dataset
        pads = np.repeat([pad], subtree_to_pad, axis=0)
        padded_dataset = np.vstack([padded_dataset, pads])
        start_range = end_range
    return padded_dataset
To test it:

dataset = np.array([
    [1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 2, 3], [2, 2, 3],
    [2, 2, 3], [3, 2, 3], [3, 2, 3], [3, 2, 3], [4, 2, 3],
    [4, 2, 3], [4, 2, 3], [5, 2, 3], [5, 2, 3], [5, 2, 3],
    [6, 2, 3], [6, 2, 3], [6, 2, 3], [7, 2, 3], [7, 2, 3],
    [7, 2, 3], [8, 2, 3], [8, 2, 3], [8, 2, 3]])

groups = [(1, 3), (2, 3), (3, 3), (4, 3), (5, 3), (6, 3), (7, 3), (8, 3)]

dataset = pad_groups(dataset, groups)
print(len(dataset))
# 160
print(dataset)
# [[1. 2. 3.]
#  [1. 2. 3.]
#  [1. 2. 3.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [2. 2. 3.]
#  [2. 2. 3.]
#  [2. 2. 3.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [2. 0. 0.]
#  [3. 2. 3.]
#  [3. 2. 3.]
#  [3. 2. 3.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [3. 0. 0.]
#  [4. 2. 3.]
#  [4. 2. 3.]
#  [4. 2. 3.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [4. 0. 0.]
#  [5. 2. 3.]
#  [5. 2. 3.]
#  [5. 2. 3.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [5. 0. 0.]
#  [6. 2. 3.]
#  [6. 2. 3.]
#  [6. 2. 3.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [6. 0. 0.]
#  [7. 2. 3.]
#  [7. 2. 3.]
#  [7. 2. 3.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [7. 0. 0.]
#  [8. 2. 3.]
#  [8. 2. 3.]
#  [8. 2. 3.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]
#  [8. 0. 0.]]
In this case, matrix is padded_dataset and matrix2 is pads.

groups has a length of 122000.

UPDATE:

How can I do this in a more efficient way?

You could consider replacing np.vstack() / np.append() with the analogous list operations, and converting the final result to a NumPy array with np.array() at the end. The end result could look like:

def pad_groups_opt(dataset, groups):
    dataset = np.array(dataset)
    max_subtree_length = get_max_subtree_length(groups)
    start = 0
    rows, cols = dataset.shape
    padded_dataset = []
    for group in groups:
        pad = [group[0]] + [0] * (cols - 1)
        stop = start + group[1]
        subtree = dataset[start:stop].tolist()
        padded_dataset.extend(subtree)
        subtree_to_pad = max_subtree_length - group[1]
        pads = [pad] * subtree_to_pad
        padded_dataset.extend(pads)
        start = stop
    return np.array(padded_dataset)
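The win comes from the fact that list.extend is cheap per row, while every np.vstack call allocates a fresh array and copies all rows accumulated so far, so the original loop does work quadratic in the total number of rows. A minimal sketch of the copy behavior (nothing here is specific to the question's data):

```python
import numpy as np

a = np.zeros((3, 3))
b = np.vstack([a, np.ones((1, 3))])

# vstack never appends in place: it returns a brand-new array
# holding a copy of every input row
assert b.shape == (4, 3)
assert not np.shares_memory(a, b)
```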
Testing it against the original code:

dataset = np.array([
    [1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 2, 3], [2, 2, 3],
    [2, 2, 3], [3, 2, 3], [3, 2, 3], [3, 2, 3], [4, 2, 3],
    [4, 2, 3], [4, 2, 3], [5, 2, 3], [5, 2, 3], [5, 2, 3],
    [6, 2, 3], [6, 2, 3], [6, 2, 3], [7, 2, 3], [7, 2, 3],
    [7, 2, 3], [8, 2, 3], [8, 2, 3], [8, 2, 3]])
groups = [(1, 3), (2, 3), (3, 3), (4, 3), (5, 3), (6, 3), (7, 3), (8, 3)]

print(np.all(pad_groups(dataset, groups) == pad_groups_opt(dataset, groups)))
# True
Timing-wise, this gets you a ~2x speed-up with your input:

%timeit pad_groups(dataset, groups)
# 10000 loops, best of 3: 169 µs per loop
%timeit pad_groups_opt(dataset, groups)
# 10000 loops, best of 3: 89.3 µs per loop
For larger inputs it seems to get even better (~10x):

%timeit pad_groups(dataset.tolist() * 100, groups * 100)
# 10 loops, best of 3: 107 ms per loop
%timeit pad_groups_opt(dataset.tolist() * 100, groups * 100)
# 100 loops, best of 3: 9.21 ms per loop
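If memory is the bottleneck, another option is to preallocate the full output with np.zeros and fill it by slicing, which avoids both the repeated vstack copies and the tolist() round-trips. This is a sketch under the same assumptions as above; pad_groups_prealloc is a name introduced here, and get_max_subtree_length is the dummy from the question:

```python
import numpy as np


def get_max_subtree_length(groups):
    # dummy function just for testing, as in the question
    return 20


def pad_groups_prealloc(dataset, groups):
    dataset = np.asarray(dataset)
    max_subtree_length = get_max_subtree_length(groups)
    rows, cols = dataset.shape
    # one block of max_subtree_length output rows per group
    out = np.zeros((len(groups) * max_subtree_length, cols), dtype=dataset.dtype)
    start = 0
    for i, (group_id, length) in enumerate(groups):
        block = out[i * max_subtree_length:(i + 1) * max_subtree_length]
        # copy the real rows of this batch, then stamp the group id
        # into the first column of the remaining (already-zero) pad rows
        block[:length] = dataset[start:start + length]
        block[length:, 0] = group_id
        start += length
    return out
```

Since the output is written in place, no intermediate lists or array copies are created beyond the single allocation.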

Could you please provide the complete code (I can read a return but no def …) along with some test input/output?

@norok2 I provided the full function code. In this state the function works, but at some point it gets slow.

Not really. Try copy-pasting the code into a fresh interpreter and running the function, and you will see what is missing. Also, some test input/output is missing. Are we supposed to guess what dataset is?

I am currently running this function on Google Colab and it works. dataset is a matrix of numeric vectors, and groups is a list of pairs where the first element of a pair is the identifier of a batch of dataset and the second element tells me how long that batch is. What I want to do is pad every batch to a fixed length equal to max_subtree_length.

If you believe that all the code required has been posted, please remove the noise (all the code that is not needed for your question). See also.

Thanks, that is very fast. Unfortunately it fills up all the RAM. Maybe I have to reduce the number of groups.

If you can avoid the conversions from/to NumPy arrays, that will save you some memory. Also, if you need to consume the result "row by row", you could consider rewriting the code as a generator.

I am padding the dataset to feed it into an LSTM. What do you mean by "code as a generator"?

Using yield instead of return; maybe you can read up on it, but if this is the input to an LSTM, I don't think you can easily do that.