Looping over (or vectorizing) variable-length matrices in Theano


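I have a list of matrices L, where each item M is an x*n matrix (x is variable, n is constant).

I want to compute the sum of M'*M over all items in L (M' is the transpose of M), as in the following Python code: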

for M in L:
  res += np.dot(M.T, M)

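What I actually want is to implement this in Theano (which does not support variable-length multidimensional arrays), and I don't want to pad all the matrices to the same size, because that would waste too much space (some of the matrices can be very large).

Is there a better way to do this?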
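To make the space concern concrete, here is a small sketch that counts how many elements have to be stored when every matrix is zero-padded to the height of the tallest one, compared with stacking the matrices along the first axis. The helper names and the row counts are hypothetical, chosen only for illustration.

```python
def padded_elements(row_counts, n):
    """Elements stored if every matrix is zero-padded to the tallest one."""
    return len(row_counts) * max(row_counts) * n


def concatenated_elements(row_counts, n):
    """Elements stored if the matrices are stacked along axis 0."""
    return sum(row_counts) * n


row_counts = [1, 2, 1000]  # hypothetical x values, one per matrix
n = 5

print(padded_elements(row_counts, n))        # 3 * 1000 * 5 = 15000
print(concatenated_elements(row_counts, n))  # 1003 * 5 = 5015
```

The more the x values differ, the larger the gap between the two numbers becomes.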

Edit: L is known before Theano compilation.

Edit: I got two excellent answers from @DanielRenshaw and @Divakar, and it is emotionally difficult to pick just one to accept.

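Given that the number of matrices is known before Theano compilation has to happen, you can simply use a regular Python list of Theano matrices.

Here is a complete example showing the differences between the numpy and Theano versions.

This code has been updated to include a comparison with @Divakar's vectorized approach, which performs better. Two vectorized approaches are possible with Theano: one where Theano performs the concatenation, and one where numpy performs the concatenation and the concatenated result is then passed to Theano.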

import timeit
import numpy as np
import theano
import theano.tensor as tt


def compile_theano_version1(number_of_matrices, n, dtype):
    # One symbolic matrix per input; the Python loop is unrolled into the graph.
    assert number_of_matrices > 0
    assert n > 0
    L = [tt.matrix() for _ in range(number_of_matrices)]
    res = tt.zeros((n, n), dtype=dtype)
    for M in L:
        res += tt.dot(M.T, M)
    return theano.function(L, res)


def compile_theano_version2(number_of_matrices):
    # Theano concatenates the inputs, then performs a single dot product.
    assert number_of_matrices > 0
    L = [tt.matrix() for _ in range(number_of_matrices)]
    concatenated_L = tt.concatenate(L, axis=0)
    res = tt.dot(concatenated_L.T, concatenated_L)
    return theano.function(L, res)


def compile_theano_version3():
    # The caller concatenates with numpy and passes in one tall matrix.
    concatenated_L = tt.matrix()
    res = tt.dot(concatenated_L.T, concatenated_L)
    return theano.function([concatenated_L], res)


def numpy_version1(*L):
    assert len(L) > 0
    n = L[0].shape[1]
    res = np.zeros((n, n), dtype=L[0].dtype)
    for M in L:
        res += np.dot(M.T, M)
    return res


def numpy_version2(*L):
    concatenated_L = np.concatenate(L, axis=0)
    return np.dot(concatenated_L.T, concatenated_L)


def main():
    iteration_count = 100
    number_of_matrices = 20
    n = 300
    min_x = 400
    dtype = 'float64'
    theano_version1 = compile_theano_version1(number_of_matrices, n, dtype)
    theano_version2 = compile_theano_version2(number_of_matrices)
    theano_version3 = compile_theano_version3()
    L = [np.random.standard_normal(size=(x, n)).astype(dtype)
         for x in range(min_x, number_of_matrices + min_x)]

    # Each timing accumulates the result over iteration_count runs.
    start = timeit.default_timer()
    numpy_res1 = sum(numpy_version1(*L)
                     for _ in range(iteration_count))
    print('numpy_version1', timeit.default_timer() - start)

    start = timeit.default_timer()
    numpy_res2 = sum(numpy_version2(*L)
                     for _ in range(iteration_count))
    print('numpy_version2', timeit.default_timer() - start)

    start = timeit.default_timer()
    theano_res1 = sum(theano_version1(*L)
                      for _ in range(iteration_count))
    print('theano_version1', timeit.default_timer() - start)

    start = timeit.default_timer()
    theano_res2 = sum(theano_version2(*L)
                      for _ in range(iteration_count))
    print('theano_version2', timeit.default_timer() - start)

    # The numpy concatenation is deliberately included in the timed region.
    start = timeit.default_timer()
    theano_res3 = sum(theano_version3(np.concatenate(L, axis=0))
                      for _ in range(iteration_count))
    print('theano_version3', timeit.default_timer() - start)

    assert np.allclose(numpy_res1, numpy_res2)
    assert np.allclose(numpy_res2, theano_res1)
    assert np.allclose(theano_res1, theano_res2)
    assert np.allclose(theano_res2, theano_res3)


main()

When run, this prints the elapsed time for each of the five versions.

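The assertions pass, which shows that the Theano and numpy versions all compute the same result to a high degree of precision. Clearly, this precision will drop if float32 is used instead of float64.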
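To get a feel for how much precision float32 costs, here is a minimal NumPy-only sketch (no Theano involved). The sizes, the seed, and the helper names loop_sum and concat_dot are assumptions made for illustration; they are not part of the script above. Both dtypes are compared against the same float64 reference.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 8
L64 = [rng.standard_normal(size=(x, n)) for x in range(50, 60)]
L32 = [M.astype('float32') for M in L64]


def loop_sum(L):
    """Sum of M'M, one matrix at a time."""
    n = L[0].shape[1]
    res = np.zeros((n, n), dtype=L[0].dtype)
    for M in L:
        res += np.dot(M.T, M)
    return res


def concat_dot(L):
    """Same quantity via a single dot product on the stacked matrices."""
    C = np.concatenate(L, axis=0)
    return np.dot(C.T, C)


reference = loop_sum(L64)
err64 = np.abs(concat_dot(L64) - reference).max()
err32 = np.abs(concat_dot(L32) - reference).max()
print('float64 max abs deviation:', err64)
print('float32 max abs deviation:', err32)
```

The float32 deviation should come out several orders of magnitude larger than the float64 one, so a default np.allclose tolerance that is comfortable for float64 may be tight for float32 on larger inputs.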

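The timing results show that the vectorized approach is not always preferable; it depends on the sizes of the matrices. In the example above the matrices are large and the non-concatenating approach is faster, but if the n and min_x parameters in the main function are changed to be much smaller, the concatenating approach is faster. The results may be different when running on a GPU (Theano versions only).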
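One way to explore how the matrix sizes affect the outcome on a given machine is a small NumPy-only harness like the sketch below. The function compare, its parameters, and the two size settings are hypothetical and only meant as a starting point; the Theano versions are left out so that the snippet runs without Theano installed.

```python
import timeit

import numpy as np


def loop_sum(L):
    """Mirrors numpy_version1: accumulate M'M one matrix at a time."""
    n = L[0].shape[1]
    res = np.zeros((n, n), dtype=L[0].dtype)
    for M in L:
        res += np.dot(M.T, M)
    return res


def concat_dot(L):
    """Mirrors numpy_version2: concatenate, then one dot product."""
    C = np.concatenate(L, axis=0)
    return np.dot(C.T, C)


def compare(number_of_matrices, n, min_x, repeats=5):
    """Return the best-of-`repeats` time in seconds for both approaches."""
    L = [np.random.standard_normal(size=(x, n))
         for x in range(min_x, min_x + number_of_matrices)]
    assert np.allclose(loop_sum(L), concat_dot(L))
    return {
        'loop': min(timeit.repeat(lambda: loop_sum(L),
                                  number=1, repeat=repeats)),
        'concatenate': min(timeit.repeat(lambda: concat_dot(L),
                                         number=1, repeat=repeats)),
    }


# Small matrices versus moderately large ones (illustrative sizes only).
for n, min_x in [(5, 4), (100, 200)]:
    print('n=%d min_x=%d' % (n, min_x), compare(20, n, min_x))
```

Which approach wins will depend on the machine and the BLAS library behind numpy, so no expected output is given here.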

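You can simply stack the input arrays along the first axis, whose length is then the sum of all the x values. This gives a tall (X, n) array, where X = x1 + x2 + x3 + .... Transpose it and take its dot product with itself, and you get the desired output of shape (n, n). All of this is a purely vectorized solution that leverages the powerful dot product. The implementation would therefore be: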

# Concatenate along axis=0
Lcat = np.concatenate(L,axis=0)

# Perform dot product of the transposed version with self
out = Lcat.T.dot(Lcat)
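The reason this works: entry (i, j) of Lcat.T.dot(Lcat) is a sum over all rows of Lcat, and every row belongs to exactly one of the original matrices, so grouping the rows by matrix gives back the sum of the individual M'*M terms. A tiny worked example with hand-checkable integers (the matrices M1 and M2 are made up for illustration):

```python
import numpy as np

M1 = np.array([[1, 2]])            # shape (1, 2)
M2 = np.array([[3, 4], [5, 6]])    # shape (2, 2)

# Loop-style result: sum of the individual M'M terms.
looped = M1.T.dot(M1) + M2.T.dot(M2)

# Vectorized result: one dot product on the stacked (3, 2) array.
Lcat = np.concatenate([M1, M2], axis=0)
stacked = Lcat.T.dot(Lcat)

# Both equal [[35, 44], [44, 56]], e.g. 35 = 1*1 + 3*3 + 5*5.
print(looped.tolist())
print(stacked.tolist())
```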

Runtime tests and output verification:

In [116]: def vectoized_approach(L):
     ...:   Lcat = np.concatenate(L,axis=0)
     ...:   return Lcat.T.dot(Lcat)
     ...: 
     ...: def original_app(L):
     ...:   n = L[0].shape[1]
     ...:   res = np.zeros((n,n))
     ...:   for M in L:
     ...:     res += np.dot(M.T, M)
     ...:   return res
     ...: 

In [117]: # Input
     ...: L = [np.random.rand(np.random.randint(1,9),5)for iter in range(1000)]

In [118]: np.allclose(vectoized_approach(L),original_app(L))
Out[118]: True

In [119]: %timeit original_app(L)
100 loops, best of 3: 3.84 ms per loop

In [120]: %timeit vectoized_approach(L)
1000 loops, best of 3: 632 µs per loop

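In addition to @DanielRenshaw's answer: if we increase the number of matrices to 1000, the compile_theano_version1 function fails with "RuntimeError: maximum recursion depth exceeded", and compile_theano_version2 seems to take forever to compile. This can be worked around by using typed_list:

def compile_theano_version4(number_of_matrices, n):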
    import theano.typed_list
    L = theano.typed_list.TypedListType(tt.TensorType(theano.config.floatX, broadcastable=(None, None)))()
    res, _ = theano.scan(fn=lambda i: tt.dot(L[i].T, L[i]),
                         sequences=[theano.tensor.arange(number_of_matrices, dtype='int64')])
    return theano.function([L], res.sum(axis=0))
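A different idea, not taken from the answers here and not verified against Theano, is to keep the unrolled Python loop but add the terms as a balanced tree instead of a long chain. This rests on the assumption that the recursion error comes from the depth of the chain of additions built by `res += ...`; with pairwise addition the nesting depth grows like log2 of the number of terms rather than linearly. The helper pairwise_sum below is hypothetical and is demonstrated with numpy arrays; in a Theano graph the terms would be the symbolic tt.dot(M.T, M) expressions.

```python
import numpy as np


def pairwise_sum(terms):
    """Add terms as a balanced tree so nesting depth is about log2(len)."""
    terms = list(terms)
    assert len(terms) > 0
    while len(terms) > 1:
        # Add neighbours in pairs: (0, 1), (2, 3), ...
        paired = [terms[i] + terms[i + 1]
                  for i in range(0, len(terms) - 1, 2)]
        # Carry an unpaired last term over to the next round.
        if len(terms) % 2:
            paired.append(terms[-1])
        terms = paired
    return terms[0]


L = [np.random.rand(np.random.randint(1, 9), 5) for _ in range(1000)]
res = pairwise_sum(np.dot(M.T, M) for M in L)
print(res.shape)
```

Whether this actually avoids the recursion limit, and how it affects compile time, would need to be checked on a real Theano installation.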

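In addition, I set the data type of all relevant variables to float32 and ran @DanielRenshaw's script on a GPU. It turned out that @Divakar's suggestion (theano_version3) is the most efficient in this case, although, as @DanielRenshaw said, working with one huge matrix may not always be good practice.

Here are the settings and the output on my machine:

iteration_count = 100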
number_of_matrices = 200
n = 300
min_x = 20
dtype = 'float32'
theano.config.floatX = dtype


numpy_version1 5.30542397499
numpy_version2 3.96656394005
theano_version1 5.26742005348
theano_version2 1.76983904839
theano_version3 1.03577589989
theano_version4 5.58366179466
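To read these numbers as relative speedups, the timings can be divided by a baseline. The sketch below simply copies the six values printed above into a dictionary and uses numpy_version1 as the baseline; the variable names are made up for illustration.

```python
timings = {
    'numpy_version1': 5.30542397499,
    'numpy_version2': 3.96656394005,
    'theano_version1': 5.26742005348,
    'theano_version2': 1.76983904839,
    'theano_version3': 1.03577589989,
    'theano_version4': 5.58366179466,
}

baseline = timings['numpy_version1']

# Print the versions from fastest to slowest with their speedup factor.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print('%-16s %6.2fs  %5.2fx vs numpy_version1'
          % (name, seconds, baseline / seconds))

fastest = min(timings, key=timings.get)
```

On these figures theano_version3 is roughly five times faster than numpy_version1, while the typed_list variant (theano_version4) is the slowest of the six.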

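Is the length of L known before Theano compilation?

@DanielRenshaw Yes, and the shape of every matrix in L is known as well.

Thank you so much, Daniel, this works for me.

Could you use a bigger number for the number of matrices? Since the original code loops over them, it makes sense to have a reasonably large number there.

Increasing the number of matrices from 20 to 200 doesn't change the relative timings. When the matrices are large, concatenation plus a vectorized dot is still significantly slower than iterating over the matrices one at a time.

This is indeed the preferred approach if the sizes of the xs vary little (i.e. there is no need to pad the matrices much). I've updated my answer to provide a more comprehensive comparison that includes this approach.

@DanielRenshaw Well, this approach just concatenates; there is no padding here. So if there is a decent number of arrays in the input list, I don't think the shapes of the input arrays would affect the performance difference.

The Theano version of this approach requires padding.

@DanielRenshaw It would!? Well, I guess I don't know much about Theano! Thanks for adding all those runtime tests.

Sorry, I was talking nonsense. It is indeed possible to do this without padding. I'll update my answer again.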