Multithreaded integer matrix multiplication in NumPy/SciPy

Tags: python, multithreading, numpy, matrix-multiplication, blas

Doing something such as

import numpy as np
a = np.random.rand(10**4, 10**4)
b = np.dot(a, a)

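uses multiple cores, and it runs nicely.

However, the elements of `a` are 64-bit floats (or 32-bit floats on 32-bit platforms?), and I would like to multiply 8-bit integer arrays. Trying the following, though: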

a = np.random.randint(2, size=(n, n)).astype(np.int8)

results in the dot product not using multiple cores, and thus running about 1000x slower on my PC.

array: np.random.randint(2, size=shape).astype(dtype)

dtype    shape          %time (average)

float32 (2000, 2000)    62.5 ms
float32 (3000, 3000)    219 ms
float32 (4000, 4000)    328 ms
float32 (10000, 10000)  4.09 s

int8    (2000, 2000)    13 seconds
int8    (3000, 3000)    3min 26s
int8    (4000, 4000)    12min 20s
int8    (10000, 10000)  It didn't finish in 6 hours

float16 (2000, 2000)    2min 25s
float16 (3000, 3000)    Not tested
float16 (4000, 4000)    Not tested
float16 (10000, 10000)  Not tested

I know that NumPy uses BLAS, which does not support integers, but if I use the SciPy BLAS wrappers, i.e.

import scipy.linalg.blas as blas
a = np.random.randint(2, size=(n, n)).astype(np.int8)
b = blas.sgemm(alpha=1.0, a=a, b=a)

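the computation is multithreaded. Now, `blas.sgemm` runs with exactly the same timing as `np.dot` does for float32 arrays, but for non-float inputs it converts everything to `float32` and outputs floats, which is something `np.dot` does not do. (In addition, `b` is now in `F_CONTIGUOUS` order, which is a lesser issue.)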

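So, if I want to do integer matrix multiplication, I have to do one of the following:

1. Use NumPy's painfully slow `np.dot` and be glad that I get to keep the 8-bit integers.
2. Use SciPy's `sgemm` and use up 4x the memory.
3. Use NumPy's `np.float16` and only use up 2x the memory, with the caveat that `np.dot` is much slower on float16 arrays than on float32 arrays, and even slower than on int8 arrays.
4. Find an optimized library for multithreaded integer matrix multiplication (actually, Mathematica does this, but I would prefer a Python solution), ideally supporting 1-bit arrays, although 8-bit arrays are also fine. (I am actually aiming to multiply matrices over the finite field Z/2Z. I know I can do this with Sage, which is quite Pythonic, but is there something that is strictly Python?)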
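The "4x" and "2x" memory figures in options 2 and 3 follow directly from the element sizes. As a small illustration (the helper name `operand_megabytes` is made up for this sketch), the size of a single n x n operand can be computed like this:

```python
import numpy as np


def operand_megabytes(n, dtype):
    """Size in MB (10**6 bytes) of one dense n x n array of the given dtype."""
    return n * n * np.dtype(dtype).itemsize / 1e6


n = 10_000  # side length of the largest matrix in the timing table
for dtype in (np.int8, np.float16, np.float32):
    print(f"{np.dtype(dtype).name:8s} {operand_megabytes(n, dtype):.0f} MB")
```

For n = 10000 this gives 100 MB for int8, 200 MB for float16 and 400 MB for float32, per operand and before counting the result array.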

Can I go with option 4? Does such a library exist?

Disclaimer: I am actually running NumPy + MKL, but I tried a similar test on vanilla NumPy, with similar results.

Note that as this answer gets older, NumPy might gain optimized integer support. Please verify that this answer is still faster on your setup.

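Option 5: Roll a custom solution. Partition the matrix product into a few sub-products and perform these in parallel. This can be implemented relatively easily with standard Python modules. The sub-products are computed with `numpy.dot`, which releases the global interpreter lock. It is therefore possible to use threads, which are relatively lightweight and can access the arrays of the main thread directly, which keeps the approach memory-efficient.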

Implementation:

import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time


def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (nrows, ncols, n, m) where
    n * nrows, m * ncols = arr.shape.
    This should be a view of the original array.
    """
    h, w = arr.shape
    n, m = h // nrows, w // ncols
    return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)


def do_dot(a, b, out):
    # np.dot(a, b, out) is rejected here: `out` must be C-contiguous, and the
    # block views produced by blockshaped() generally are not.
    out[:] = np.dot(a, b)  # goes through a temporary array, so slightly less efficient


def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
    """
    Return the matrix product a * b.
    The product is split into nblocks * mblocks partitions that are performed
    in parallel threads.
    """
    n_jobs = nblocks * mblocks
    print('running {} jobs in parallel'.format(n_jobs))

    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)

    out_blocks = blockshaped(out, nblocks, mblocks)
    a_blocks = blockshaped(a, nblocks, 1)
    b_blocks = blockshaped(b, 1, mblocks)

    threads = []
    for i in range(nblocks):
        for j in range(mblocks):
            th = threading.Thread(target=dot_func, 
                                  args=(a_blocks[i, 0, :, :], 
                                        b_blocks[0, j, :, :], 
                                        out_blocks[i, j, :, :]))
            th.start()
            threads.append(th)

    for th in threads:
        th.join()

    return out


if __name__ == '__main__':
    a = np.ones((4, 3), dtype=int)
    b = np.arange(18, dtype=int).reshape(3, 6)
    assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))

    a = np.random.randn(1500, 1500).astype(int)

    start = time()
    pardot(a, a, 2, 4)
    time_par = time() - start
    print('pardot: {:.2f} seconds taken'.format(time_par))

    start = time()
    np.dot(a, a)
    time_dot = time() - start
    print('np.dot: {:.2f} seconds taken'.format(time_dot))

With this implementation I get a speedup of approximately 4x, which matches the number of physical cores in my machine:

running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken
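As a variation on the same idea, here is a minimal sketch that splits only the rows of `a` and uses `concurrent.futures.ThreadPoolExecutor` instead of raw threads. The function name `pardot_rows` and the `n_workers` parameter are made up for this illustration. Splitting by rows means each output slice is C-contiguous, so `np.dot` can write into it directly, and the row count does not have to be a multiple of the number of workers. Whether it is actually faster than plain `np.dot` on a given NumPy build should be measured.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def pardot_rows(a, b, n_workers=4):
    """Compute a @ b by splitting the rows of `a` across a thread pool."""
    rows = a.shape[0]
    n_workers = max(1, min(n_workers, rows))  # never more workers than rows
    out = np.empty((rows, b.shape[1]), dtype=np.result_type(a, b))
    # Row boundaries of each chunk; chunks may differ in size by one row.
    bounds = np.linspace(0, rows, n_workers + 1, dtype=int)

    def work(k):
        lo, hi = bounds[k], bounds[k + 1]
        # Row slices of a C-contiguous array are C-contiguous, so np.dot
        # can write the partial product in place without a temporary.
        np.dot(a[lo:hi], b, out=out[lo:hi])

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(work, range(n_workers)))  # list() re-raises worker errors
    return out


if __name__ == "__main__":
    a = np.random.randint(2, size=(1000, 1000)).astype(np.int8)
    assert np.array_equal(pardot_rows(a, a), np.dot(a, a))
```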

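A linked question explains why integers are so slow: first, CPUs have high-throughput floating-point pipelines; second, BLAS has no integer type.

Workaround: converting the matrices to `float32` values gives a big speedup. How does a 90x speedup on a 2015 MacBook Pro sound? (Using `float64` is half as good.)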
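One caveat with the float32 workaround is exactness: float32 has a 24-bit significand, so integer results are only guaranteed exact while every partial sum stays below 2**24 in magnitude. The following sketch (the function name `int_dot_via_float32` and the conservative bound are choices made for this illustration, not part of the answer) checks that condition before casting:

```python
import numpy as np


def int_dot_via_float32(a, b, out_dtype=np.int32):
    """Multiply integer matrices through float32, refusing inexact cases."""
    # Largest absolute entry of each operand, computed with Python ints so
    # that abs(-128) on an int8 array cannot overflow.
    max_a = max(abs(int(a.min())), abs(int(a.max())))
    max_b = max(abs(int(b.min())), abs(int(b.max())))
    # Conservative bound on the magnitude of any partial sum.
    bound = a.shape[1] * max_a * max_b
    if bound >= 2 ** 24:
        raise ValueError("result may not be exact in float32; use float64")
    product = a.astype(np.float32) @ b.astype(np.float32)
    return product.astype(out_dtype)


rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=(300, 300), dtype=np.int8)
c = int_dot_via_float32(a, a)  # int32 result, exact for these inputs
```

Returning a wider integer type than int8 is deliberate: with entries up to 9 and 300 terms per sum, the exact results do not fit into 8 bits.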

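Regarding your option 4, have you looked at GPU-backed libraries? They allow large operations on the GPU (with a simple interface to numpy) and perform fairly well.

As a possible answer to option 4, M4RI looks interesting. "M4RI is a library for fast arithmetic with dense matrices over F2." I guess Sage is already using this library, but I don't see why you couldn't use it directly from Python with a suitable Cython wrapper. (In fact, you can probably already find such a wrapper in the Sage source.)

No one has mentioned `numpy.einsum` yet, but that might be a good option 5.

Note that if you want to avoid integer overflow you will need to cast the result to a larger type. If every element is either 0 or 1, you need an integer format that can hold at least the value `n` to guarantee no overflow. For the `n=10000` example, (u)int16 should be enough.

Are your real matrices sparse? If so, you are probably better off using `scipy.sparse.csr_matrix`.

Could you give more context for the overall problem you are trying to solve? Multiplying big integer matrices is a pretty unusual thing to do. It would be particularly helpful to know more about the nature of these matrices. Are the values always either 0 or 1? If they can be larger, you might ultimately be constrained by the largest integer that can be represented with uint64. How are the matrices generated? Do they have any special structure (e.g. symmetric, block, banded)?

It works! This is an `O(n**3)` matrix product, which does exactly `n**2` dot products, right?

It breaks the matrix product into a number of smaller matrix products. In the extreme case these could be vector dot products.

When the type is float, pardot is slower than np.dot: "running 4 jobs in parallel", "running 8 jobs in parallel", pardot: 0.13 seconds taken, np.dot: 0.07 seconds taken. It is even worse when the dataset is 10 times the size: pardot: 1212.89 seconds taken, np.dot: 73.11 seconds taken.

@kory This is expected. Please use `np.dot` for float multiplication.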
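The overflow concern and the `numpy.einsum` suggestion from the comments can be illustrated together. This is only a sketch of the behaviour, not a performance claim: `einsum` accepts a `dtype` argument, so the sums can be accumulated in a wider type while the operands stay in int8.

```python
import numpy as np

n = 200  # already larger than the int8 maximum of 127
a = np.ones((n, n), dtype=np.int8)

# np.dot keeps the int8 dtype, so every entry (which should be n) wraps around.
wrapped = a.dot(a)

# einsum with an explicit dtype computes in int16, which can hold n = 200.
widened = np.einsum("ij,jk->ik", a, a, dtype=np.int16)

print(wrapped.dtype, widened.dtype)
print(bool((widened == n).all()), bool((wrapped == n).any()))
```

Whether `einsum` is faster or slower than `np.dot` for a given size and dtype needs to be timed on the target machine.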

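import numpy as np
import time

def timeit(func):
    # Return the wall-clock time of a single call to func().
    start = time.perf_counter()
    func()
    return time.perf_counter() - start

# np.random.random_integers is deprecated; randint's upper bound is exclusive.
a = np.random.randint(0, 10, size=(1000, 1000)).astype(np.int8)

timeit(lambda: a.dot(a))  # ≈0.9 sec (the int8 result wraps around on overflow)
# Cast the result to int32 rather than int8: entries can reach 81 * 1000 here.
timeit(lambda: a.astype(np.float32).dot(a.astype(np.float32)).astype(np.int32))  # ≈0.01 sec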
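For the goal stated in the question, multiplication over Z/2Z, the same float32 cast can be combined with a reduction mod 2. The sketch below assumes the inputs contain only 0 and 1; the function name `matmul_gf2` is made up for this illustration, and it is not a substitute for a dedicated library such as M4RI, since it still uses 4 bytes per element during the product.

```python
import numpy as np


def matmul_gf2(a, b):
    """Multiply 0/1 matrices over Z/2Z via a float32 product reduced mod 2."""
    # Each entry of the integer product is a count between 0 and the inner
    # dimension, which float32 represents exactly below 2**24.
    if a.shape[1] >= 2 ** 24:
        raise ValueError("inner dimension too large for exact float32 sums")
    counts = a.astype(np.float32) @ b.astype(np.float32)
    # Keep only the parity of each count.
    return (counts.astype(np.int32) & 1).astype(np.uint8)


rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=(500, 500), dtype=np.uint8)
c = matmul_gf2(a, a)  # uint8 array of zeros and ones
```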