Python Numbapro: no speedup with matrix multiplication

For the past few days I have been trying to understand why Numbapro (Accelerate, from Continuum Analytics, Inc.; I am running a 30-day trial) achieves no speedup on my MacBook Pro (Intel Core i7, 2.6 GHz, 16 GB RAM, NVIDIA GeForce GT 650M with 1 GB on the PCI bus).

I took an example from the (NxM)x(MxN) matrix multiplication code that Continuum Analytics, Inc. claims is accelerated via CUDA, and compared the timings of CUDA.JIT and numpy. My idea is to run, say, 1e4 iterations, with matrix B randomized on every iteration. Below is the code I used and the timings I obtained. Is there any solution? Thanks.

from numbapro import *
from numba import *
import numpy as np
import math
from timeit import default_timer as timer

m=1000
n=1000
A = np.array(np.random.random((n,m)), dtype=np.float32)
C = np.empty([n,n])

iterations = 10000

start = timer()
for i in range(iterations):
    B = np.array(np.random.random((m,n)), dtype=np.float32)
    X=np.dot(A,B)
numpy_time=(timer() - start)

@cuda.jit(void(float32[:,:],float32[:,:],float32[:,:]))
def cu_square_matrix_mul(A, B, C):

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    x = tx + bx * bw
    y = ty + by * bh
    n = C.shape[0]

    if x >= n or y >= n:
        return

    cs = 0
    for i in range(n):
        cs += A[y,i]*B[i,x]
    C[y,x]= cs

    cuda.syncthreads()

blockdim = 256,3
griddim = 10,3

stream = cuda.stream()
dA = cuda.to_device(A, stream)
dC = cuda.to_device(C, stream)

start = timer()    
for i in range(iterations):
    B = np.array(np.random.random((m,n)), dtype=np.float32)
    dB = cuda.to_device(B, stream)
    cu_square_matrix_mul[griddim,blockdim,stream](dA, dB, dC) 
    dC.to_host()
    stream.synchronize()
cuda_time = (timer() - start)    

print
print("Numpy took    %f seconds" % numpy_time)
print("CUDA JIT took %f seconds, %.5fx speedup" % (cuda_time, numpy_time / cuda_time))
Results:

Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 30 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 30 days
Vendor:  Continuum Analytics, Inc.
Package: numbapro
Message: trial mode expires in 30 days

Numpy took    378.328881 seconds
CUDA JIT took 342.723757 seconds, 1.10389x speedup

This is a completely naive matrix multiplication routine on the GPU, whereas the numpy routine is effectively a library call:

X=np.dot(A,B)

which is likely highly optimized. I am impressed the GPU comes out faster at all.


The "solution" would be to call a library routine designed for matrix multiplication, rather than writing your own kernel.
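The gap between a hand-rolled loop and a tuned library call is visible even on the CPU: the same triple-loop algorithm as the CUDA kernel above, written in pure Python, is orders of magnitude slower than np.dot, which dispatches to an optimized BLAS (MKL in this Anaconda setup). A minimal illustrative sketch, with the size reduced so the naive loop finishes quickly (naive_matmul is a stand-in helper, not part of NumbaPro):

```python
import numpy as np
from timeit import default_timer as timer

def naive_matmul(A, B):
    """Triple-loop multiply -- the same algorithm as the CUDA kernel above."""
    n, m = A.shape
    _, p = B.shape
    C = np.zeros((n, p), dtype=A.dtype)
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

n = 64  # small, so the pure-Python loop stays tractable
A = np.random.random((n, n)).astype(np.float32)
B = np.random.random((n, n)).astype(np.float32)

start = timer()
C_naive = naive_matmul(A, B)
naive_t = timer() - start

start = timer()
C_blas = np.dot(A, B)  # dispatches to an optimized BLAS
blas_t = timer() - start

print("naive: %.4fs, np.dot: %.6fs" % (naive_t, blas_t))
```

The GPU kernel in the question is the naive algorithm; a library kernel (tiled, using shared memory, tuned per architecture) is what it would have to compete against.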

How much of the total time is spent on (1) generating the random matrices B and (2) the data transfers CPU->GPU->CPU? I am not familiar with numpy, but the call to stream.synchronize() suggests the code is actively suppressing overlap between GPU and CPU work, as well as between host-to-device data copies and kernel execution; in other words, everything runs fully synchronously. Also, the names imported by "from numbapro import *" will be overwritten by the subsequent "from numba import *". Is that intentional?
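The shadowing described above is ordinary Python behaviour: a later "from ... import *" silently rebinds any names the two modules share. A small self-contained sketch using two throwaway stand-in modules (mod_a and mod_b are hypothetical; numbapro and numba themselves are not required):

```python
import sys
import types

# Two throwaway modules that both export a name `jit`,
# standing in for numbapro and numba.
mod_a = types.ModuleType("mod_a")
mod_a.jit = "from mod_a"
mod_a.__all__ = ["jit"]
sys.modules["mod_a"] = mod_a

mod_b = types.ModuleType("mod_b")
mod_b.jit = "from mod_b"
mod_b.__all__ = ["jit"]
sys.modules["mod_b"] = mod_b

exec("from mod_a import *")  # binds jit -> "from mod a" version
exec("from mod_b import *")  # silently rebinds jit to mod_b's version
print(jit)
```

Whichever of numba and numbapro is imported last wins for every shared name, so the question's cuda.jit decorator may not come from the package the author thinks it does.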