Creating an array on the GPU with Numba CUDA in Python

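I want to evaluate a function at every point of a mesh. The problem is that if I create the mesh on the CPU side, transferring it to the GPU takes longer than the actual computation. Can I generate the mesh on the GPU side?

The code below shows the mesh being created on the CPU side and most of the expression being evaluated on the GPU side (I wasn't sure how to get atan2 working on the GPU, so I left it on the CPU side). I should apologize in advance and say that I'm still learning this stuff, so I'm sure there is a lot of room for improvement in the code below.

Thanks!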

import math
from numba import vectorize, float64
import numpy as np
from time import time

@vectorize([float64(float64,float64,float64,float64)],target='cuda')
def a_cuda(lat1, lon1, lat2, lon2):
    return  (math.sin(0.008726645 * (lat2 - lat1))**2) + \
             math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)

def LLA_distance_numba_cuda(lat1, lon1, lat2, lon2):
    a = a_cuda(np.ascontiguousarray(lat1), np.ascontiguousarray(lon1), 
               np.ascontiguousarray(lat2), np.ascontiguousarray(lon2))
    return earthdiam_nm * np.arctan2(a,1-a)

# generate a mesh of one million evaluation points
nx, ny = 1000,1000
xv, yv = np.meshgrid(np.linspace(29, 31, nx), np.linspace(99, 101, ny))
X, Y = np.float64(xv.reshape(1,nx*ny).flatten()), np.float64(yv.reshape(1,nx*ny).flatten())
X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))

start = time()
LLA_distance_numba_cuda(X,Y,X2,Y2)
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))

Let's establish a performance baseline. Adding a definition for earthdiam_nm (1.0) and running your code under nvprof, we have:

$ nvprof python t38.py
1000000 total evaluations in 0.581 seconds
(...)
==1973== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   55.58%  11.418ms         4  2.8544ms  2.6974ms  3.3044ms  [CUDA memcpy HtoD]
                   28.59%  5.8727ms         1  5.8727ms  5.8727ms  5.8727ms  cudapy::__main__::__vectorized_a_cuda$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>)
                   15.83%  3.2521ms         1  3.2521ms  3.2521ms  3.2521ms  [CUDA memcpy DtoH]
(...)
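Of the four host-to-device copies in this profile, the two for X2 and Y2 do nothing really except pass the values 30 and 101 to each "worker". By "worker" I mean a particular scalar computation within the numba process of "broadcasting" a vectorize function over a large data set. The numba vectorize/broadcast process doesn't require every input to be a data set of the same size, only that the data provided can be "broadcast". So it is possible to create a vectorize ufunc that works on both arrays and scalars. This means each worker uses its array elements plus the scalars to perform its calculation.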
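As a CPU-side illustration of that broadcasting rule, here is a minimal sketch using plain NumPy rather than numba (NumPy ufuncs follow the same broadcasting idea). The helper name a_term is made up for this example; it mirrors the expression inside a_cuda.

```python
import numpy as np

def a_term(lat1, lon1, lat2, lon2):
    # Same expression as a_cuda, written with NumPy ufuncs so it runs on the CPU.
    return (np.sin(0.008726645 * (lat2 - lat1)) ** 2
            + np.cos(0.01745329 * lat1) * np.cos(0.01745329 * lat2)
            * np.sin(0.008726645 * (lon2 - lon1)) ** 2)

n = 1000
X = np.linspace(29.0, 31.0, n)
Y = np.linspace(99.0, 101.0, n)

# Version 1: constant-filled arrays, as in the question (two extra n-element inputs).
X2 = np.full(n, 30.0)
Y2 = np.full(n, 101.0)
full = a_term(X, Y, X2, Y2)

# Version 2: scalars broadcast against the arrays (no extra arrays needed).
scalar = a_term(X, Y, 30.0, 101.0)

# The two versions agree; the constant arrays only add bytes to move around.
print(np.allclose(full, scalar), X2.nbytes + Y2.nbytes)
```

With one million points, as in the question, the two constant arrays amount to 16 MB of float64 data that would otherwise have to be copied to the device.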

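Therefore the lowest-hanging fruit is to get rid of those two arrays and pass the values (30, 101) as scalars to the ufunc a_cuda. While we are going after "low-hanging fruit", let's also fold your arctan2 calculation (replaced with math.atan2) and your final scaling by earthdiam_nm into the vectorized code, so that we don't have to do that on the host in python/numpy: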

$ cat t39.py
import math
from numba import vectorize, float64
import numpy as np
from time import time
earthdiam_nm = 1.0
@vectorize([float64(float64,float64,float64,float64,float64)],target='cuda')
def a_cuda(lat1, lon1, lat2, lon2, s):
    a = (math.sin(0.008726645 * (lat2 - lat1))**2) + \
             math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)
    return math.atan2(a, 1-a)*s

def LLA_distance_numba_cuda(lat1, lon1, lat2, lon2):
    return a_cuda(np.ascontiguousarray(lat1), np.ascontiguousarray(lon1),
               np.ascontiguousarray(lat2), np.ascontiguousarray(lon2), earthdiam_nm)

# generate a mesh of one million evaluation points
nx, ny = 1000,1000
xv, yv = np.meshgrid(np.linspace(29, 31, nx), np.linspace(99, 101, ny))
X, Y = np.float64(xv.reshape(1,nx*ny).flatten()), np.float64(yv.reshape(1,nx*ny).flatten())
# X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))
start = time()
Z=LLA_distance_numba_cuda(X,Y,30.0,101.0)
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))
#print(Z)
$ nvprof python t39.py
==2387== NVPROF is profiling process 2387, command: python t39.py
1000000 total evaluations in 0.401 seconds
==2387== Profiling application: python t39.py
==2387== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   48.12%  8.4679ms         1  8.4679ms  8.4679ms  8.4679ms  cudapy::__main__::__vectorized_a_cuda$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>)
                   33.97%  5.9774ms         5  1.1955ms     864ns  3.2535ms  [CUDA memcpy HtoD]
                   17.91%  3.1511ms         4  787.77us  1.1840us  3.1459ms  [CUDA memcpy DtoH]
(snip)
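For spot-checking the GPU output, a scalar CPU reference can be handy. The sketch below is only an illustration: distance_ref and distance_literal are hypothetical helpers, not part of the original code. It writes the two literals as math.pi/180 and math.pi/360, which is what 0.01745329 and 0.008726645 approximate, and otherwise follows the a_cuda expression in t39.py as written.

```python
import math

DEG2RAD = math.pi / 180.0       # ~0.01745329, degrees to radians
HALF_DEG2RAD = math.pi / 360.0  # ~0.008726645, half of the above

def distance_ref(lat1, lon1, lat2, lon2, s):
    # Hypothetical CPU reference mirroring a_cuda from t39.py, one point at a time.
    a = (math.sin(HALF_DEG2RAD * (lat2 - lat1)) ** 2
         + math.cos(DEG2RAD * lat1) * math.cos(DEG2RAD * lat2)
         * math.sin(HALF_DEG2RAD * (lon2 - lon1)) ** 2)
    return math.atan2(a, 1 - a) * s

def distance_literal(lat1, lon1, lat2, lon2, s):
    # Same expression with the rounded literals used in the original code.
    a = (math.sin(0.008726645 * (lat2 - lat1)) ** 2
         + math.cos(0.01745329 * lat1) * math.cos(0.01745329 * lat2)
         * math.sin(0.008726645 * (lon2 - lon1)) ** 2)
    return math.atan2(a, 1 - a) * s

ref = distance_ref(29.0, 99.0, 30.0, 101.0, 1.0)
lit = distance_literal(29.0, 99.0, 30.0, 101.0, 1.0)
print(ref, lit)  # the two agree to several significant digits
```

A few entries of Z can then be compared against distance_ref(X[i], Y[i], 30.0, 101.0, earthdiam_nm) with a small tolerance, since the GPU math functions and the rounded literals will not match the CPU bit for bit.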
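Generating the mesh on the GPU as well, with a cuda.jit kernel in place of the vectorize ufunc (t40.py, listed below), we see:

  • The kernel run time is even longer, about 10ms (because the kernel is now doing the mesh generation)
  • There is no explicit copying of data from host to device
  • The overall function run time has dropped from ~0.4s to ~0.3s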

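  • I'm a bit confused. There is no Numba CUDA code in the code in the question.
  • Of course, the target here is CUDA, so if I can follow this paradigm that would be great. The @cuda.jit decorator is an option if needed. I just don't know what the simplest approach is.
  • Where is earthdiam_nm defined?
  • Works great, thanks! As a note, I added this to display the generated mesh as a grayscale image: from PIL import Image; array = Zh.reshape(nx, ny); img = Image.fromarray(array, 'F'); img.show()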
    $ cat t40.py
    import math
    from numba import vectorize, float64, cuda
    import numpy as np
    from time import time
    
    earthdiam_nm = 1.0
    
    @cuda.jit(device=True)
    def a_cuda(lat1, lon1, lat2, lon2, s):
        a = (math.sin(0.008726645 * (lat2 - lat1))**2) + \
                 math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)
        return math.atan2(a, 1-a)*s
    
    @cuda.jit
    def LLA_distance_numba_cuda(lat2, lon2, xb, xe, yb, ye, s, nx, ny, out):
        x,y = cuda.grid(2)
        if x < nx and y < ny:
            lat1 = (((xe-xb) * x)/(nx-1)) + xb # mesh generation
            lon1 = (((ye-yb) * y)/(ny-1)) + yb # mesh generation
            out[y, x] = a_cuda(lat1, lon1, lat2, lon2, s)
    
    nx, ny = 1000,1000
    Z = cuda.device_array((nx,ny), dtype=np.float64)
    threads = (32,32)
    blocks = (32,32)
    start = time()
    LLA_distance_numba_cuda[blocks,threads](30.0,101.0, 29.0, 31.0, 99.0, 101.0, earthdiam_nm, nx, ny, Z)
    Zh = Z.copy_to_host()
    print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))
    #print(Zh)
    $ nvprof python t40.py
    ==2855== NVPROF is profiling process 2855, command: python t40.py
    1000000 total evaluations in 0.294 seconds
    ==2855== Profiling application: python t40.py
    ==2855== Profiling result:
                Type  Time(%)      Time     Calls       Avg       Min       Max  Name
     GPU activities:   75.60%  10.364ms         1  10.364ms  10.364ms  10.364ms  cudapy::__main__::LLA_distance_numba_cuda$241(double, double, double, double, double, double, double, __int64, __int64, Array<double, int=2, A, mutable, aligned>)
                       24.40%  3.3446ms         1  3.3446ms  3.3446ms  3.3446ms  [CUDA memcpy DtoH]
    (...)
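The two "mesh generation" lines in the kernel above replace np.linspace and np.meshgrid with arithmetic on the thread indices. The following sketch checks that arithmetic on the CPU with NumPy only (no GPU involved); mesh_coord is a name invented for this example.

```python
import numpy as np

def mesh_coord(i, n, begin, end):
    # Same arithmetic as the "mesh generation" lines in the t40.py kernel:
    # index i in [0, n-1] maps linearly onto [begin, end].
    return ((end - begin) * i) / (n - 1) + begin

nx, ny = 1000, 1000
xb, xe, yb, ye = 29.0, 31.0, 99.0, 101.0

# Host-side mesh, as built in the question and in t39.py.
xv, yv = np.meshgrid(np.linspace(xb, xe, nx), np.linspace(yb, ye, ny))

# Index-based coordinates, one value per (y, x) position of the output.
x_idx = np.arange(nx)
y_idx = np.arange(ny)
lat = mesh_coord(x_idx[np.newaxis, :], nx, xb, xe)  # varies along x
lon = mesh_coord(y_idx[:, np.newaxis], ny, yb, ye)  # varies along y

# Largest difference between the meshgrid values and the index formula.
max_err = max(np.abs(xv - lat).max(), np.abs(yv - lon).max())
print(xv.shape, max_err)
```

Because np.meshgrid returns arrays of shape (ny, nx), the flat result Z from t39.py reshaped to (ny, nx) should line up position for position with the (y, x) indexing of the kernel output, at least for the square mesh used here.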