Using cache in OpenACC

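I am trying to use the !$acc cache directive on a specific loop inside a 2D Laplace solver. When I analyze the code with -Mcuda=ptxinfo, it reports that no shared memory (smem) is used, yet the code runs slower than the baseline version without the directive.

Here is the relevant part of the code: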

  !$acc parallel loop reduction(max:error) num_gangs(n/THREADS) vector_length(THREADS)
  do j=2,m-1
    do i=2,n-1
      #ifdef SHARED
        !$acc cache(A(i-1:i+1,j),A(i,j-1:j+1))
      #endif
      Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
      error = max( error, abs( Anew(i,j) - A(i,j) ) )
    end do
  end do
 !$acc end parallel

This is the ptxinfo output when !$acc cache is used:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_20'
ptxas info    : Function properties for acc_lap2d_39_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 28 registers, 96 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_20'
ptxas info    : Function properties for acc_lap2d_39_gpu_red
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 96 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_20'
ptxas info    : Function properties for acc_lap2d_58_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 20 registers, 64 bytes cmem[0]
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_30'
ptxas info    : Function properties for acc_lap2d_39_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 37 registers, 384 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_30'
ptxas info    : Function properties for acc_lap2d_39_gpu_red
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 384 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_30'
ptxas info    : Function properties for acc_lap2d_58_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 20 registers, 352 bytes cmem[0]
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_35'
ptxas info    : Function properties for acc_lap2d_39_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 38 registers, 384 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_35'
ptxas info    : Function properties for acc_lap2d_39_gpu_red
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 384 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_35'
ptxas info    : Function properties for acc_lap2d_58_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 39 registers, 352 bytes cmem[0]
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_50'
ptxas info    : Function properties for acc_lap2d_39_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 37 registers, 384 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_50'
ptxas info    : Function properties for acc_lap2d_39_gpu_red
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 384 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_50'
ptxas info    : Function properties for acc_lap2d_58_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 30 registers, 352 bytes cmem[0]

This is the output without the cache directive:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_20'
ptxas info    : Function properties for acc_lap2d_39_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 23 registers, 88 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_20'
ptxas info    : Function properties for acc_lap2d_39_gpu_red
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 88 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_20'
ptxas info    : Function properties for acc_lap2d_58_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 20 registers, 64 bytes cmem[0]
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_30'
ptxas info    : Function properties for acc_lap2d_39_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 29 registers, 376 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_30'
ptxas info    : Function properties for acc_lap2d_39_gpu_red
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 376 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_30'
ptxas info    : Function properties for acc_lap2d_58_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 20 registers, 352 bytes cmem[0]
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_35'
ptxas info    : Function properties for acc_lap2d_39_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 36 registers, 376 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_35'
ptxas info    : Function properties for acc_lap2d_39_gpu_red
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 376 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_35'
ptxas info    : Function properties for acc_lap2d_58_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 39 registers, 352 bytes cmem[0]
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_50'
ptxas info    : Function properties for acc_lap2d_39_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 38 registers, 376 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_50'
ptxas info    : Function properties for acc_lap2d_39_gpu_red
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 376 bytes cmem[0]
ptxas info    : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_50'
ptxas info    : Function properties for acc_lap2d_58_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 30 registers, 352 bytes cmem[0]

With -Minfo=accel, the compiler also reports that some amount of memory is being cached:

acc_lap2d:
     17, Generating copy(a(:4096,:4096))
         Generating create(anew(:4096,:4096))
     39, Accelerator kernel generated
         Generating Tesla code
         39, Max reduction generated for error
         40, !$acc loop gang(256) ! blockidx%x
         41, !$acc loop vector(16) ! threadidx%x
             Cached references to size [(x)x3] block of a
         Loop is parallelizable
     58, Accelerator kernel generated
         Generating Tesla code
         59, !$acc loop gang ! blockidx%x
         60, !$acc loop vector(128) ! threadidx%x
         Loop is parallelizable

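I would like to know how to use the cache directive effectively in OpenACC (shared memory, in the CUDA sense).

Thank you very much for your help.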

Behzad

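The compiler should be flagging this as an error. You cannot list the same variable twice in the same cache directive. Since I work for PGI, I have added a technical problem report (TPR#21898) requesting that we detect this error. Although this is not specifically illegal in the current OpenACC specification, we will bring it to the standards committee. The problem is that the compiler cannot tell which of the two cached arrays to use in which case.

The workaround is to combine the two references into one: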

!$acc cache(A(i-1:i+1,j-1:j+1))
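
For context, here is a minimal sketch of how the combined directive could sit inside the loop nest from the question. The module name lap2d_cache_sketch, the routine name jacobi_sweep, the dp kind parameter and the data clauses are assumptions made for illustration; the original post shows only the loop itself, and the num_gangs/vector_length tuning clauses are left out here. Whether the cached block actually helps still has to be measured.

```fortran
! Minimal sketch: one Jacobi sweep with a single combined cache reference.
! The module and routine names are hypothetical, not from the original post.
module lap2d_cache_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine jacobi_sweep(A, Anew, n, m, error)
    integer,  intent(in)    :: n, m
    real(dp), intent(in)    :: A(n, m)      ! read-only input grid
    real(dp), intent(inout) :: Anew(n, m)   ! only interior points are overwritten
    real(dp), intent(out)   :: error        ! max |Anew - A| over the interior
    integer :: i, j

    error = 0.0_dp
    !$acc parallel loop gang reduction(max:error) copyin(A) copy(Anew)
    do j = 2, m-1
      !$acc loop vector reduction(max:error)
      do i = 2, n-1
        ! One 3x3 block covers all five stencil points, so A is listed once.
        !$acc cache(A(i-1:i+1, j-1:j+1))
        Anew(i,j) = 0.25_dp * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
        error = max( error, abs( Anew(i,j) - A(i,j) ) )
      end do
    end do
  end subroutine jacobi_sweep
end module lap2d_cache_sketch
```

Without OpenACC enabled, the directives are ordinary comments and the routine runs serially on the host, which makes it easy to check the numerical result before comparing timings with and without the cache line.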

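Note that the PTX info will not show the shared memory usage, because it only reports fixed-size shared memory. We size the shared memory dynamically when launching the CUDA kernel. Looking at the generated CUDA C code (-ta=tesla:nollvm,keep), I can see that the shared memory references are being generated.

Also note that using shared memory does not guarantee better performance. Populating the shared array has overhead, and the generated kernel needs to synchronize threads. Unless there is a lot of reuse, "cache" may not be beneficial.

If the PGI compiler can determine that an array is read-only, either through analysis or because it is declared "INTENT(IN)", and the target device has compute capability 3.5 or higher, then we will try to use texture memory. In this case, putting "A" in texture memory may be more beneficial.

Hope this helps,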

Mat

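Which version of the PGI tools are you using?

Thank you very much for your reply, Mat. It helped me a lot. Since OpenACC has no syntax for explicitly controlling shared memory (syncthreads() and so on), how can I find out whether using !$acc cache will help? Can I tell by looking at the .gpu file generated with "-ta=tesla:nollvm,keep", or should I do something else? Thanks a lot. Also, regarding texture memory: how can I put the array "A" into texture memory in OpenACC?

Hi Behzad, I don't have a hard and fast rule for determining whether using "cache" will help. Implementing "cache" is very difficult for a compiler. While PGI has gotten better at it in our 15.x releases, there are still cases that need improvement. So you may have a valid use case that the compiler is not optimizing as well as it could. That said, when using "cache", make sure there is enough reuse and that the compiler is generating the cached references (see the PGI compiler feedback messages via -Minfo=accel).

As for texture memory, it is a feature of a particular device, so it is not part of the OpenACC standard. Instead, you need to rely on the compiler to take advantage of it. PGI does this, as do Cray and PathScale. You just need to make it easier for the compiler to tell that the data is read-only. For Fortran, that means using "INTENT(IN)"; for C/C++, it means using the const and restrict keywords.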
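
To make the read-only advice concrete, here is a minimal sketch of a routine whose input array is declared INTENT(IN). The module name readonly_sketch, the routine name smooth3 and the three-point average are assumptions chosen only for illustration and do not come from the thread. Nothing in the code requests texture memory explicitly; whether read-only loads are generated is left to the compiler and the target device.

```fortran
! Minimal sketch: marking an input array read-only with INTENT(IN).
! The module and routine names are hypothetical.
module readonly_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine smooth3(x, y, n)
    integer,  intent(in)  :: n
    real(dp), intent(in)  :: x(n)   ! never written here, so provably read-only
    real(dp), intent(out) :: y(n)
    integer :: i

    ! End points have no two-sided neighborhood; copy them through unchanged.
    y(1) = x(1)
    y(n) = x(n)
    !$acc parallel loop copyin(x) copy(y)
    do i = 2, n-1
      y(i) = ( x(i-1) + x(i) + x(i+1) ) / 3.0_dp
    end do
  end subroutine smooth3
end module readonly_sketch
```

The same idea applies to the Laplace loop: if "A" is passed into the routine that contains the kernel as an INTENT(IN) dummy argument, the compiler has an easier time proving that it is never written.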