Fortran上的CUDA矩阵乘法比C慢_Cuda_Fortran

Fortran上的CUDA矩阵乘法比C慢

cuda fortran

Fortran上的CUDA矩阵乘法比C慢,cuda,fortran,Cuda,Fortran,我正在使用CUDA Fortran和C执行基本矩阵乘法，没有任何优化。Fortran和C都在做完全相同的事情，但Fortran的执行时间较慢 C内核 #define idx(x,y,z) x*y + z __global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) { //Global inidices int tx = blockIdx.y * blockDim.y + thre

我正在使用CUDA Fortran和C执行基本矩阵乘法，没有任何优化。Fortran和C都在做完全相同的事情，但Fortran的执行时间较慢

C内核

#define idx(x,y,z) x*y + z
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {

    //Global inidices
    int tx = blockIdx.y * blockDim.y + threadIdx.y;
    int ty = blockIdx.x * blockDim.x + threadIdx.x;
    int k;

    if (tx < SIZE && ty < SIZE) {
        double accum = 0.0;
        //Accumulation for (tx,ty) position
        for (k=0; k<SIZE; ++k) {
            accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
        }
        d_matSym[idx(tx,SIZE,ty)] = accum;
    }
}
//Call  
dim3 grid_dim(SIZE/32, SIZE/32, 1);
dim3 blk_dim(32, 32, 1);
matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
cudaDeviceSynchronize();

Fortran内核

attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

    integer :: tx, ty, k
    real*8 :: accum
    real*8, dimension(:,:) :: d_mat, d_matT, d_matSym
    
    tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

    if (tx <= SIZE_ .and. ty <=SIZE_) then
        accum = 0.0
        do k=1, SIZE_
            accum = accum + d_mat(tx,k) * d_matT(k,ty)
        end do
        d_matSym(tx,ty) = accum
    end if
end subroutine matrixMultiply

!Call
type(dim3) :: grid_dim, blk_dim
grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
blk_dim = dim3(32, 32, 1)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
err = cudaDeviceSynchronize()

不同之处在于C使用1D数组，而Fortran使用2D数组。但这应该不是问题，因为内存下面是连续的

如果是内存访问，那么在这两种情况下，K循环连续访问一个矩阵，另一个访问按大小跳跃

两者产生相同的结果

对于16384 x 16384矩阵， C:5.4秒 Fortran语言：6.3秒

GPU:Tesla V100 32GB

我不确定我做错了什么？

首先，我建议性能问题包括完整的代码。我通常需要能够运行的东西，你可以节省我一些打字。当然，你可以把东西删掉。当然，我大概能猜出是什么。但我不太可能以这种方式帮助你，我怀疑我不是唯一一个这样认为的人。我的建议是：让别人更容易帮助你。下面我给出了一些有用的例子

关于这个问题：

不同之处在于C使用1D数组，而Fortran使用2D数组。但这应该不是问题，因为内存下面是连续的

TL；DR：你的说法不应该成为问题，这显然是不可支持的。1D分配和2D分配之间的差异不仅从存储角度，而且从索引计算角度来看都很重要。如果你对这个答案的长度很敏感，请跳到这篇文章底部的注释D

详情：

当我们有这样一个循环：

    for (k=0; k<SIZE; ++k) {
        accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
    }

NVIDIA HPC SDK，2021.2，特斯拉V100

我们看到，执行时间约为6.3秒，与您的报告一致

CUDA C++:

$ cat t12.cu
#define idx(x,y,z) x*y + z
const int SIZE = 16384;
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {

    //Global inidices
    int tx = blockIdx.y * blockDim.y + threadIdx.y;
    int ty = blockIdx.x * blockDim.x + threadIdx.x;
    int k;

    if (tx < SIZE && ty < SIZE) {
        double accum = 0.0;
        //Accumulation for (tx,ty) position
        for (k=0; k<SIZE; ++k) {
            accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
        }
        d_matSym[idx(tx,SIZE,ty)] = accum;
    }
}
int main(){
  //Call
  dim3 grid_dim(SIZE/32, SIZE/32, 1);
  dim3 blk_dim(32, 32, 1);
  double *d_mat, *d_matT, *d_matSym;
  cudaMalloc(&d_mat,    SIZE*SIZE*sizeof(double));
  cudaMalloc(&d_matT,   SIZE*SIZE*sizeof(double));
  cudaMalloc(&d_matSym, SIZE*SIZE*sizeof(double));
  matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
  matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
  cudaDeviceSynchronize();
}
$ nvcc -o t12 t12.cu -arch=sm_70
$ nvprof ./t12
==40364== NVPROF is profiling process 40364, command: ./t12
==40364== Profiling application: ./t12
==40364== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  11.0379s         2  5.51893s  5.50503s  5.53282s  matrixMultiply(double*, double*, double*)
      API calls:   97.80%  11.0379s         1  11.0379s  11.0379s  11.0379s  cudaDeviceSynchronize
                    2.15%  242.74ms         3  80.914ms  1.9747ms  238.78ms  cudaMalloc
                    0.04%  4.3105ms         4  1.0776ms  1.0728ms  1.0890ms  cuDeviceTotalMem
                    0.01%  1.4584ms       404  3.6090us     110ns  163.21us  cuDeviceGetAttribute
                    0.00%  142.28us         4  35.571us  32.475us  41.980us  cuDeviceGetName
                    0.00%  51.695us         2  25.847us  7.0790us  44.616us  cudaLaunchKernel
                    0.00%  11.198us         4  2.7990us  1.6260us  5.5950us  cuDeviceGetPCIBusId
                    0.00%  1.7930us         3     597ns     149ns  1.2200us  cuDeviceGetCount
                    0.00%  1.6590us         8     207ns     120ns     704ns  cuDeviceGet
                    0.00%     733ns         4     183ns     161ns     215ns  cuDeviceGetUuid
$

/*02d0*/                   LDG.E.64.SYS R16, [R6+0x40000] ;         /* 0x0400000006107381 */
                                                                    /* 0x001f2800001eeb00 */
/*02e0*/                   LDG.E.64.SYS R24, [R4+0x10] ;            /* 0x0000100004187381 */
                                                                    /* 0x000f2200001eeb00 */
/*02f0*/                   DFMA R10, R20, R10, R22 ;                /* 0x0000000a140a722b */
                                                                    /* 0x0200860000000016 */
/*0300*/                   LDG.E.64.SYS R22, [R6+0x60000] ;         /* 0x0600000006167381 */
                                                                    /* 0x001f6800001eeb00 */
/*0310*/                   LDG.E.64.SYS R20, [R4+0x18] ;            /* 0x0000180004147381 */
                                                                    /* 0x000f6200001eeb00 */
/*0320*/                   DFMA R14, R8, R14, R10 ;                 /* 0x0000000e080e722b */

上述序列在展开的循环区域内重复。我们注意到有2个DFMA操作，4个LDG操作是有意义的-每乘2个，其余的指令4可能在循环进行迭代时构成索引更新

CUDA C++:

$ cat t12.cu
#define idx(x,y,z) x*y + z
const int SIZE = 16384;
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {

    //Global inidices
    int tx = blockIdx.y * blockDim.y + threadIdx.y;
    int ty = blockIdx.x * blockDim.x + threadIdx.x;
    int k;

    if (tx < SIZE && ty < SIZE) {
        double accum = 0.0;
        //Accumulation for (tx,ty) position
        for (k=0; k<SIZE; ++k) {
            accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
        }
        d_matSym[idx(tx,SIZE,ty)] = accum;
    }
}
int main(){
  //Call
  dim3 grid_dim(SIZE/32, SIZE/32, 1);
  dim3 blk_dim(32, 32, 1);
  double *d_mat, *d_matT, *d_matSym;
  cudaMalloc(&d_mat,    SIZE*SIZE*sizeof(double));
  cudaMalloc(&d_matT,   SIZE*SIZE*sizeof(double));
  cudaMalloc(&d_matSym, SIZE*SIZE*sizeof(double));
  matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
  matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
  cudaDeviceSynchronize();
}
$ nvcc -o t12 t12.cu -arch=sm_70
$ nvprof ./t12
==40364== NVPROF is profiling process 40364, command: ./t12
==40364== Profiling application: ./t12
==40364== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  11.0379s         2  5.51893s  5.50503s  5.53282s  matrixMultiply(double*, double*, double*)
      API calls:   97.80%  11.0379s         1  11.0379s  11.0379s  11.0379s  cudaDeviceSynchronize
                    2.15%  242.74ms         3  80.914ms  1.9747ms  238.78ms  cudaMalloc
                    0.04%  4.3105ms         4  1.0776ms  1.0728ms  1.0890ms  cuDeviceTotalMem
                    0.01%  1.4584ms       404  3.6090us     110ns  163.21us  cuDeviceGetAttribute
                    0.00%  142.28us         4  35.571us  32.475us  41.980us  cuDeviceGetName
                    0.00%  51.695us         2  25.847us  7.0790us  44.616us  cudaLaunchKernel
                    0.00%  11.198us         4  2.7990us  1.6260us  5.5950us  cuDeviceGetPCIBusId
                    0.00%  1.7930us         3     597ns     149ns  1.2200us  cuDeviceGetCount
                    0.00%  1.6590us         8     207ns     120ns     704ns  cuDeviceGet
                    0.00%     733ns         4     183ns     161ns     215ns  cuDeviceGetUuid
$

/*02d0*/                   LDG.E.64.SYS R16, [R6+0x40000] ;         /* 0x0400000006107381 */
                                                                    /* 0x001f2800001eeb00 */
/*02e0*/                   LDG.E.64.SYS R24, [R4+0x10] ;            /* 0x0000100004187381 */
                                                                    /* 0x000f2200001eeb00 */
/*02f0*/                   DFMA R10, R20, R10, R22 ;                /* 0x0000000a140a722b */
                                                                    /* 0x0200860000000016 */
/*0300*/                   LDG.E.64.SYS R22, [R6+0x60000] ;         /* 0x0600000006167381 */
                                                                    /* 0x001f6800001eeb00 */
/*0310*/                   LDG.E.64.SYS R20, [R4+0x18] ;            /* 0x0000180004147381 */
                                                                    /* 0x000f6200001eeb00 */
/*0320*/                   DFMA R14, R8, R14, R10 ;                 /* 0x0000000e080e722b */

这个序列再次重复。我们看到2条DFMA和4条LDG指令，但在重复序列中没有其他指令。我们注意到LDG指令似乎有可以在编译时合并到指令中的偏移量，并且这些偏移量消除了对任何额外指令的需要，因为它们只是增加先前计算的偏移量。我们还注意到，在一种情况下，增量是按列偏移量增加的，另一个是行偏移量，就像我们希望在内核循环中获取乘法操作数一样

重构 < >我们可以重构FORTRAN代码，使其像C++代码一样工作吗？是的：

$ cat t11a.cuf
module mmk
   USE cudafor
   !
   ! Definition of symbols for real types (RP)
   !
   IMPLICIT NONE
   !
   INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6,   37)     ! REAL32
   INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307)     ! REAL64
   INTEGER, PARAMETER :: SIZE_ = 16384
   !

   Contains

        attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

          integer :: tx, ty, k
          REAL(DP) :: accum
          REAL(DP), dimension(:) :: d_mat, d_matT, d_matSym

          tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
          ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

          if (tx <= SIZE_ .and. ty <=SIZE_) then
            accum = 0.0
            do k=1, SIZE_
              accum = accum + d_mat((ty-1)*SIZE_+k) * d_matT((k-1)*SIZE_+tx)
            end do
            d_matSym((ty-1)*SIZE_+tx) = accum
          end if
        end subroutine matrixMultiply


end module mmk

PROGRAM Test
   !
   ! This is the main program for Test
   !
   USE cudafor
   USE mmk

   !
   IMPLICIT NONE
   !
   REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:)     :: d_mat, d_matT, d_matSym
   !
   INTEGER    :: err, i1, i2
   type(dim3) :: grid_dim, blk_dim
   !
   ! Allocate storage for the arrays
   !
   Allocate(d_mat(SIZE_*SIZE_),d_matT(SIZE_*SIZE_),d_matSym(SIZE_*SIZE_))
   !
   ! invoke the kernel
   !

   !Call
   grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
   blk_dim = dim3(32, 32, 1)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   err = cudaDeviceSynchronize()


   !
   ! Free storage for the arrays
   !
   Deallocate(d_mat,d_matT,d_matSym)
   !
END PROGRAM Test
$ nvfortran t11a.cuf -o t11a
$ nvprof ./t11a
==45544== NVPROF is profiling process 45544, command: ./t11a
==45544== Profiling application: ./t11a
==45544== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  10.8169s         2  5.40847s  5.39118s  5.42576s  mmk_matrixmultiply_
                    0.00%  9.2160us         6  1.5360us  1.3120us  2.1440us  [CUDA memcpy HtoD]
      API calls:   96.72%  10.8240s         9  1.20266s  12.102us  5.42594s  cudaFree
                    3.22%  360.34ms         9  40.038ms  3.2760us  355.64ms  cudaMalloc
                    0.04%  4.3488ms         4  1.0872ms  1.0598ms  1.1593ms  cuDeviceTotalMem
                    0.02%  2.3633ms       404  5.8490us     111ns  869.22us  cuDeviceGetAttribute
                    0.00%  144.50us         4  36.125us  33.254us  41.317us  cuDeviceGetName
                    0.00%  106.43us         6  17.737us  4.7910us  51.453us  cudaMemcpy
                    0.00%  101.46us         2  50.732us  44.479us  56.985us  cudaLaunchKernel
                    0.00%  20.926us         1  20.926us  20.926us  20.926us  cudaDeviceSynchronize
                    0.00%  7.2850us         4  1.8210us     491ns  4.6210us  cuDeviceGetPCIBusId
                    0.00%  1.7650us         8     220ns     118ns     672ns  cuDeviceGet
                    0.00%     926ns         4     231ns     218ns     264ns  cuDeviceGetUuid
                    0.00%     614ns         3     204ns     130ns     310ns  cuDeviceGetCount
$

D.我认为这种差异在编译时不容易解决的原因之一是，CUDA Fortran中的编译器2D数组处理（无论是否需要）似乎涉及创建运行时元数据包，其中包括数组宽度，可能还有其他信息。此元数据包与阵列数据有效负载分开传输到设备-可以从上面的探查器输出推测出这一点，或者通过运行-print gpu trace来更清楚地查看它。在CUDA C++的情况下，有效地，数组宽度大小在编译时是已知的。在Fortran的情况下，它不是，或者，如果是，编译器似乎没有通过内核代码生成来传播这些知识。您可以看到，上面改进的寻址模式表示编译时已知的偏移量。我不能说这是否应该通过更仔细地处理Fortran来解决。然而，我认为某种连续声明/修饰本身不足以解决编译器的这种差异。仅此声明并不能消除基于数组宽度的动态运行时索引计算的需要。

关于这个问题：

不同之处在于C使用1D数组，而Fortran使用2D数组。但这应该不是问题，因为内存下面是连续的

TL；DR：你的说法不应该成为问题，这显然是不可支持的。1D分配和2D分配之间的差异不仅从存储角度，而且从索引计算角度来看都很重要。如果你对这个答案的长度很敏感，请跳到bo的注释D 我是这篇文章的作者

详情：

当我们有这样一个循环：

    for (k=0; k<SIZE; ++k) {
        accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
    }

NVIDIA HPC SDK，2021.2，特斯拉V100

我们看到，执行时间约为6.3秒，与您的报告一致

CUDA C++:

$ cat t12.cu
#define idx(x,y,z) x*y + z
const int SIZE = 16384;
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {

    //Global inidices
    int tx = blockIdx.y * blockDim.y + threadIdx.y;
    int ty = blockIdx.x * blockDim.x + threadIdx.x;
    int k;

    if (tx < SIZE && ty < SIZE) {
        double accum = 0.0;
        //Accumulation for (tx,ty) position
        for (k=0; k<SIZE; ++k) {
            accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
        }
        d_matSym[idx(tx,SIZE,ty)] = accum;
    }
}
int main(){
  //Call
  dim3 grid_dim(SIZE/32, SIZE/32, 1);
  dim3 blk_dim(32, 32, 1);
  double *d_mat, *d_matT, *d_matSym;
  cudaMalloc(&d_mat,    SIZE*SIZE*sizeof(double));
  cudaMalloc(&d_matT,   SIZE*SIZE*sizeof(double));
  cudaMalloc(&d_matSym, SIZE*SIZE*sizeof(double));
  matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
  matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
  cudaDeviceSynchronize();
}
$ nvcc -o t12 t12.cu -arch=sm_70
$ nvprof ./t12
==40364== NVPROF is profiling process 40364, command: ./t12
==40364== Profiling application: ./t12
==40364== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  11.0379s         2  5.51893s  5.50503s  5.53282s  matrixMultiply(double*, double*, double*)
      API calls:   97.80%  11.0379s         1  11.0379s  11.0379s  11.0379s  cudaDeviceSynchronize
                    2.15%  242.74ms         3  80.914ms  1.9747ms  238.78ms  cudaMalloc
                    0.04%  4.3105ms         4  1.0776ms  1.0728ms  1.0890ms  cuDeviceTotalMem
                    0.01%  1.4584ms       404  3.6090us     110ns  163.21us  cuDeviceGetAttribute
                    0.00%  142.28us         4  35.571us  32.475us  41.980us  cuDeviceGetName
                    0.00%  51.695us         2  25.847us  7.0790us  44.616us  cudaLaunchKernel
                    0.00%  11.198us         4  2.7990us  1.6260us  5.5950us  cuDeviceGetPCIBusId
                    0.00%  1.7930us         3     597ns     149ns  1.2200us  cuDeviceGetCount
                    0.00%  1.6590us         8     207ns     120ns     704ns  cuDeviceGet
                    0.00%     733ns         4     183ns     161ns     215ns  cuDeviceGetUuid
$

/*02d0*/                   LDG.E.64.SYS R16, [R6+0x40000] ;         /* 0x0400000006107381 */
                                                                    /* 0x001f2800001eeb00 */
/*02e0*/                   LDG.E.64.SYS R24, [R4+0x10] ;            /* 0x0000100004187381 */
                                                                    /* 0x000f2200001eeb00 */
/*02f0*/                   DFMA R10, R20, R10, R22 ;                /* 0x0000000a140a722b */
                                                                    /* 0x0200860000000016 */
/*0300*/                   LDG.E.64.SYS R22, [R6+0x60000] ;         /* 0x0600000006167381 */
                                                                    /* 0x001f6800001eeb00 */
/*0310*/                   LDG.E.64.SYS R20, [R4+0x18] ;            /* 0x0000180004147381 */
                                                                    /* 0x000f6200001eeb00 */
/*0320*/                   DFMA R14, R8, R14, R10 ;                 /* 0x0000000e080e722b */

上述序列在展开的循环区域内重复。我们注意到有2个DFMA操作，4个LDG操作是有意义的-每乘2个，其余的指令4可能在循环进行迭代时构成索引更新

CUDA C++:

$ cat t12.cu
#define idx(x,y,z) x*y + z
const int SIZE = 16384;
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {

    //Global inidices
    int tx = blockIdx.y * blockDim.y + threadIdx.y;
    int ty = blockIdx.x * blockDim.x + threadIdx.x;
    int k;

    if (tx < SIZE && ty < SIZE) {
        double accum = 0.0;
        //Accumulation for (tx,ty) position
        for (k=0; k<SIZE; ++k) {
            accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
        }
        d_matSym[idx(tx,SIZE,ty)] = accum;
    }
}
int main(){
  //Call
  dim3 grid_dim(SIZE/32, SIZE/32, 1);
  dim3 blk_dim(32, 32, 1);
  double *d_mat, *d_matT, *d_matSym;
  cudaMalloc(&d_mat,    SIZE*SIZE*sizeof(double));
  cudaMalloc(&d_matT,   SIZE*SIZE*sizeof(double));
  cudaMalloc(&d_matSym, SIZE*SIZE*sizeof(double));
  matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
  matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
  cudaDeviceSynchronize();
}
$ nvcc -o t12 t12.cu -arch=sm_70
$ nvprof ./t12
==40364== NVPROF is profiling process 40364, command: ./t12
==40364== Profiling application: ./t12
==40364== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  11.0379s         2  5.51893s  5.50503s  5.53282s  matrixMultiply(double*, double*, double*)
      API calls:   97.80%  11.0379s         1  11.0379s  11.0379s  11.0379s  cudaDeviceSynchronize
                    2.15%  242.74ms         3  80.914ms  1.9747ms  238.78ms  cudaMalloc
                    0.04%  4.3105ms         4  1.0776ms  1.0728ms  1.0890ms  cuDeviceTotalMem
                    0.01%  1.4584ms       404  3.6090us     110ns  163.21us  cuDeviceGetAttribute
                    0.00%  142.28us         4  35.571us  32.475us  41.980us  cuDeviceGetName
                    0.00%  51.695us         2  25.847us  7.0790us  44.616us  cudaLaunchKernel
                    0.00%  11.198us         4  2.7990us  1.6260us  5.5950us  cuDeviceGetPCIBusId
                    0.00%  1.7930us         3     597ns     149ns  1.2200us  cuDeviceGetCount
                    0.00%  1.6590us         8     207ns     120ns     704ns  cuDeviceGet
                    0.00%     733ns         4     183ns     161ns     215ns  cuDeviceGetUuid
$

/*02d0*/                   LDG.E.64.SYS R16, [R6+0x40000] ;         /* 0x0400000006107381 */
                                                                    /* 0x001f2800001eeb00 */
/*02e0*/                   LDG.E.64.SYS R24, [R4+0x10] ;            /* 0x0000100004187381 */
                                                                    /* 0x000f2200001eeb00 */
/*02f0*/                   DFMA R10, R20, R10, R22 ;                /* 0x0000000a140a722b */
                                                                    /* 0x0200860000000016 */
/*0300*/                   LDG.E.64.SYS R22, [R6+0x60000] ;         /* 0x0600000006167381 */
                                                                    /* 0x001f6800001eeb00 */
/*0310*/                   LDG.E.64.SYS R20, [R4+0x18] ;            /* 0x0000180004147381 */
                                                                    /* 0x000f6200001eeb00 */
/*0320*/                   DFMA R14, R8, R14, R10 ;                 /* 0x0000000e080e722b */

重构 < >我们可以重构FORTRAN代码，使其像C++代码一样工作吗？是的：

$ cat t11a.cuf
module mmk
   USE cudafor
   !
   ! Definition of symbols for real types (RP)
   !
   IMPLICIT NONE
   !
   INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6,   37)     ! REAL32
   INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307)     ! REAL64
   INTEGER, PARAMETER :: SIZE_ = 16384
   !

   Contains

        attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

          integer :: tx, ty, k
          REAL(DP) :: accum
          REAL(DP), dimension(:) :: d_mat, d_matT, d_matSym

          tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
          ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

          if (tx <= SIZE_ .and. ty <=SIZE_) then
            accum = 0.0
            do k=1, SIZE_
              accum = accum + d_mat((ty-1)*SIZE_+k) * d_matT((k-1)*SIZE_+tx)
            end do
            d_matSym((ty-1)*SIZE_+tx) = accum
          end if
        end subroutine matrixMultiply


end module mmk

PROGRAM Test
   !
   ! This is the main program for Test
   !
   USE cudafor
   USE mmk

   !
   IMPLICIT NONE
   !
   REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:)     :: d_mat, d_matT, d_matSym
   !
   INTEGER    :: err, i1, i2
   type(dim3) :: grid_dim, blk_dim
   !
   ! Allocate storage for the arrays
   !
   Allocate(d_mat(SIZE_*SIZE_),d_matT(SIZE_*SIZE_),d_matSym(SIZE_*SIZE_))
   !
   ! invoke the kernel
   !

   !Call
   grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
   blk_dim = dim3(32, 32, 1)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   err = cudaDeviceSynchronize()


   !
   ! Free storage for the arrays
   !
   Deallocate(d_mat,d_matT,d_matSym)
   !
END PROGRAM Test
$ nvfortran t11a.cuf -o t11a
$ nvprof ./t11a
==45544== NVPROF is profiling process 45544, command: ./t11a
==45544== Profiling application: ./t11a
==45544== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  10.8169s         2  5.40847s  5.39118s  5.42576s  mmk_matrixmultiply_
                    0.00%  9.2160us         6  1.5360us  1.3120us  2.1440us  [CUDA memcpy HtoD]
      API calls:   96.72%  10.8240s         9  1.20266s  12.102us  5.42594s  cudaFree
                    3.22%  360.34ms         9  40.038ms  3.2760us  355.64ms  cudaMalloc
                    0.04%  4.3488ms         4  1.0872ms  1.0598ms  1.1593ms  cuDeviceTotalMem
                    0.02%  2.3633ms       404  5.8490us     111ns  869.22us  cuDeviceGetAttribute
                    0.00%  144.50us         4  36.125us  33.254us  41.317us  cuDeviceGetName
                    0.00%  106.43us         6  17.737us  4.7910us  51.453us  cudaMemcpy
                    0.00%  101.46us         2  50.732us  44.479us  56.985us  cudaLaunchKernel
                    0.00%  20.926us         1  20.926us  20.926us  20.926us  cudaDeviceSynchronize
                    0.00%  7.2850us         4  1.8210us     491ns  4.6210us  cuDeviceGetPCIBusId
                    0.00%  1.7650us         8     220ns     118ns     672ns  cuDeviceGet
                    0.00%     926ns         4     231ns     218ns     264ns  cuDeviceGetUuid
                    0.00%     614ns         3     204ns     130ns     310ns  cuDeviceGetCount
$

为什么您认为自己做错了什么？为什么用不同语言编写并使用不同编译器编译的代码会产生相同的性能？分解生成的SAS并查找差异。也许您在这两种情况下未描述的编译器设置都不同。…@talonmies我同意您的看法，这里可能没有什么必要的错误。但是这个问题是正确的，因为Fortran在很多情况下通常与C一样快或比C更快，尽管差别通常很小@阿披实：我没有使用Cuda Fortran的经验。但是如果您没有得到问题的答案，假设您的编译器是nvfortran，请在英伟达论坛上问你的问题，并在这里发布你的答案：@ AbHeHekPurand Addia1297你是否多次重复你的例子并平均结果？FORTRAN使用2D：我有更多的数组维数经常使代码变慢的可能，因为更多的地址计算算法。好奇，将连续属性附加到此行real*8，维度：，：：d_mat，d_mat，d_matSym可能会有一些效果？我也认为英伟达论坛也许有用，你为什么认为你做错了什么？为什么用不同语言编写并使用不同编译器编译的代码会产生相同的性能？分解生成的SAS并查找差异。也许您在这两种情况下未描述的编译器设置都不同。…@talonmies我同意您的看法，这里可能没有什么必要的错误。但是这个问题是正确的，因为Fortran在很多情况下通常与C一样快或比C更快，尽管差别通常很小@阿披实：我没有使用Cuda Fortran的经验。但是如果您没有得到问题的答案，假设您的编译器是nvfortran，请在英伟达论坛上问你的问题，并在这里发布你的答案：@ AbHeHekPurand Addia1297你是否多次重复你的例子并平均结果？FORTRAN使用2D：我有更多的数组维数经常使代码变慢的可能，因为更多的地址计算算法。好奇，将连续属性附加到此行real*8，维度：，：：d_mat，d_mat，d_matSym可能会有一些效果？我还认为英伟达论坛可能是有用的，当编译器支持联合/映射时，它可以方便地作为一个2D彻底完成矩阵。但对于基本类型的操作，则将其称为1D ar

ray/vector。感谢您花时间完成此操作。我真的很高兴你仍然对这种深度的分析充满热情。我一般不会。与往常一样，Fortran指针、智能数组索引、切片、广播和算术等中的许多较新功能都有令人惊讶的贪婪实现，与较新的语言相比，这些实现侵蚀了传统Fortran的大部分性能优势……这是一个编译器不知道数组大小的问题。我还是Fortran新手，所以我错误地忽略了定义内核参数数组的大小。real*8，dimensionSIZE，SIZE：：d_mat，d_matT，d_matSym声明修复了该问题。我感谢你的时间和透彻的洞察力！当编译器支持UNION/MAP时，通常可以方便地将矩阵作为2D文件进行处理。但对于基本类型的操作，请将其称为1D数组/向量。感谢您花时间完成此操作。我真的很高兴你仍然对这种深度的分析充满热情。我一般不会。与往常一样，Fortran指针、智能数组索引、切片、广播和算术等中的许多较新功能都有令人惊讶的贪婪实现，与较新的语言相比，这些实现侵蚀了传统Fortran的大部分性能优势……这是一个编译器不知道数组大小的问题。我还是Fortran新手，所以我错误地忽略了定义内核参数数组的大小。real*8，dimensionSIZE，SIZE：：d_mat，d_matT，d_matSym声明修复了该问题。我感谢你的时间和透彻的洞察力！