Timing a CUDA kernel that is launched more than once

cuda

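I want to measure the time spent in a kernel that is launched more than once; the data to process is different for each launch. The cudaMemcpy time should not be counted, so my code is as follows: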

cudaEvent_t start;
error = cudaEventCreate(&start);
cudaEvent_t stop;
error = cudaEventCreate(&stop);
float msecTotal = 0.0f;
int nIter = 300;
for (int j = 0; j < nIter; j++)
{
    cudaMemcpy(...);
    // Record the start event
    error = cudaEventRecord(start, NULL);
    matrixMulCUDA1<<< grid, threads >>>(...);
    // Record the stop event
    error = cudaEventRecord(stop, NULL);
    error = cudaEventSynchronize(stop);
    float msec = 0.0f;
    error = cudaEventElapsedTime(&msec, start, stop);
    msecTotal += msec;
}
cout << "Total time = " << msecTotal << endl;

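Either way, you should get similar results. By recording events immediately before and after the kernel launch, you are definitely measuring only the time spent in the kernel, not any time spent in the memcpy.

My only nit is that by calling cudaEventSynchronize() in every iteration of the loop, you break CPU/GPU concurrency, which is actually quite important for good performance. If you must time each kernel invocation separately (as opposed to putting the for loop of nIter iterations around the kernel invocation only, rather than around the whole operation), you may want to allocate more CUDA events. If you do that, you do not need 2 events per loop iteration: you bracket the whole sequence with two events and record only one CUDA event per loop iteration. The time for any given kernel invocation can then be computed by calling cudaEventElapsedTime() on adjacent recorded events.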
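To record the GPU time between N events (N == nIter here), do the following:

cudaEvent_t events[nIter + 1];          // nIter must be a compile-time constant here
for (int k = 0; k <= nIter; k++)
    cudaEventCreate( &events[k] );      // events must be created before they are recorded
cudaEventRecord( events[0], NULL );     // record first event
for (int j = 0; j < nIter; j++ ) {
    // invoke kernel, or do something else you want to time
    cudaEventRecord( events[j+1], NULL );   // closes operation j
}
cudaEventSynchronize( events[nIter] );  // wait once, after all the work has been queued
// to compute the time taken for operation i, call:
float ms;
cudaEventElapsedTime( &ms, events[i], events[i+1] );   // arguments are (start, end)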
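
For readers who want something they can compile, the adjacent-events idea above might look like the sketch below. It is only an illustration under stated assumptions: scaleKernel and timeEachLaunch are hypothetical names that do not appear in the thread, all data is uploaded once before the timed loop so that each launch can work on its own chunk, and error checking is left out for brevity.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being timed: doubles every element.
__global__ void scaleKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

// Launches scaleKernel nIter times, each launch on a different chunk of the
// buffer, and returns the GPU time of every launch in milliseconds.
std::vector<float> timeEachLaunch(int nIter, int chunk)
{
    const size_t bytes = sizeof(float) * (size_t)nIter * (size_t)chunk;
    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemset(d_data, 0, bytes);   // stands in for the upload, done before timing

    // One opening event plus one event per launch.
    std::vector<cudaEvent_t> events(nIter + 1);
    for (auto &e : events) cudaEventCreate(&e);

    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;

    cudaEventRecord(events[0], 0);
    for (int j = 0; j < nIter; j++) {
        scaleKernel<<<blocks, threads>>>(d_data + (size_t)j * chunk, chunk);
        cudaEventRecord(events[j + 1], 0);   // end of launch j, start of launch j+1
    }
    // Single synchronization point, after every launch has been queued.
    cudaEventSynchronize(events[nIter]);

    std::vector<float> ms(nIter, 0.0f);
    for (int i = 0; i < nIter; i++)
        cudaEventElapsedTime(&ms[i], events[i], events[i + 1]);   // (start, end)

    for (auto &e : events) cudaEventDestroy(e);
    cudaFree(d_data);
    return ms;
}
```

Calling timeEachLaunch(300, 1 << 16) would return 300 per-launch times that can be summed or averaged on the host. Note that anything queued between two adjacent events is included in that interval, so a cudaMemcpy placed inside the loop would be counted; that is why this sketch moves the upload in front of the timed region.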

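Comments:

I want to compare the running time of two algorithms on the GPU. The usual approach is to run the program again after it finishes, say 10 times, and take the average; this method comes from the project "./NVIDIA_CUDA-5.0_Samples/C/0_Simple/matrixMul/matrixMul.cu". You answered: "you bracket the whole sequence with two events and record only one CUDA event per loop iteration. The time for any given kernel invocation can then be computed by calling cudaEventElapsedTime() on adjacent recorded events." Could you give an example? I cannot follow your idea. Thanks!

cudaEventElapsedTime() returns the time difference between two recorded events. I edited the answer to make it clearer. But for what you want to do, I don't think you need more than 2 events.

Thanks. I compared the times of three methods on "/NVIDIA_CUDA-5.0_Samples/C/0_Simple/matrixMul/matrixMul.cu": the first, my own method, takes 0.099 ms; the second, provided by ArchaeaSoftware, takes 0.098 ms; the third, the original code in matrixMul.cu, takes 0.095 ms.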
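
The two-event approach mentioned in the comments can be sketched roughly as follows: one pair of events brackets a loop of launches, and the total is divided by the number of iterations. This is a minimal illustration, not the sample's actual code; addOneKernel and averageKernelMs are hypothetical names, the warm-up launch is an assumption about good practice, and error checking is omitted.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel whose average time is wanted.
__global__ void addOneKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

// Returns the average GPU time per launch in milliseconds, measured with a
// single start/stop event pair around nIter back-to-back launches.
float averageKernelMs(int nIter, int n)
{
    float *d_data = nullptr;
    cudaMalloc(&d_data, sizeof(float) * (size_t)n);
    cudaMemset(d_data, 0, sizeof(float) * (size_t)n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time startup costs are not part of the measurement.
    addOneKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    for (int j = 0; j < nIter; j++)
        addOneKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // the only wait, after all launches are queued

    float msecTotal = 0.0f;
    cudaEventElapsedTime(&msecTotal, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return msecTotal / nIter;
}
```

For example, averageKernelMs(300, 1 << 20) gives one averaged figure per algorithm, which is enough when the goal is to compare two kernels rather than to inspect individual launches.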