
CUDA cuFFT API behavior in concurrent streams


I am doing some image processing with CUDA 7.0 and an nVidia 980 GTX. In a given iteration, multiple tiles are processed independently via 15-20 kernel calls and multiple cuFFT FFT/IFFT API calls.

Because of this, I place each tile in its own CUDA stream so that each tile executes its string of operations asynchronously with respect to the host. The tiles are all the same size within an iteration, so they share a cuFFT plan. The host thread moves through the commands quickly in an attempt to keep the GPU loaded with work. I am experiencing a periodic race condition while these operations are processed in parallel, though, and have a question about cuFFT in particular. If I place a cuFFT plan in stream 0 using cufftSetStream() for tile 0, and tile 0's FFT has not actually been executed on the GPU yet before the host sets the shared cuFFT plan's stream to stream 1 for tile 1, what is the behavior of cufftExec() on the GPU?

More succinctly, does a call to cufftExec() execute in the stream the plan was set to at the time of the cufftExec() call, regardless of whether cufftSetStream() is used to change the stream for subsequent tiles before the previous FFT call has actually begun or completed?


I apologize for not posting code, but I am unable to post my actual source.
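
For illustration only, since the actual source cannot be shared, here is a minimal sketch of the pattern being described; the tile count, buffer sizes, and the preprocess kernel are hypothetical placeholders:

// Hypothetical sketch of the per-tile pattern described above; names, sizes,
// and the preprocess kernel are illustrative, not the actual application code.
// NOTE: as discussed in the answer's EDIT below, sharing one handle across
// concurrently executing streams also requires per-use work-area management,
// which is omitted here.
#include <cufft.h>
#include <cuda_runtime.h>

#define NTILES 8
#define TILE_NX 512

// stand-in for the 15-20 per-tile kernels mentioned in the question
__global__ void preprocess(cufftComplex *d, int n){
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) d[i].x *= 2.0f;
}

int main(){
  cudaStream_t streams[NTILES];
  cufftComplex *tiles[NTILES];
  cufftHandle plan;                          // one plan shared by all tiles
  cufftPlan1d(&plan, TILE_NX, CUFFT_C2C, 1);
  for (int i = 0; i < NTILES; i++){
    cudaStreamCreate(&streams[i]);
    cudaMalloc(&tiles[i], TILE_NX*sizeof(cufftComplex));
  }
  for (int i = 0; i < NTILES; i++){
    cufftSetStream(plan, streams[i]);        // retarget the shared plan for this tile
    preprocess<<<2, 256, 0, streams[i]>>>(tiles[i], TILE_NX);
    cufftExecC2C(plan, tiles[i], tiles[i], CUFFT_FORWARD);  // which stream does this run in?
  }
  cudaDeviceSynchronize();
  return 0;
}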

EDIT: As was pointed out in the comments, if the same plan (the same created handle) is used for simultaneous FFT execution on the same device via streams, then the user is responsible for managing a separate work area for each such use of the plan. The question seemed to be focused on the stream behavior itself, and the rest of my answer focuses on that as well, but this is an important point.
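
For concreteness, here is a minimal sketch (not from the original answer, and worth checking against the cuFFT documentation for your version) of managing a separate work area per use of a single shared handle via cufftSetAutoAllocation() and cufftSetWorkArea(); stream and buffer names are illustrative:

// Sketch: disable the plan's automatic work area and supply a separate one
// before each use of the shared handle in a different stream.
#include <cufft.h>
#include <cuda_runtime.h>

int main(){
  const int nx = 1048576, batch = 1, nstreams = 2;
  size_t ws = 0;
  cufftHandle plan;
  cufftCreate(&plan);
  cufftSetAutoAllocation(plan, 0);            // we manage the work area ourselves
  cufftMakePlan1d(plan, nx, CUFFT_C2C, batch, &ws);

  cudaStream_t s[nstreams];
  void *work[nstreams];
  cufftComplex *d[nstreams];
  for (int i = 0; i < nstreams; i++){
    cudaStreamCreate(&s[i]);
    cudaMalloc(&work[i], ws);                 // one work area per concurrent use
    cudaMalloc(&d[i], nx*sizeof(cufftComplex));
  }
  for (int i = 0; i < nstreams; i++){
    cufftSetWorkArea(plan, work[i]);          // bind this use's private work area
    cufftSetStream(plan, s[i]);               // then bind its stream
    cufftExecC2C(plan, d[i], d[i], CUFFT_FORWARD);
  }
  cudaDeviceSynchronize();
  return 0;
}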

If I place a cuFFT plan in stream 0 using cufftSetStream() for tile 0, and tile 0's FFT has not actually been executed on the GPU yet before the host sets the shared cuFFT plan's stream to stream 1 for tile 1, what is the behavior of cufftExec() on the GPU?

Let me assume you mean stream 1 and stream 2, so that we can avoid any possible confusion around the NULL stream.

CUFFT should respect the stream that was defined for the plan at the time the plan was passed to CUFFT via cufftExecXXX(). Subsequent changes to the plan via cufftSetStream() should have no effect on the stream used for previously issued cufftExecXXX() calls.

We can verify this with a fairly simple test, using the profiler. Consider the following test code:

$ cat t1089.cu
// NOTE: this code omits independent work-area handling for each plan
// which is necessary for a plan that will be shared between streams
// and executed concurrently
#include <cufft.h>
#include <assert.h>
#include <nvToolsExt.h>

#define DSIZE 1048576
#define BATCH 100

int main(){

  const int nx = DSIZE;
  const int nb = BATCH;
  size_t ws = 0;
  cufftHandle plan;
  cufftResult res = cufftCreate(&plan);
  assert(res == CUFFT_SUCCESS);
  res = cufftMakePlan1d(plan, nx, CUFFT_C2C, nb, &ws);
  assert(res == CUFFT_SUCCESS);
  cufftComplex *d;
  cudaMalloc(&d, nx*nb*sizeof(cufftComplex));
  cudaMemset(d, 0, nx*nb*sizeof(cufftComplex));
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  res = cufftSetStream(plan, s1);                  // bind the shared plan to stream s1
  assert(res == CUFFT_SUCCESS);
  res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);   // first FFT, issued while the plan targets s1
  assert(res == CUFFT_SUCCESS);
  res = cufftSetStream(plan, s2);                  // retarget the shared plan to stream s2
  assert(res == CUFFT_SUCCESS);
  nvtxMarkA("plan stream change");                 // NVTX marker visible in the profiler timeline
  res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);   // second FFT, issued while the plan targets s2
  assert(res == CUFFT_SUCCESS);
  cudaDeviceSynchronize();
  return 0;
}


$ nvcc -o t1089 t1089.cu -lcufft -lnvToolsExt
$ cuda-memcheck ./t1089
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
Profiling the run, we see that each FFT operation requires 3 kernel calls. In between the two sets, we see our nvtx marker indicating when the request to change the plan's stream was issued; unsurprisingly, this takes place after the first 3 kernel launches but before the last 3. Finally, we note that essentially all of the execution time is absorbed in the final cudaDeviceSynchronize() call. All of the preceding calls are asynchronous, so they execute more or less "immediately" within the first millisecond of execution. The final synchronize absorbs the processing time of all 6 kernels, amounting to roughly 150 milliseconds.
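
As a quick sanity check of that asynchrony, here is a small self-contained sketch (not part of the original test) that uses a dummy spin kernel and host wall-clock timers; it shows the issue phase returning almost immediately while the final cudaDeviceSynchronize() absorbs essentially all of the kernel time:

// Sketch: measure host-side issue time vs. synchronization time for a batch
// of asynchronous kernel launches (mirrors the six FFT kernels above).
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

__global__ void spin(long long cycles){
  long long start = clock64();
  while (clock64() - start < cycles) { }           // busy-wait for ~cycles clocks
}

static double now_ms(){
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec*1000.0 + ts.tv_nsec/1e6;
}

int main(){
  spin<<<1,1>>>(0);                                // warm-up: absorbs context-creation cost
  cudaDeviceSynchronize();
  double t0 = now_ms();
  for (int i = 0; i < 6; i++)
    spin<<<1,1>>>(100000000LL);                    // ~0.1 s of GPU work per launch
  double t1 = now_ms();                            // launches are async: reached almost at once
  cudaDeviceSynchronize();                         // absorbs essentially all of the kernel time
  double t2 = now_ms();
  printf("issue: %.3f ms, sync: %.3f ms\n", t1 - t0, t2 - t1);
  return 0;
}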

Therefore, if the cufftSetStream() change were to have an effect on the first invocation of cufftExecC2C(), we would expect to see some or all of the first 3 kernels launched into the same stream as the one used for the last 3 kernels. But when we look at the output of nvprof --print-gpu-trace:

$ nvprof --print-gpu-trace ./t1089
==3757== NVPROF is profiling process 3757, command: ./t1089
==3757== Profiling application: ./t1089
==3757== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
974.74ms  7.3440ms                    -               -         -         -         -  800.00MB  106.38GB/s  Quadro 5000 (0)         1         7  [CUDA memset]
982.09ms  23.424ms          (25600 2 1)        (32 8 1)        32  8.0000KB        0B         -           -  Quadro 5000 (0)         1        13  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [416]
1.00551s  21.172ms          (25600 2 1)        (32 8 1)        32  8.0000KB        0B         -           -  Quadro 5000 (0)         1        13  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [421]
1.02669s  27.551ms          (25600 1 1)       (16 16 1)        61  17.000KB        0B         -           -  Quadro 5000 (0)         1        13  void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [426]
1.05422s  23.592ms          (25600 2 1)        (32 8 1)        32  8.0000KB        0B         -           -  Quadro 5000 (0)         1        14  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [431]
1.07781s  21.157ms          (25600 2 1)        (32 8 1)        32  8.0000KB        0B         -           -  Quadro 5000 (0)         1        14  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [436]
1.09897s  27.913ms          (25600 1 1)       (16 16 1)        61  17.000KB        0B         -           -  Quadro 5000 (0)         1        14  void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [441]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
We see that, indeed, the first 3 kernels were issued into the first stream and the last 3 kernels were issued into the second stream, just as requested. (And, as the API trace output suggested, the total execution time of all the kernels is approximately 150 ms.) Since the underlying kernel launches are asynchronous and are issued before the cufftExecC2C() call returns, thinking this through carefully leads to the conclusion that it has to work this way: the stream a kernel is launched into is fixed at the moment the kernel is launched.
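
Finally, since the practical concern behind the question was the shared plan, note that the same per-tile structure can also be written with one plan per stream, which sidesteps the shared-handle work-area caveat in the EDIT above entirely, at the cost of one plan's setup time and work area per stream. A hedged sketch (hypothetical sizes and names, not from the original answer):

// Sketch: one cuFFT plan per stream, so no handle or work area is ever shared
// between concurrently executing FFTs.
#include <cufft.h>
#include <cuda_runtime.h>

#define NTILES 8
#define TILE_NX 512

int main(){
  cudaStream_t streams[NTILES];
  cufftHandle plans[NTILES];
  cufftComplex *tiles[NTILES];
  for (int i = 0; i < NTILES; i++){
    cudaStreamCreate(&streams[i]);
    cufftPlan1d(&plans[i], TILE_NX, CUFFT_C2C, 1);  // independent plan (and work area)
    cufftSetStream(plans[i], streams[i]);           // each plan permanently targets its stream
    cudaMalloc(&tiles[i], TILE_NX*sizeof(cufftComplex));
  }
  for (int i = 0; i < NTILES; i++)
    cufftExecC2C(plans[i], tiles[i], tiles[i], CUFFT_FORWARD);
  cudaDeviceSynchronize();
  for (int i = 0; i < NTILES; i++){
    cufftDestroy(plans[i]);
    cudaFree(tiles[i]);
    cudaStreamDestroy(streams[i]);
  }
  return 0;
}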