CUDA allocation/storage time for a large image

c time cuda

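I am doing image processing with CUDA. According to my timings, allocation takes the longest. For one large image it takes 0.00908 seconds to allocate the data and copy it to GPU memory.

Is this a normal time? Am I doing something wrong?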

  clock_t t = clock();  
  float * dData;
  unsigned int nBytes = width*height*sizeof(float);
  cudaMalloc( (void**)&dData, nBytes );
  cudaMemcpy( dData, Data, nBytes, cudaMemcpyHostToDevice );
  t = clock()-t;
  printf( "Allocation to device: %f\n", ((float)t/CLOCKS_PER_SEC) );

Make sure you are compiling in Release mode, not Debug. Sizes are reported in JEDEC units (1 KB = 1024 bytes).

#include <stdio.h>
#include <cuda_runtime.h>

// Times cudaMalloc plus a host-to-device cudaMemcpy for buffers of doubling size
int main ()
{
    float time;
    cudaEvent_t start, stop;

    // Create the timing events once and reuse them for every buffer size
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for(size_t size=32; size<1024*1024*1024; size*=2){
        float* d_Data;
        float* h_Data = new float[size];
        size_t nBytes = size*sizeof(float);

        cudaEventRecord(start, 0);

        cudaMalloc( (void**)&d_Data, nBytes );
        // Copy the whole buffer: the byte count is size*sizeof(float), not size
        cudaMemcpy( d_Data, h_Data, nBytes, cudaMemcpyHostToDevice );

        cudaDeviceSynchronize();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);

        // %zu is the printf format for size_t values
        if(size>1024*1024){
            printf( "Allocation to device: %fms with size %zuMB\n", time, nBytes/(1024*1024) );
        }else if(size>1024){
            printf( "Allocation to device: %fms with size %zuKB\n", time, nBytes/1024 );
        }else{
            printf( "Allocation to device: %fms with size %zuB\n", time, nBytes );
        }
        delete[] h_Data;
        cudaFree(d_Data);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
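
The benchmark above times allocation and copy as one interval, so it cannot show which of the two dominates. A minimal sketch that times them separately is given here. It assumes a buffer of about 14 MB, uses a host wall clock (`std::chrono::steady_clock`) in place of `clock()`, and calls `cudaFree(0)` first so that one-time context creation is not charged to `cudaMalloc`. The names `msSince`, `hData` and `dData` are illustrative, not taken from the original code.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Milliseconds of wall-clock time elapsed since `start`.
static double msSince(Clock::time_point start) {
    return std::chrono::duration<double, std::milli>(Clock::now() - start).count();
}

int main() {
    const size_t nBytes = 14u * 1024u * 1024u;               // assumed: ~14 MB image
    std::vector<float> hData(nBytes / sizeof(float), 0.0f);  // host pixels (zeros)
    float* dData = nullptr;

    // Warm-up: the first runtime call creates the CUDA context, which can
    // take much longer than the allocation itself.
    cudaFree(0);

    // Time the allocation alone.
    Clock::time_point t = Clock::now();
    cudaError_t err = cudaMalloc((void**)&dData, nBytes);
    const double allocMs = msSince(t);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Time the host-to-device copy alone.
    t = Clock::now();
    err = cudaMemcpy(dData, hData.data(), nBytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();  // make sure the transfer has fully completed
    const double copyMs = msSince(t);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        cudaFree(dData);
        return 1;
    }

    std::printf("cudaMalloc: %.3f ms\n", allocMs);
    std::printf("cudaMemcpy: %.3f ms for %zu bytes\n", copyMs, nBytes);

    cudaFree(dData);
    return 0;
}
```

If the `cudaMalloc` figure drops sharply once the warm-up call is present, most of the originally measured time was probably context initialization rather than the allocation.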

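Without more information about the image size and the GPU being used, my crystal ball cannot come up with an answer...

Tell your crystal ball: the image is about 14 megabytes. The GPU is a GTX880M.

In the example above you measure the time for allocation and copy together. Is the result you quoted from that example, or did you really measure only the allocation?

If you are processing multiple images, use the asynchronous methods to overlap the transfer of one image with the processing of another.

I don't see anything structurally wrong with your code. I don't normally use clock(), but it may work for you; its behavior is platform-dependent. Transferring 14 MB of data should take less than 3 ms, and you measured 9 ms, which suggests that cudaMalloc is taking 6 ms, and that seems long. I ran a slightly modified version of your code on a Linux system as a test and got ~4 ms. One thing to try is to make sure that some other CUDA operation, such as cudaFree(0), runs before the timing starts.

Output of the benchmark program above on a K20x with an 8-core Ivy Bridge Xeon:
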
Allocation to device: 0.017504ms with size 128B
Allocation to device: 0.012608ms with size 256B
Allocation to device: 0.462656ms with size 512B
Allocation to device: 0.386432ms with size 1024B
Allocation to device: 0.492512ms with size 2048B
Allocation to device: 0.409568ms with size 4096B
Allocation to device: 0.419648ms with size 8KB
Allocation to device: 0.402144ms with size 16KB
Allocation to device: 0.562240ms with size 32KB
Allocation to device: 0.460480ms with size 64KB
Allocation to device: 0.409376ms with size 128KB
Allocation to device: 0.492864ms with size 256KB
Allocation to device: 0.611424ms with size 512KB
Allocation to device: 0.577376ms with size 1024KB
Allocation to device: 0.722240ms with size 2048KB
Allocation to device: 1.174336ms with size 4096KB
Allocation to device: 0.995552ms with size 8MB
Allocation to device: 2.030592ms with size 16MB
Allocation to device: 3.876384ms with size 32MB
Allocation to device: 7.414432ms with size 64MB
Allocation to device: 15.325792ms with size 128MB
Allocation to device: 31.763008ms with size 256MB
Allocation to device: 65.624481ms with size 512MB
Allocation to device: 133.767838ms with size 1024MB
Allocation to device: 272.001404ms with size 2048MB
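
The estimate in the answer above, that a 14 MB transfer should take less than 3 ms, comes down to simple bandwidth arithmetic. The helper below is a sketch of that calculation. `transferMs` is a hypothetical name, and the 6 GB/s used in the usage note is an assumed round figure for effective host-to-device bandwidth, not a value measured in this thread.

```cpp
// Estimated transfer time in milliseconds for `bytes` bytes moved at
// `gbPerSec` gigabytes per second (1 GB = 1e9 bytes here).
double transferMs(double bytes, double gbPerSec) {
    const double seconds = bytes / (gbPerSec * 1e9);
    return seconds * 1e3;
}
```

Under that assumption, `transferMs(14.0 * 1024 * 1024, 6.0)` comes to roughly 2.4 ms. Subtracting it from the 9.08 ms reported in the question leaves about 6.6 ms for `cudaMalloc` and other overhead, which is in line with the 6 ms figure given above.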