CUDA streams are blocking despite being asynchronous

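I am working on a video stream in real time, which I am trying to process with a GeForce GTX 960M. (Windows 10, VS 2013, CUDA 8.0)

Each frame has to be captured and lightly blurred, and whenever I can, I need to run a heavy computation on the 10 most recent frames. So I need to capture ALL frames at 30 fps, and I expect to get the heavy-computation results at 5 fps.

My problem is that I cannot keep the capture running at the right pace: the heavy computation seems to slow down frame capture, either at the CPU level or at the GPU level. I miss some frames.

I tried many solutions. None of them worked:

  • I tried to set up the jobs on 2 streams (image below):
    • The host gets a frame.
    • First stream (called Stream2): cudaMemcpyAsync copies the frame to the device. Then a first kernel does the basic blurring. (In the attached image, the blur shows up as short slots at 3.07 s and 3.085 s, and then nothing... until the big part has finished.)
    • The host checks through a CUDA event whether the second stream is "available", and launches it if possible. In practice, the stream is available on 1 out of 2 attempts.
    • Second stream (called Stream4): launches the heavy computation in a kernel (kernelCalcul_W2), outputs the result, and records an event.
  • In practice, I wrote: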

    cudaStream_t  sHigh, sLow;
    cudaStreamCreateWithPriority(&sHigh, cudaStreamNonBlocking, priority_high);
    cudaStreamCreateWithPriority(&sLow, cudaStreamNonBlocking, priority_low);
    
    cudaEvent_t event_1;
    cudaEventCreate(&event_1);
    
    if (frame has arrived)
    {
        cudaMemcpyAsync(..., sHigh);        // HtoD, to upload images in the GPU
        blur_Image <<<... , sHigh>>> (...);
        if (cudaEventQuery(event_1) == cudaSuccess) hard_work(sLow);
        else printf("Event 2 not ready\n");
    }
    
    void hard_work( cudaStream_t sLow_)
    {
        kernelCalcul_W2<<<... , sLow_>>> (...);
        cudaMemcpyAsync(... the result..., sLow_); //DtoH
        cudaEventRecord(event_1, sLow_);    
    }
    
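  • I tried using only one stream. It is the same code as above, with one parameter changed when launching hard_work.
    • The host gets a frame.
    • Stream: cudaMemcpyAsync copies the frame to the device. Then the kernel does the basic blurring. Then, if the CUDA event event_1 is OK, I launch the hard work and record event_1 again to get the status for the next round. In practice, the stream is ALWAYS available: I never fall into the "else" branch.
  • This way, while the hard work is running, I expected to "buffer" all the frames to be copied without losing any. But I do lose some: every time I get a frame and copy it, event_1 seems to be OK, so I launch the hard work and only get the next frame very late.

  • I tried putting the two streams in two different threads (in C). No better (even worse).

So the question is: how can I make sure the first stream captures ALL frames? I really have the feeling that the different streams block the CPU.

I display the images with OpenGL. Could that interfere?

Any idea how to improve this? Thanks a lot!

EDIT: As requested, here is an MCVE.

There is a parameter you can tune (#define ADJUST) to see what happens. Basically, the main procedure sends CUDA requests in asynchronous mode, but this seems to block the main thread. As you will see in the image, I have a "memory access" (i.e. a captured image) every 30 ms, except while the hard work is running (then I simply get no images).

Last detail: I am using CUDA 7.5 to run this. I tried to install 8.0, but apparently the compiler is still 7.5.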

    #define _USE_MATH_DEFINES 1
    #define _CRT_SECURE_NO_WARNINGS 1
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <Windows.h>
    
    #define ADJUST  400
    // adjusting this parameter may make the problem occur.
    // Too high => probably watchdog will stop the kernel
    // too low => probably the kernel will run smoothly
    
    unsigned short * images_as_Unsigned_in_Host;
    unsigned short * Images_as_Unsigned_in_Device;
    unsigned short * camera;
    float * images_as_Output_in_Host;
    float *  Images_as_Float_in_Device;
    float * imageOutput_in_Device;
    
    unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
    unsigned long imagePixelSize;
    unsigned short lastImageFromCamera;
    
    
    cudaStream_t  s1, s2;
    cudaEvent_t event_2;
    clock_t timeRef;
    
    // Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
    // This kernel runs fast, and that's the point.
    __global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_, 
        unsigned long  imagePixelSize_, short blur_distance)
    {
        // we start from 'blur_distance' from the edge
        // p0 is the point we will calculate. p is a pointer which will move around for average
        unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
        unsigned long p = p0;
        unsigned short * us;
        if (p >= imagePixelSize_) return;
        unsigned long tot = 0;
        short a, b, n, k;
        k = 0;
        // p starts from the top edge and will move to the right-bottom
        p -= blur_distance + blur_distance * imageWidth_;
        us = Images_as_Unsigned_in_Device_ + p;
        for (a = 2 * blur_distance; a >= 0; a--)
        {
            for (b = 2 * blur_distance; b >= 0; b--)
            {
                n = *us;
                if (n > 0) { tot += n; k++; }
                us++;
            }
            us += imageWidth_ - 2 * blur_distance - 1;
        }
        if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
        else Images_as_Float_in_Device_[p0] = 128.f;
    }
    
    
    __global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long  imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
    {
        // point the pixel and crunch it
        unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
        if (p >= imagePixelSize_)   { return; }
        float result = 0.f;  // start the accumulation from a defined value
        long a, b, n, n0;
        float input;
        b = 3;
    
        // this is not the right algorithm (which is pretty complex). 
        // I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
        for (n = 0; n < 10; n++)
        {
            n0 = slot - n;
            if (n0 < 0) n0 += totImages;
            input = inputImage[p + n0 * imagePixelSize_]; 
            for (a = 0; a < ADJUST ; a++)
                    result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
        }
        outputImage[p] = result;
    }
    
    
    void hard_work( cudaStream_t s){
    
        cudaError err;
        // launch the hard work
        printf("Hard work is launched after image %d is captured  ==> ", imageSlot);
        kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
        err = cudaPeekAtLastError();
        if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
        else printf("running ok\n");
    
        // copy the result back to Host
        //printf(" %p  %p  \n", images_as_Output_in_Host, imageOutput_in_Device);
        cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) *  imagePixelSize, cudaMemcpyDeviceToHost, s);
        cudaEventRecord(event_2, s);
    }
    
    
    void createStorageSpace()
    {
        imageWidth = 640;
        imageHeight = 480;
        totNbOfImages = 300;
        imageSlot = 0;
        imagePixelSize = 640 * 480;
        lastImageFromCamera = 0;
    
        camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
        for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
        // storing the images in the Host memory. I know I could optimize with cudaHostAllocate.
        images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
        images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
    
        cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
        cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
    
        cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
    
    
    
        int priority_high, priority_low;
        cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
        cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
        cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
        cudaEventCreate(&event_2);
    
    }
    
    void releaseMapFile()
    {
        cudaFree(Images_as_Unsigned_in_Device);
        cudaFree(Images_as_Float_in_Device);
        cudaFree(imageOutput_in_Device);
        free(images_as_Output_in_Host);
        free(camera);
    
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaEventDestroy(event_2);
    }
    
    void putImageCUDA(const void * data)
    {       
        // We put the image in a round-robin. The slot to put the image is imageSlot
        printf("\nDealing with image %d\n", imageSlot);
        // Copy the image in the Round Robin
        cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) *  imagePixelSize, cudaMemcpyHostToDevice, s1);
    
        // We will blur the image. Let's prepare the memory to get the results as floats
        cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0., sizeof(float) *  imagePixelSize, s1);
    
        // blur image
        blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
                    Images_as_Float_in_Device + imageSlot * imagePixelSize,
                    imageWidth, imagePixelSize, 3);
    
    
        // launches the hard-work
        if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
        else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
    
        imageSlot++;
        if (imageSlot >= totNbOfImages) {
            imageSlot = 0;
        }
    }
    
    int main()
    {
        createStorageSpace();
        printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
    
        for (int i = 0; i < 10; i++)
        {
            putImageCUDA(camera);  // Puts an image in the GPU, does the blurring, and tries to do the hard-work
            Sleep(30);  // to simulate Camera
        }
        releaseMapFile();
        getchar();
    }
    
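A modified version of the MCVE, ported to Linux and given an optional USE_HOST_ALLOC switch that allocates the host buffers with cudaHostAlloc, compiled and timed both ways:

    $ cat t33.cu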
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>
    
    #define ADJUST  400
    // adjusting this parameter may make the problem occur.
    // Too high => probably watchdog will stop the kernel
    // too low => probably the kernel will run smoothly
    
    unsigned short * images_as_Unsigned_in_Host;
    unsigned short * Images_as_Unsigned_in_Device;
    unsigned short * camera;
    float * images_as_Output_in_Host;
    float *  Images_as_Float_in_Device;
    float * imageOutput_in_Device;
    
    unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
    unsigned long imagePixelSize;
    unsigned short lastImageFromCamera;
    
    
    cudaStream_t  s1, s2;
    cudaEvent_t event_2;
    clock_t timeRef;
    
    // Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
    // This kernel runs fast, and that's the point.
    __global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
        unsigned long  imagePixelSize_, short blur_distance)
    {
        // we start from 'blur_distance' from the edge
        // p0 is the point we will calculate. p is a pointer which will move around for average
        unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
        unsigned long p = p0;
        unsigned short * us;
        if (p >= imagePixelSize_) return;
        unsigned long tot = 0;
        short a, b, n, k;
        k = 0;
        // p starts from the top edge and will move to the right-bottom
        p -= blur_distance + blur_distance * imageWidth_;
        us = Images_as_Unsigned_in_Device_ + p;
        for (a = 2 * blur_distance; a >= 0; a--)
        {
            for (b = 2 * blur_distance; b >= 0; b--)
            {
                n = *us;
                if (n > 0) { tot += n; k++; }
                us++;
            }
            us += imageWidth_ - 2 * blur_distance - 1;
        }
        if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
        else Images_as_Float_in_Device_[p0] = 128.f;
    }
    
    
    __global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long  imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
    {
        // point the pixel and crunch it
        unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
        if (p >= imagePixelSize_)   { return; }
        float result = 0.f;  // start the accumulation from a defined value
        long a, n, n0;
        float input;
    
        // this is not the right algorithm (which is pretty complex).
        // I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
        for (n = 0; n < 10; n++)
        {
            n0 = slot - n;
            if (n0 < 0) n0 += totImages;
            input = inputImage[p + n0 * imagePixelSize_];
            for (a = 0; a < ADJUST ; a++)
                    result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
        }
        outputImage[p] = result;
    }
    
    
    void hard_work( cudaStream_t s){
    #ifndef QUICK
        cudaError err;
        // launch the hard work
        printf("Hard work is launched after image %d is captured  ==> ", imageSlot);
        kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
        err = cudaPeekAtLastError();
        if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
        else printf("running ok\n");
    
        // copy the result back to Host
        //printf(" %p  %p  \n", images_as_Output_in_Host, imageOutput_in_Device);
        cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) *  imagePixelSize/2, cudaMemcpyDeviceToHost, s);
        cudaEventRecord(event_2, s);
    #endif
    }
    
    
    void createStorageSpace()
    {
        imageWidth = 640;
        imageHeight = 480;
        totNbOfImages = 300;
        imageSlot = 0;
        imagePixelSize = 640 * 480;
        lastImageFromCamera = 0;
    #ifdef USE_HOST_ALLOC
        cudaHostAlloc(&camera, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
        cudaHostAlloc(&images_as_Unsigned_in_Host, imagePixelSize*sizeof(unsigned short)*totNbOfImages, cudaHostAllocDefault);
        cudaHostAlloc(&images_as_Output_in_Host, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
    #else
        camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
        images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
        images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
    #endif
        for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
        cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
        cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
    
        cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
    
    
    
        int priority_high, priority_low;
        cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
        cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
        cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
        cudaEventCreate(&event_2);
        cudaEventRecord(event_2, s2);
    }
    
    void releaseMapFile()
    {
        cudaFree(Images_as_Unsigned_in_Device);
        cudaFree(Images_as_Float_in_Device);
        cudaFree(imageOutput_in_Device);
    
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaEventDestroy(event_2);
    }
    
    void putImageCUDA(const void * data)
    {
        // We put the image in a round-robin. The slot to put the image is imageSlot
        printf("\nDealing with image %d\n", imageSlot);
        // Copy the image in the Round Robin
        cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) *  imagePixelSize, cudaMemcpyHostToDevice, s1);
    
        // We will blur the image. Let's prepare the memory to get the results as floats
        cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) *  imagePixelSize, s1);
    
        // blur image
        blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
                    Images_as_Float_in_Device + imageSlot * imagePixelSize,
                    imageWidth, imagePixelSize, 3);
    
    
        // launches the hard-work
        if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
        else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
    
        imageSlot++;
        if (imageSlot >= totNbOfImages) {
            imageSlot = 0;
        }
    }
    
    int main()
    {
        createStorageSpace();
        printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
    
        for (int i = 0; i < 10; i++)
        {
            putImageCUDA(camera);  // Puts an image in the GPU, does the blurring, and tries to do the hard-work
            usleep(30000);  // to simulate Camera
        }
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) printf("some CUDA error: %s\n", cudaGetErrorString(err));
        releaseMapFile();
    }
    $ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu
    $ time ./t33
    The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
    You may adjust a #define ADJUST parameter to see what's happening.
    Dealing with image 0
    Hard work is launched after image 0 is captured  ==> running ok
    
    Dealing with image 1
    Hard work is launched after image 1 is captured  ==> running ok
    
    Dealing with image 2
    Hard work is launched after image 2 is captured  ==> running ok
    
    Dealing with image 3
    Hard work is launched after image 3 is captured  ==> running ok
    
    Dealing with image 4
    Hard work is launched after image 4 is captured  ==> running ok
    
    Dealing with image 5
    Hard work is launched after image 5 is captured  ==> running ok
    
    Dealing with image 6
    Hard work is launched after image 6 is captured  ==> running ok
    
    Dealing with image 7
    Hard work is launched after image 7 is captured  ==> running ok
    
    Dealing with image 8
    Hard work is launched after image 8 is captured  ==> running ok
    
    Dealing with image 9
    Hard work is launched after image 9 is captured  ==> running ok
    
    real    0m2.790s
    user    0m0.688s
    sys     0m0.966s
    $ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu -DUSE_HOST_ALLOC
    $ time ./t33
    The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
    You may adjust a #define ADJUST parameter to see what's happening.
    Dealing with image 0
    Hard work is launched after image 0 is captured  ==> running ok
    
    Dealing with image 1
    Hard_work still running, so unable to process after image 1
    
    Dealing with image 2
    Hard_work still running, so unable to process after image 2
    
    Dealing with image 3
    Hard_work still running, so unable to process after image 3
    
    Dealing with image 4
    Hard_work still running, so unable to process after image 4
    
    Dealing with image 5
    Hard_work still running, so unable to process after image 5
    
    Dealing with image 6
    Hard_work still running, so unable to process after image 6
    
    Dealing with image 7
    Hard work is launched after image 7 is captured  ==> running ok
    
    Dealing with image 8
    Hard_work still running, so unable to process after image 8
    
    Dealing with image 9
    Hard_work still running, so unable to process after image 9
    
    real    0m1.721s
    user    0m0.028s
    sys     0m0.629s
    $
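
One way to read the two runs above: they differ only in -DUSE_HOST_ALLOC, which switches the host buffers from malloc to cudaHostAlloc. As far as I know from the CUDA runtime notes on API synchronization behavior, cudaMemcpyAsync is only fully asynchronous with respect to the host when the host buffer is page-locked. A copy from device memory into pageable host memory returns only once the copy has completed, so it also waits for any kernel queued before it in the same stream. The sketch below tries to isolate that effect. It is not part of the original code: busyKernel, enqueueMs and the CHECK macro are made-up names, and the spin length is an arbitrary assumption.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            printf("CUDA error: %s (line %d)\n",                        \
                   cudaGetErrorString(e_), __LINE__);                   \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Hypothetical stand-in for the "hard work": spins for about `cycles`
// GPU clock ticks, then writes one value per thread.
__global__ void busyKernel(long long cycles, float *out)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
    out[threadIdx.x] = 1.0f;
}

// Queue the kernel and a device-to-host copy on stream `s`, and return the
// host time (in ms) spent inside the two enqueue calls only.
static double enqueueMs(float *hostDst, float *devSrc, size_t n, cudaStream_t s)
{
    auto t0 = std::chrono::steady_clock::now();
    busyKernel<<<1, 32, 0, s>>>(200000000LL, devSrc);   // assumed spin length
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(hostDst, devSrc, n * sizeof(float),
                          cudaMemcpyDeviceToHost, s));
    auto t1 = std::chrono::steady_clock::now();
    CHECK(cudaStreamSynchronize(s));                     // drain before the next test
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const size_t n = 1 << 20;
    float *dev = nullptr, *pinned = nullptr;
    float *pageable = (float *)malloc(n * sizeof(float));
    if (pageable == nullptr) { printf("malloc failed\n"); return 1; }

    cudaStream_t s;
    CHECK(cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking));
    CHECK(cudaMalloc((void **)&dev, n * sizeof(float)));
    CHECK(cudaMemset(dev, 0, n * sizeof(float)));
    CHECK(cudaHostAlloc((void **)&pinned, n * sizeof(float), cudaHostAllocDefault));
    CHECK(cudaDeviceSynchronize());

    enqueueMs(pinned, dev, n, s);                        // warm-up, result ignored

    printf("pageable destination: enqueue took %.1f ms\n",
           enqueueMs(pageable, dev, n, s));
    printf("pinned destination:   enqueue took %.1f ms\n",
           enqueueMs(pinned, dev, n, s));

    CHECK(cudaFreeHost(pinned));
    CHECK(cudaFree(dev));
    CHECK(cudaStreamDestroy(s));
    free(pageable);
    return 0;
}
```

It builds with plain nvcc (the file name is up to you). If the pageable figure comes out close to the kernel's run time while the pinned one stays small, the host thread was being held up by the copy into malloc'd memory and not by the stream itself. On Windows with the WDDM driver model, launch batching can add further delays, so the numbers there may look different.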