C++: My GPU-accelerated OpenCV code is slower than plain OpenCV


I copied two examples from the book "Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA" to compare CPU and GPU performance.

Code 1:

cv::Mat src = cv::imread("D:/Pics/Pen.jpg", 0); // Pen.jpg is a 4096*4096 grayscale picture.
cv::Mat result_host1, result_host2, result_host3, result_host4, result_host5;
//Measure initial time ticks
int64 work_begin = getTickCount();
cv::threshold(src, result_host1, 128.0, 255.0, cv::THRESH_BINARY);
cv::threshold(src, result_host2, 128.0, 255.0, cv::THRESH_BINARY_INV);
cv::threshold(src, result_host3, 128.0, 255.0, cv::THRESH_TRUNC);
cv::threshold(src, result_host4, 128.0, 255.0, cv::THRESH_TOZERO);
cv::threshold(src, result_host5, 128.0, 255.0, cv::THRESH_TOZERO_INV);
//Measure time ticks after the work is done
int64 delta = getTickCount() - work_begin;
//Timer frequency
double freq = getTickFrequency();
double work_fps = freq / delta;
std::cout ...

I can think of two reasons why the CPU version is faster even when the memory operations are left out:
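1. In the 2nd and 3rd code versions you declare the result GpuMats but never actually initialize them. They are initialized inside the threshold method through a call to GpuMat::create, which means 80 MB of GPU memory is allocated on every execution. You can see a "performance improvement" by initializing the result GpuMats once and then reusing them.
With the original 3rd code I get the following results (GeForce RTX 2080):
Time: 0.010208
FPS: 97.9624
When I change the code to: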
...
d_result1.create(h_img1.size(), CV_8UC1);
d_result2.create(h_img1.size(), CV_8UC1);
d_result3.create(h_img1.size(), CV_8UC1);
d_result4.create(h_img1.size(), CV_8UC1);
d_result5.create(h_img1.size(), CV_8UC1);
d_img1.upload(h_img1);
//Measure initial time ticks
int64 work_begin = getTickCount();
cv::cuda::threshold(d_img1, d_result1, 128.0, 255.0, cv::THRESH_BINARY);
cv::cuda::threshold(d_img1, d_result2, 128.0, 255.0, cv::THRESH_BINARY_INV);
cv::cuda::threshold(d_img1, d_result3, 128.0, 255.0, cv::THRESH_TRUNC);
cv::cuda::threshold(d_img1, d_result4, 128.0, 255.0, cv::THRESH_TOZERO);
cv::cuda::threshold(d_img1, d_result5, 128.0, 255.0, cv::THRESH_TOZERO_INV);
...

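I get the following results (2x better):
Time: 0.00503374
FPS: 198.659
While preallocating the GpuMat results brings a significant performance gain, the same modification to the CPU version does not.
2. The K2100M is not a very powerful GPU (576 cores at 665 MHz). Given that OpenCV may (depending on how it was compiled) use multithreading and SIMD instructions under the hood in the CPU version (2.90GHz, 8 virtual cores), the results are not surprising.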
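Point 1 can be illustrated without a GPU. The sketch below is a CPU-only stand-in, not OpenCV code: `FakeImage`, `AllocStats`, `fakeThreshold` and the two `runWith...` functions are hypothetical names invented for this illustration. The only behaviour borrowed from OpenCV is that `create` allocates only when the requested size differs from the current one, so an output that already has the right size is reused. The code counts how many allocations each pattern triggers; it does not model how expensive one allocation is on a GPU.

```cpp
#include <cstddef>
#include <vector>

// Counts allocations made by FakeImage::create (hypothetical helper).
struct AllocStats {
    std::size_t allocations = 0;
    std::size_t bytes = 0;
};

// Hypothetical stand-in for a single-channel 8-bit image; NOT cv::cuda::GpuMat.
struct FakeImage {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<unsigned char> data;

    // Allocates only if the requested size differs from the current one.
    void create(std::size_t r, std::size_t c, AllocStats& stats) {
        if (r == rows && c == cols) return;  // right size already: reuse
        data.assign(r * c, 0);
        rows = r;
        cols = c;
        ++stats.allocations;
        stats.bytes += r * c;
    }
};

// Threshold-like call: sizes dst via create(), then writes every pixel.
void fakeThreshold(const FakeImage& src, FakeImage& dst, AllocStats& stats) {
    dst.create(src.rows, src.cols, stats);
    for (std::size_t i = 0; i < src.data.size(); ++i) {
        dst.data[i] = src.data[i] > 128 ? 255 : 0;
    }
}

// Pattern from the question: outputs are declared but not created,
// so every frame allocates all five of them again.
AllocStats runWithFreshOutputs(const FakeImage& src, int frames) {
    AllocStats stats;
    for (int f = 0; f < frames; ++f) {
        FakeImage results[5];
        for (FakeImage& r : results) fakeThreshold(src, r, stats);
    }
    return stats;
}

// Pattern from the answer: create the outputs once, then reuse them.
AllocStats runWithReusedOutputs(const FakeImage& src, int frames) {
    AllocStats stats;
    FakeImage results[5];
    for (FakeImage& r : results) r.create(src.rows, src.cols, stats);
    for (int f = 0; f < frames; ++f) {
        for (FakeImage& r : results) fakeThreshold(src, r, stats);
    }
    return stats;
}
```

For a 4096 x 4096 source and 3 frames, `runWithFreshOutputs` counts 15 allocations (5 x 16 MiB = 80 MiB per frame), while `runWithReusedOutputs` counts 5 allocations in total, no matter how many frames are processed.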
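Edit:
By profiling the application with NVIDIA Nsight Systems you can get a better understanding of the penalty of the GPU memory operations:

As you can see, allocating and freeing the memory alone takes 10.5 ms, while the thresholding itself takes only 5 ms.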
Performance of Thresholding on GPU:
Time: 0.599032
FPS: 1.66936

Performance of Thresholding on GPU: 
Time: 0.136095
FPS: 7.34779

         1st         2nd         3rd
         CPU         GPU         GPU
Time: 0.0475497   0.599032    0.136095
FPS:  21.0306     1.66936     7.34779

*********************************************************
NVIDIA Quadro K2100M

Micro architecture: Kepler

Compute capability version: 3.0

CUDA Version: 10.1
*********************************************************

*********************************************************
Laptop: HP ZBook

CPU: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz 2.90 GHz

RAM: 8.00 GB

OS: Windows 7, 64-bit, Ultimate, Service Pack 1
*********************************************************
