C++: My GPU-accelerated OpenCV code is slower than plain OpenCV
I copied two examples from the book "Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA" to compare CPU and GPU performance.
1st code:
cv::Mat src = cv::imread("D:/Pics/Pen.jpg", 0); // Pen.jpg is a 4096*4096 grayscale image.
cv::Mat result_host1, result_host2, result_host3, result_host4, result_host5;
//Measure initial time ticks
int64 work_begin = getTickCount();
cv::threshold(src, result_host1, 128.0, 255.0, cv::THRESH_BINARY);
cv::threshold(src, result_host2, 128.0, 255.0, cv::THRESH_BINARY_INV);
cv::threshold(src, result_host3, 128.0, 255.0, cv::THRESH_TRUNC);
cv::threshold(src, result_host4, 128.0, 255.0, cv::THRESH_TOZERO);
cv::threshold(src, result_host5, 128.0, 255.0, cv::THRESH_TOZERO_INV);
//Measure the elapsed time ticks once the work is done
int64 delta = getTickCount() - work_begin;
//Timer frequency (ticks per second)
double freq = getTickFrequency();
double work_fps = freq / delta;
std::cout ...

I can think of two reasons why the CPU version is faster even when the memory transfers are excluded:
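1. In the 2nd and 3rd code versions you declare the result GpuMats but never actually initialize them. They are therefore initialized inside the threshold method, through a call to GpuMat.create, which allocates 80 MB of GPU memory on every run. You can see the "performance improvement" by initializing the result GpuMats once and then reusing them.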
With the original 3rd code I get the following results (GeForce RTX 2080):
Time: 0.010208
FPS: 97.9624
When I change the code to:
...
d_result1.create(h_img1.size(), CV_8UC1);
d_result2.create(h_img1.size(), CV_8UC1);
d_result3.create(h_img1.size(), CV_8UC1);
d_result4.create(h_img1.size(), CV_8UC1);
d_result5.create(h_img1.size(), CV_8UC1);
d_img1.upload(h_img1);
//Measure initial time ticks
int64 work_begin = getTickCount();
cv::cuda::threshold(d_img1, d_result1, 128.0, 255.0, cv::THRESH_BINARY);
cv::cuda::threshold(d_img1, d_result2, 128.0, 255.0, cv::THRESH_BINARY_INV);
cv::cuda::threshold(d_img1, d_result3, 128.0, 255.0, cv::THRESH_TRUNC);
cv::cuda::threshold(d_img1, d_result4, 128.0, 255.0, cv::THRESH_TOZERO);
cv::cuda::threshold(d_img1, d_result5, 128.0, 255.0, cv::THRESH_TOZERO_INV);
...
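I get the following results (2x better):
Time: 0.00503374
FPS: 198.659
While preallocating the result GpuMats gives a significant performance gain, the same change to the CPU version does not.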
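2. The K2100M is not a very powerful GPU (576 cores at 665 MHz). Given that OpenCV may (depending on how it was compiled) use multithreading and SIMD instructions under the hood in the CPU version (2.90 GHz, 8 virtual cores), the results are not surprising.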
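The 80 MB figure in point 1 follows from the image size: five result images, each 4096*4096 pixels with one 8-bit channel (CV_8UC1). Below is a minimal sketch of that arithmetic. The helper names `bytesPerResult` and `totalResultBytes` are hypothetical, and the calculation assumes unpadded rows, whereas a real GPU allocation may add row padding on top of this.

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed for one unpadded single-channel 8-bit image.
constexpr std::size_t bytesPerResult(std::size_t rows, std::size_t cols) {
    return rows * cols * sizeof(unsigned char);
}

// Hypothetical helper: bytes needed for `count` result images of the same size.
constexpr std::size_t totalResultBytes(std::size_t rows, std::size_t cols,
                                       std::size_t count) {
    return bytesPerResult(rows, cols) * count;
}

// Five 4096*4096 results add up to exactly 80 MiB (assuming no row padding).
static_assert(totalResultBytes(4096, 4096, 5) == 80u * 1024u * 1024u,
              "five 4096x4096 8-bit images occupy 80 MiB");
```

Each result on its own is 16 MiB, so every call to the threshold function with an uninitialized output would have to allocate that much before it can do any work.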
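Edit:
Profiling the application with NVIDIA Nsight Systems gives a better picture of the penalty of the GPU memory operations:
As you can see, allocating and freeing the memory alone takes 10.5 ms, while the thresholding itself takes only 5 ms.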
Performance of Thresholding on GPU:
Time: 0.599032
FPS: 1.66936
Performance of Thresholding on GPU:
Time: 0.136095
FPS: 7.34779
        1st         2nd        3rd
        CPU         GPU        GPU
Time:   0.0475497   0.599032   0.136095
FPS:    21.0306     1.66936    7.34779
*********************************************************
NVIDIA Quadro K2100M
Micro architecture: Kepler
Compute capability version: 3.0
CUDA Version: 10.1
*********************************************************
*********************************************************
laptop hp ZBook
CPU: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz 2.90 GHZ
RAM: 8.00 GB
OS: Windows 7, 64-bit, Ultimate, Service Pack 1
*********************************************************