C++: my OpenCL test doesn't run much faster than the CPU
I am trying to measure the GPU's execution time and compare it with the CPU. I wrote a simple simple_add function that adds all the elements of vectors of packed short ints. The kernel code is:
void kernel simple_add(global const int * A, global const uint * B, global int* C)
{
///------------------------------------------------
/// Add the low and high 16 bits of each int separately
int AA=A[get_global_id(0)];
int BB=B[get_global_id(0)];
int AH=0xFFFF0000 & AA;
int AL=0x0000FFFF & AA;
int BH=0xFFFF0000 & BB;
int BL=0x0000FFFF & BB;
int CL=(AL+BL)&0x0000FFFF;
int CH=(AH+BH)&0xFFFF0000;
C[get_global_id(0)]=CH|CL;
}
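The CPU-side AddImagesCPU used for comparison below is not shown in the question; a sketch of an equivalent scalar routine (the signature, including the element count, is assumed) could look like this:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical sketch of AddImagesCPU (its real signature is not shown in
// the question). It mirrors the kernel: the low and high 16-bit halves of
// each packed 32-bit element are added independently, so a carry out of the
// low half never spills into the high half.
void AddImagesCPU(const std::int32_t* A, const std::uint32_t* B,
                  std::int32_t* C, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        std::uint32_t AA = static_cast<std::uint32_t>(A[i]);
        std::uint32_t BB = B[i];
        std::uint32_t CL = ((AA & 0x0000FFFFu) + (BB & 0x0000FFFFu)) & 0x0000FFFFu;
        std::uint32_t CH = ((AA & 0xFFFF0000u) + (BB & 0xFFFF0000u)) & 0xFFFF0000u;
        C[i] = static_cast<std::int32_t>(CH | CL);
    }
}
```

Unsigned intermediates are used here so that the high-half addition cannot trigger signed overflow.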
I wrote a CPU version of the same function, ran each 100 times, and measured their execution times:
clock_t before_GPU = clock();
for(int i=0;i<100;i++)
{
queue.enqueueNDRangeKernel(kernel_add,1,
cl::NDRange((size_t)(NumberOfAllElements/4)),cl::NDRange(64));
queue.finish();
}
clock_t after_GPU = clock();
clock_t before_CPU = clock();
for(int i=0;i<100;i++)
AddImagesCPU(A,B,C);
clock_t after_CPU = clock();
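As a side note, clock() measures consumed CPU time rather than wall-clock time (a point raised in the comments below); a monotonic-clock alternative can be sketched like this, where someWork stands in for the enqueue/finish loop or AddImagesCPU:

```cpp
#include <chrono>

// Time an arbitrary callable with a monotonic wall clock instead of clock().
template <typename F>
double elapsed_microseconds(F&& someWork)
{
    auto t0 = std::chrono::steady_clock::now();
    someWork();
    auto t1 = std::chrono::steady_clock::now();
    // duration<double, micro> converts the tick difference to microseconds.
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```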
The problem is that I really expected the GPU to be much faster than the CPU, but it wasn't. I don't understand why my GPU is barely faster than my CPU. Is there something wrong with my code?
Here are my GPU properties:
-----------------------------------------------------
------------- Selected Platform Properties-------------:
NAME: AMD Accelerated Parallel Processing
EXTENSION: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
VENDOR: Advanced Micro Devices, Inc.
VERSION: OpenCL 1.2 AMD-APP (937.2)
PROFILE: FULL_PROFILE
-----------------------------------------------------
------------- Selected Device Properties-------------:
NAME : ATI RV730
TYPE : 4
VENDOR : Advanced Micro Devices, Inc.
PROFILE : FULL_PROFILE
VERSION : OpenCL 1.0 AMD-APP (937.2)
EXTENSIONS : cl_khr_gl_sharing cl_amd_device_attribute_query cl_khr_d3d10_sharing
MAX_COMPUTE_UNITS : 8
MAX_WORK_GROUP_SIZE : 128
OPENCL_C_VERSION : OpenCL C 1.0
DRIVER_VERSION: CAL 1.4.1734
==========================================================
Compare that with my CPU specs:
------------- CPU Properties-------------:
NAME : Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz
TYPE : 2
VENDOR : GenuineIntel
PROFILE : FULL_PROFILE
VERSION : OpenCL 1.2 AMD-APP (937.2)
MAX_COMPUTE_UNITS : 4
MAX_WORK_GROUP_SIZE : 1024
OPENCL_C_VERSION : OpenCL C 1.2
DRIVER_VERSION: 2.0 (sse2,avx)
==========================================================
I also measured the wall-clock time using QueryPerformanceCounter, with these results:
CPU time: 1304449.6 micro-sec
GPU time: 1401740.82 micro-sec
----------------------
CPU time: 1620076.55 micro-sec
GPU time: 1310317.64 micro-sec
----------------------
CPU time: 1468520.44 micro-sec
GPU time: 1317153.63 micro-sec
----------------------
CPU time: 1304367.29 micro-sec
GPU time: 1251865.14 micro-sec
----------------------
CPU time: 1301589.17 micro-sec
GPU time: 1252889.4 micro-sec
----------------------
CPU time: 1294750.21 micro-sec
GPU time: 1257017.41 micro-sec
----------------------
CPU time: 1297506.93 micro-sec
GPU time: 1252768.9 micro-sec
----------------------
CPU time: 1293511.29 micro-sec
GPU time: 1252019.88 micro-sec
----------------------
CPU time: 1320753.54 micro-sec
GPU time: 1248895.73 micro-sec
----------------------
CPU time: 1296486.95 micro-sec
GPU time: 1255207.91 micro-sec
----------------------
I tried once more, this time measuring the execution time with OpenCL event profiling:
queue.enqueueNDRangeKernel(kernel_add,1,
cl::NDRange((size_t)(NumberOfAllElements/4)),
cl::NDRange(64),NULL,&ev);
ev.wait();
queue.finish();
time_start=ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
time_end=ev.getProfilingInfo<CL_PROFILING_COMMAND_END>();
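The two profiling counters are device timestamps in nanoseconds (cl_ulong), so converting their difference to microseconds is just a division (a sketch; the sample values in the test are hypothetical):

```cpp
#include <cstdint>

// CL_PROFILING_COMMAND_START/END are device-side timestamps in nanoseconds.
// This returns the kernel-only execution time in microseconds.
double profiled_microseconds(std::uint64_t time_start, std::uint64_t time_end)
{
    return static_cast<double>(time_end - time_start) / 1000.0;
}
```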
The ATI RV730 has a VLIW architecture, so it is better to try the uint4 and int4 vector types with a quarter of the total threads (i.e. NumberOfAllElements/16). That also helps each work item load from memory faster.
The kernel does very little computation compared with its memory operations, so mapping the buffers into RAM will perform better. Instead of copying the arrays, map them into memory with the map/unmap enqueue commands.
If it is still not fast enough, you can use the GPU and CPU at the same time, each working on half of the data, and finish in half the time.
Also, do not put clFinish inside the loop; put it after the loop. That way the commands are enqueued faster, and since this is an in-order queue they already execute sequentially, so a later kernel will not start before the previous one has finished. Calling clFinish after every enqueue is extra overhead; a single clFinish after the last kernel is enough.
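Applied to the timing loop from the question, that change is just (an untested host-code sketch):

```cpp
clock_t before_GPU = clock();
for (int i = 0; i < 100; i++)
{
    // enqueue only; the in-order queue keeps the launches sequential
    queue.enqueueNDRangeKernel(kernel_add, 1,
        cl::NDRange((size_t)(NumberOfAllElements / 4)), cl::NDRange(64));
}
queue.finish();  // wait once, after all 100 launches are queued
clock_t after_GPU = clock();
```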
ATI RV730: 64 VLIW units, each with at least 4 stream cores, at 750 MHz.
i3-2100: 2 cores (the extra hardware threads only help avoid pipeline bubbles), each with AVX, streaming 8 operations at once, so 16 operations in flight, at over 3 GHz.
Simply multiplying streaming operations by frequency:
ATI RV730 = 192 units (more for multiply-and-add, using the 5th element of each VLIW)
i3-2100 = 48 units
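That back-of-the-envelope arithmetic can be written out explicitly, using the figures quoted above:

```cpp
// Peak "streaming op" estimates in billions of ops per second (G-ops/s).
double rv730_gops()   { return 64 * 4 * 0.75; }  // 64 VLIW units x 4 stream cores x 0.75 GHz
double i3_2100_gops() { return 2 * 8 * 3.0;   }  // 2 cores x 8-wide AVX x 3.0 GHz
```

The ratio of the two estimates is where the "at least 4x" expectation below comes from.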
So the GPU should be at least 4x faster (using int4, uint4). That holds for simple ALU and FPU operations such as bitwise ops and multiplication; performance on special functions such as transcendentals can differ, because they run only on the 5th unit of each VLIW. I did some additional tests and realized that the GPU is optimized for floating-point operations. I changed the test code as follows:
void kernel simple_add(global const int * A, global const uint * B, global int* C)
{
///------------------------------------------------
/// Add 16 bits of each
int AA=A[get_global_id(0)];
int BB=B[get_global_id(0)];
float AH=0xFFFF0000 & AA;
float AL=0x0000FFFF & AA;
float BH=0xFFFF0000 & BB;
float BL=0x0000FFFF & BB;
int CL=(int)(AL*cos(AL)+BL*sin(BL))&0x0000FFFF;
int CH=(int)(AH*cos(AH)+BH*sin(BH))&0xFFFF0000;
C[get_global_id(0)]=CH|CL;
}
and got the result I expected (about 10 times faster):
For slightly heavier floating-point operations, like this:
void kernel simple_add(global const int * A, global const uint * B, global int* C)
{
///------------------------------------------------
/// Add 16 bits of each
int AA=A[get_global_id(0)];
int BB=B[get_global_id(0)];
float AH=0xFFFF0000 & AA;
float AL=0x0000FFFF & AA;
float BH=0xFFFF0000 & BB;
float BL=0x0000FFFF & BB;
int CL=(int)(AL*(cos(AL)+sin(2*AL)+cos(3*AL)+sin(4*AL)+cos(5*AL)+sin(6*AL))+
BL*(cos(BL)+sin(2*BL)+cos(3*BL)+sin(4*BL)+cos(5*BL)+sin(6*BL)))&0x0000FFFF;
int CH=(int)(AH*(cos(AH)+sin(2*AH)+cos(3*AH)+sin(4*AH)+cos(5*AH)+sin(6*AH))+
BH*(cos(BH)+sin(2*BH)+cos(3*BH)+sin(4*BH)+cos(5*BH)+sin(6*BH)))&0xFFFF0000;
C[get_global_id(0)]=CH|CL;
}
the result was roughly the same:
CPU time: 13335.1815 micro-sec
GPU time: 11865.111 micro-sec
----------------------
CPU time: 13884.0235 micro-sec
GPU time: 11663.889 micro-sec
----------------------
CPU time: 19724.7296 micro-sec
GPU time: 14548.222 micro-sec
----------------------
CPU time: 19945.3199 micro-sec
GPU time: 15331.111 micro-sec
----------------------
CPU time: 17973.5055 micro-sec
GPU time: 11641.444 micro-sec
----------------------
CPU time: 12652.6683 micro-sec
GPU time: 11632 micro-sec
----------------------
CPU time: 18875.292 micro-sec
GPU time: 14783.111 micro-sec
----------------------
CPU time: 32782.033 micro-sec
GPU time: 11650.444 micro-sec
----------------------
CPU time: 20462.2257 micro-sec
GPU time: 11647.778 micro-sec
----------------------
CPU time: 14529.6618 micro-sec
GPU time: 11860.112 micro-sec
----------------------
CPU time: 3905725.933 micro-sec
GPU time: 354543.111 micro-sec
-----------------------------------------
CPU time: 3698211.308 micro-sec
GPU time: 354850.333 micro-sec
-----------------------------------------
CPU time: 3696179.243 micro-sec
GPU time: 354302.667 micro-sec
-----------------------------------------
CPU time: 3692988.914 micro-sec
GPU time: 354764.111 micro-sec
-----------------------------------------
CPU time: 3699645.146 micro-sec
GPU time: 354287.666 micro-sec
-----------------------------------------
CPU time: 3681591.964 micro-sec
GPU time: 357071.889 micro-sec
-----------------------------------------
CPU time: 3744179.707 micro-sec
GPU time: 354249.444 micro-sec
-----------------------------------------
CPU time: 3704143.214 micro-sec
GPU time: 354934.111 micro-sec
-----------------------------------------
CPU time: 3667518.628 micro-sec
GPU time: 354809.222 micro-sec
-----------------------------------------
CPU time: 3714312.759 micro-sec
GPU time: 354883.888 micro-sec
-----------------------------------------
clock() measures CPU time, not wall-clock time. It does not count time the GPU spends running, so the time you measured was probably mostly taken up by the OpenCL API calls. Try clock_gettime, or std::chrono::steady_clock in C++. You also did not mention the unit of "CPU time"; if it is the raw output of clock() (which has to be divided by CLOCKS_PER_SEC to get seconds), then 1200 is indeed a very short period. See also "OpenCL kernel time measurement".

Since I am comparing two execution times, I thought it would not matter whether I use CPU time or wall-clock time. Still, I tried measuring the wall-clock time in microseconds and added those measurements.

The kernel is memory-bound; I doubt you can optimize it much. OpenCL is not a good fit for this kind of workload. If this operation is a pre- or post-processing stage of some other math, that math should go into the kernel too, not just the bit-mixing step.

It is worth noting that GPU architectures are generally optimized for massive floating-point work and pay little attention to integer operations. When a workload involves many integer operations, my more exotic workloads have finished considerably faster on the CPU than on the GPU, even with a sufficiently large GPU.

I did not account for data-transfer time in my measurements. After the copy, the GPU has to execute faster; memory mapping should compare better (I am not sure yet); and then the VLIW microarchitecture wants 4-wide vectors instead of the scalars used. If there were 1M threads, with int4 there would now be only 256k.

Delete clFinish from the loop to finish the tasks faster; put it after the loop. - The last time I tried OpenCL profiling there was no loop (a single run). The results did not change (I added them to the question). - @Afshin running once means it is not warmed up yet; it does better after multiple runs. Enqueue 10 times and call finish once at the end.