Efficient handling of tuples as fixed-size vectors
In Chapel, homogeneous tuples can be used as if they were small "vectors" (e.g., a = b + c * 3.0 + 5.0;).
However, because various math functions are not provided for tuples, I tried writing a norm() function in several ways and compared their performance. My code looks like this:
proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
    var tmp = ( + reduce x**2 );
    return sqrt( tmp );
}

//.........................................................

var a = ( 1.0, 2.0, 3.0 );

// consistency check
writeln( norm_3tuple(     a ) );
writeln( norm_loop(       a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce(     a ) );

config const nloops = 100000000;  // 1E+8

var res = 0.0;
for k in 1 .. nloops
{
    a[ 1 ] = (k % 5): real;

    res += norm_3tuple(     a );
    // res += norm_loop(       a );
    // res += norm_loop_param( a );
    // res += norm_reduce(     a );
}

writeln( "result = ", res );
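I compiled the above code with chpl --fast test.chpl (Chapel v1.16 on OSX 10.11, 4 cores, installed via Homebrew). norm_3tuple(), norm_loop(), and norm_loop_param() then ran at almost the same speed (0.45 s), while norm_reduce() was much slower (about 30 s). Checking the output of the top command, I saw that norm_reduce() was using all 4 cores, while the other functions used only 1 core. So my questions are:
- Is norm_reduce() slow because reduce works in parallel and the overhead of parallel execution is much greater than the net computational cost for this small tuple?
- Given that we want to avoid reduce for 3-tuples, the other three routines run at essentially the same speed. Does this mean that explicit for-loops have negligible cost for 3-tuples (e.g., because of loop unrolling enabled by the --fast option)?
- In norm_loop_param(), I also tried using the param keyword for the loop variable, but this gave little or no performance gain. If we are only interested in homogeneous tuples, is there no need to attach param at all (for performance)?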
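Is norm_reduce() slow because reduce works in parallel and the overhead of parallel execution is much greater than the net computational cost for this small tuple?
I believe you are correct that this is what is happening. Reductions are executed in parallel, and Chapel does not currently attempt any intelligent throttling to squash this parallelism when the work may not warrant it (as in this case). So I think you are paying too much task overhead for tasks that do almost no work other than coordinating with the other tasks (though I am surprised that the difference is this large... but I also find that I have little intuition for such things). In the future, we would want the compiler to serialize such small reductions to avoid these overheads.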
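One way to probe this explanation is to keep the reduce expression but forbid task creation around it. The following is a minimal sketch and not part of the original question or answer: norm_reduce_serial() is a hypothetical name, and the sketch assumes that Chapel's serial statement suppresses the tasks spawned by the reduction over the tuple. Syntax and 1-based tuple handling follow Chapel v1.16, as used in the question.

```chapel
// Hypothetical variant (not from the original post): the same reduction,
// wrapped in a serial statement so that no additional tasks are created.
proc norm_reduce_serial( x ): real
{
    var tmp = 0.0;
    serial true {
        // assumption: inside a serial block the reduction runs on the
        // current task only, instead of being split across cores
        tmp = ( + reduce x**2 );
    }
    return sqrt( tmp );
}

var a = ( 1.0, 2.0, 3.0 );

// consistency check: should agree with the other norm_*() variants
writeln( norm_reduce_serial( a ) );
```

If the timing of this variant drops toward that of the explicit loops, task creation is the dominant cost; if it does not, the remaining overhead lies elsewhere in the reduction machinery. Either outcome has to be measured rather than assumed.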
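Given that we want to avoid reduce for 3-tuples, the other three routines run at essentially the same speed. Does this mean that explicit for-loops have negligible cost for 3-tuples (e.g., because of loop unrolling enabled by the --fast option)?
The Chapel compiler does not unroll the explicit for loop in norm_loop() (you can verify this by inspecting the code generated with the --savec flag), but the back-end compiler may be doing so. Or it may be that the for loop really does not cost that much compared to the unrolled loop of norm_loop_param(). I suspect you would need to inspect the generated assembly to determine which is the case. I would also expect back-end C compilers to do a decent job with the code we generate; for example, it is easy for them to see that this is a 3-iteration loop.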
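In norm_loop_param(), I also tried using the param keyword for the loop variable, but this gave little or no performance gain. If we are only interested in homogeneous tuples, is there no need to attach param at all (for performance)?
It is hard to give a definitive answer to this, because I think it is mostly a question of how good the back-end C compiler is.

Ex-post remark: there was actually a third notable performance surprise at the end.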
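Performance?
Benchmark! ... always, no exceptions, no excuses.
This is what makes Chapel so great. Many thanks to the Chapel team for developing and improving such a great computing tool for HPC over the past decade.
For anyone who loves true-[PARALLEL] work, performance is always the result of both design practice and the underlying system hardware; it is never a "bonus" granted by a mere syntax constructor.
norm_reduce() systematically spends several milliseconds just setting up all the concurrency-enabled reduce machinery, only to later produce a single x**2 product and hand it back to the deferred central + reducer engine. That is quite a remarkable overhead for a 2-clock CPU uop, isn't it?
Benchmarking the code actually brought two surprises at once.
As always, and as SuperComputing2017 HPC promotes reproducibility in every aspect of a published technical paper or benchmark, these results were collected on the online platform sponsored by Try-it Online. All interested enthusiasts are welcome to re-run the Chapel code and post the performance details of their localhost or cluster runs, to better document the hardware-dependent variation of the timings observed above.
Then came a third surprise. While the [SEQ] nloops-ed code was severely hurt by the associated add-on overheads, a small reformulation of the problem showed a very different level of performance.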
+++++++++++++++++++++++++++++++++++++++++++++++ <TiO.IDE>.RUN
3.74166
[SEQ] norm_loop(): 0.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 0.0 [us] -- 3.74166
[PAR]: norm_reduce(): 5677.0 [us] -- 3.74166
3.74166
[SEQ] norm_loop(): 0.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 1.0 [us] -- 3.74166
[PAR]: norm_reduce(): 5818.0 [us] -- 3.74166
3.74166
[SEQ] norm_loop(): 1.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 2.0 [us] -- 3.74166
[PAR]: norm_reduce(): 4886.0 [us] -- 3.74166
+++++++++++++++++++++++++++++++++++++++++++++++ <TiO.IDE>.+CompilerFLAG( "--fast" ).RUN
3.74166
[SEQ] norm_loop(): 1.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 2.0 [us] -- 3.74166
[PAR]: norm_reduce(): 7769.0 [us] -- 3.74166
3.74166
[SEQ] norm_loop(): 0.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 0.0 [us] -- 3.74166
[PAR]: norm_reduce(): 9109.0 [us] -- 3.74166
3.74166
[SEQ] norm_loop(): 1.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 1.0 [us] -- 3.74166
[PAR]: norm_reduce(): 8807.0 [us] -- 3.74166
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_SEQ: Timer;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_PAR: Timer;
proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop(); write( "[SEQ] norm_loop(): ",
        aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
    // NOTE: aStopWATCH_SEQ is shared with norm_loop() and is never clear()-ed,
    //       so the elapsed() value printed below also includes the time
    //       accumulated by earlier start()/stop() pairs on this timer.
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop(); write( "[SEQ] norm_loop_param(): ",
        aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.start();
    var tmp = ( + reduce x**2 );
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.stop(); write( "[PAR]: norm_reduce(): ",
        aStopWATCH_PAR.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}
//.........................................................
var a = ( 1.0, 2.0, 3.0 );
// consistency check
writeln( norm_3tuple( a ) );
writeln( norm_loop( a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce( a ) );
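Whole-loop timings of the [LOOP] kind can be collected with a harness along the following lines. This is a minimal sketch under stated assumptions, not the harness actually used for the listings: it assumes Chapel v1.16 as in the question (1-based tuple indexing, the Timer type from the Time module), reuses the timer name aStopWATCH_LOOP from the forall variant, and only imitates the output format of the listings.

```chapel
use Time;

proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

config const nloops = 1000000;          // override with --nloops=<n> at run time

var a   = ( 1.0, 2.0, 3.0 );
var res = 0.0;
var aStopWATCH_LOOP: Timer;             // times the whole loop, not a single call

aStopWATCH_LOOP.start();
for k in 1 .. nloops
{
    a[ 1 ] = (k % 5): real;             // vary the input so the call cannot be hoisted
    res += norm_3tuple( a );            // swap in another norm_*() variant to compare
}
aStopWATCH_LOOP.stop();

writeln( "[LOOP] norm_3tuple(): ",
         aStopWATCH_LOOP.elapsed( TimeUnits.microseconds ),
         " [us] -- result = ", res, " @ ", nloops, " loops." );
```

Timing the whole loop rather than each call keeps the timer's own start/stop cost out of the measured section, which matters when a single call takes well under a microsecond.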
[LOOP] norm_3tuple(): 45829.0 [us] -- result = 4.30918e+06 @ 1000000 loops.
[LOOP] norm_3tuple(): 241680 [us] -- result = 4.30918e+07 @ 10000000 loops.
[LOOP] norm_3tuple(): 2387080 [us] -- result = 4.30918e+08 @ 100000000 loops.
[LOOP] norm_loop(): 72160.0 [us] -- result = 4.30918e+06 @ 1000000 loops.
[LOOP] norm_loop(): 755959 [us] -- result = 4.30918e+07 @ 10000000 loops.
[LOOP] norm_loop(): 7783740 [us] -- result = 4.30918e+08 @ 100000000 loops.
[LOOP] norm_loop_param(): 34102.0 [us] -- result = 4.30918e+06 @ 1000000 loops.
[LOOP] norm_loop_param(): 365510 [us] -- result = 4.30918e+07 @ 10000000 loops.
[LOOP] norm_loop_param(): 3480310 [us] -- result = 4.30918e+08 @ 100000000 loops.
-------------------------------------------------------------------------1000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 5851380 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 5884600 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6163690 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6029860 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6083730 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6132720 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6012620 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6379020 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 5923550 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6144660 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 8098380 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6215470 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5831670 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6124580 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6092740 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5811260 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5880400 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5898520 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6591110 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5876570 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6034180 [us] -- result = 4309.18 @ 1000 loops. [--fast]
-------------------------------------------------------------------------2000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 12434700 [us] -- result = 8618.36 @ 2000 loops.
-------------------------------------------------------------------------3000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 17807600 [us] -- result = 12927.5 @ 3000 loops.
-------------------------------------------------------------------------4000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 23844300 [us] -- result = 17236.7 @ 4000 loops.
-------------------------------------------------------------------------5000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 30557700 [us] -- result = 21545.9 @ 5000 loops.
[LOOP] norm_reduce(): 30523700 [us] -- result = 21545.9 @ 5000 loops.
[LOOP] norm_reduce(): 29404200 [us] -- result = 21545.9 @ 5000 loops.
[LOOP] norm_reduce(): 29268600 [us] -- result = 21545.9 @ 5000 loops. [--fast]
[LOOP] norm_reduce(): 29009500 [us] -- result = 21545.9 @ 5000 loops. [--fast]
[LOOP] norm_reduce(): 30388800 [us] -- result = 21545.9 @ 5000 loops. [--fast]
-------------------------------------------------------------------------6000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 37070600 [us] -- result = 25855.1 @ 6000 loops.
-------------------------------------------------------------------------7000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 42789200 [us] -- result = 30164.3 @ 7000 loops.
---------------------------------------------------------------------8000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 50572700 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49944300 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49365600 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): ~60+ // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP] norm_reduce(): 50099900 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49445500 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49783800 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 48533400 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 48966600 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 47564700 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 47087400 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 47624300 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): ~60+ [--fast] // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP] norm_reduce(): ~60+ [--fast] // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP] norm_reduce(): 46887700 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46571800 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46794700 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46862600 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 47348700 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46669500 [us] -- result = 34473.4 @ 8000 loops. [--fast]
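The listings above are easier to compare once they are converted to a per-call cost. The arithmetic below uses only the first row shown for each variant at its largest or smallest loop count; because the runs differ in loop count and compiler flags, only the orders of magnitude should be read as meaningful.

```latex
\begin{align*}
t_{\text{norm\_3tuple}}      &\approx \frac{2387080\ \mu\text{s}}{10^{8}\ \text{calls}} \approx 0.024\ \mu\text{s per call} \\
t_{\text{norm\_loop}}        &\approx \frac{7783740\ \mu\text{s}}{10^{8}\ \text{calls}} \approx 0.078\ \mu\text{s per call} \\
t_{\text{norm\_loop\_param}} &\approx \frac{3480310\ \mu\text{s}}{10^{8}\ \text{calls}} \approx 0.035\ \mu\text{s per call} \\
t_{\text{norm\_reduce}}      &\approx \frac{5851380\ \mu\text{s}}{10^{3}\ \text{calls}} \approx 5851\ \mu\text{s per call} \\[4pt]
\frac{t_{\text{norm\_reduce}}}{t_{\text{norm\_3tuple}}} &\approx \frac{5851}{0.024} \approx 2.5 \times 10^{5}
\end{align*}
```

The per-call cost of norm_reduce() stays between roughly 5.9 and 6.3 ms across the 1000 to 8000 loop runs (for example 50572700 / 8000, about 6322 us), so the total time grows linearly with the loop count and the cost is paid again on every call.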
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_LOOP: Timer;
config const nloops = 100000000; // 1E+8
var res: atomic real;
res.write( 0.0 );
//------------------------------------------------------------------// PRE-COMPUTE:
var A1: [1 .. nloops] real;                 // pre-compute a tuple-element value
forall k in 1 .. nloops do                  // pre-compute a tuple-element value
    A1[k] = (k % 5): real;                  // pre-compute a tuple-element value to a ( k % 5 ), ex-post typecast to real
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_LOOP.start();
forall i in 1 .. nloops do
{   //  a[1] = ( i % 5 ): real;             // pre-compute'd
    res.add( norm_reduce( ( A1[i], a[1], a[2] ) ) );               // atomic.add()
    //  res += norm_reduce( ( ( i % 5 ): real, a[1], a[2] ) );     // non-atomic
    //:49: note: The shadow variable 'res' is constant due to forall intents in this loop
}
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_LOOP.stop(); write(
    "forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: ", aStopWATCH_LOOP.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
/*
--------------------------------------------------------------------------------------------------------{-nloops-}-------{--fast}-------------
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 7911.0 [us] -- result = 320.196 @ 100 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 8055.0 [us] -- result = 3201.96 @ 1000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 8002.0 [us] -- result = 32019.6 @ 10000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 80685.0 [us] -- result = 3.20196e+05 @ 100000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 842948 [us] -- result = 3.20196e+06 @ 1000000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 8005300 [us] -- result = 3.20196e+07 @ 10000000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 40358900 [us] -- result = 1.60098e+08 @ 50000000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 40671200 [us] -- result = 1.60098e+08 @ 50000000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 2195000 [us] -- result = 1.60098e+08 @ 50000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4518790 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 6178440 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4755940 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4405480 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4509170 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4736110 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4653610 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4397990 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4655240 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
*/
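The same per-iteration conversion can be applied to the forall listing. The figures below take the first row at each loop count, and the 5851 us per call for the sequential loop is the first 1000-loop norm_reduce() row from the earlier listing; run-to-run variation is visible in the listings themselves, so these are rough estimates.

```latex
\begin{align*}
t_{\text{forall}} &\approx \frac{40358900\ \mu\text{s}}{5 \times 10^{7}\ \text{iterations}} \approx 0.81\ \mu\text{s per iteration} \\
t_{\text{forall, --fast}} &\approx \frac{4518790\ \mu\text{s}}{10^{8}\ \text{iterations}} \approx 0.045\ \mu\text{s per iteration} \\[4pt]
\frac{T_{\text{forall}}}{T_{\text{forall, --fast}}}\bigg|_{5 \times 10^{7}} &\approx \frac{40358900}{2195000} \approx 18 \\[4pt]
\frac{t_{\text{norm\_reduce in for loop}}}{t_{\text{forall, --fast}}} &\approx \frac{5851}{0.045} \approx 1.3 \times 10^{5}
\end{align*}
```

Under these numbers, calling norm_reduce() from inside a forall costs about five orders of magnitude less per call than calling it from the sequential for loop, and the forall variant gains roughly a factor of 18 from --fast at 50000000 loops.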