Efficient handling of tuples as fixed-size vectors

Tags: performance, parallel-processing, tuples, chapel, parallelism-amdahl


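In Chapel, homogeneous tuples can be used like small "vectors" (for example, `a = b + c * 3.0 + 5.0;`). However, because various math functions are not provided for tuples, I tried writing a `norm()` function in several ways and compared their performance. My code looks like this: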

proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
    var tmp = ( + reduce x**2 );
    return sqrt( tmp );
}

//.........................................................

var a = ( 1.0, 2.0, 3.0 );

// consistency check
writeln( norm_3tuple(     a ) );
writeln( norm_loop(       a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce(     a ) );

config const nloops = 100000000;  // 1E+8

var res = 0.0;
for k in 1 .. nloops
{
    a[ 1 ] = (k % 5): real;

    res += norm_3tuple(     a );
 // res += norm_loop(       a );
 // res += norm_loop_param( a );
 // res += norm_reduce(     a );
}

writeln( "result = ", res );

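I compiled the above code with `chpl --fast test.chpl` (Chapel v1.16 on OS X 10.11, 4 cores, installed via Homebrew). `norm_3tuple()`, `norm_loop()`, and `norm_loop_param()` then ran at almost the same speed (0.45 s), while `norm_reduce()` was much slower (about 30 s). I checked the output of the `top` command: `norm_reduce()` was using all 4 cores, while the other functions used only 1 core. So my questions are: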

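Is `norm_reduce()` slow because `reduce` works in parallel and the overhead of parallel execution is much larger than the net computational cost for this small tuple?

Given that we want to avoid `reduce` for 3-tuples, the other three routines run at essentially the same speed. Does this mean that an explicit for-loop has negligible cost for 3-tuples (for example, via loop unrolling enabled by the `--fast` option)?

In `norm_loop_param()`, I also tried using the `param` keyword for the loop variable, but this gave little or no performance gain. If we are only interested in homogeneous tuples, is it unnecessary to attach `param` at all (for performance)?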

I am sorry for asking so many questions at once. I would appreciate any advice on handling small tuples efficiently. Thank you very much!

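> Is `norm_reduce()` slow because `reduce` works in parallel and the overhead of parallel execution is much larger than the net computational cost for this small tuple?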

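I believe you are right that this is what is happening. Reductions are executed in parallel, and Chapel currently makes no attempt at intelligent throttling to squash that parallelism when the work may not warrant it (as in this case). So I think you are paying for too much task overhead, with the tasks doing almost no work other than coordinating with one another (though I am surprised that the difference is so large... but I also find that I have little intuition for such things). In the future, we hope the compiler will serialize such small reductions to avoid these overheads.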
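As a rough illustration of the scale involved, the timings reported in the question can be turned into per-iteration costs. This is only back-of-the-envelope arithmetic, assuming both figures were measured with the default `nloops = 1E+8` and ignoring the cost of the surrounding loop body:

```latex
\begin{aligned}
t_{\text{loop}}   &\approx \frac{0.45\ \text{s}}{10^{8}} = 4.5\ \text{ns per iteration},\\
t_{\text{reduce}} &\approx \frac{30\ \text{s}}{10^{8}}   = 300\ \text{ns per iteration},\\
\frac{t_{\text{reduce}}}{t_{\text{loop}}} &\approx 67.
\end{aligned}
```

Under those assumptions, each call to `norm_reduce()` costs roughly 67 times as much as the serial variants, even though the useful arithmetic is the same three multiplications and two additions.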

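> Given that we want to avoid `reduce` for 3-tuples, the other three routines run at essentially the same speed. Does this mean that an explicit for-loop has negligible cost for 3-tuples (for example, via loop unrolling enabled by the `--fast` option)?

The Chapel compiler does not unroll the explicit for-loop in `norm_loop()` (you can verify this by inspecting the code generated with the `--savec` flag), but it may be that the back-end compiler does. Or it may be that the for-loop really is not much more expensive than the unrolled loop of `norm_loop_param()`. I suspect you would need to inspect the generated assembly to determine which is the case. But I would also expect back-end C compilers to do a decent job with the code we generate; for example, it is easy for them to see that this is a 3-iteration loop.

> In `norm_loop_param()`, I also tried using the `param` keyword for the loop variable, but this gave little or no performance gain. If we are only interested in homogeneous tuples, is it unnecessary to attach `param` at all (for performance)?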

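It is hard to give a definitive answer here, because I think it is mostly a question of how good the back-end C compiler is.

Ex-post remark: there was actually a third notable performance surprise at the end.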

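Performance?
Benchmark! ... always, no exceptions, no excuses.

That is what makes Chapel so great. Many thanks to the Chapel team for developing and improving such a great computing tool for HPC over the past decade.

With all due love for true-[PARALLEL] work: performance is always the result of both the design practices and the underlying system hardware, never a "bonus" granted by a mere syntax constructor.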

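`norm_reduce()` systematically spends several milliseconds just setting up all the concurrency-enabled `reduce` computing facilities, only to later produce a single `x**2` product and return it to the deferred central `+`-reducer engine. That is quite a lot of overhead for a 2-clock CPU uop, isn't it?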

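There is a reason to benchmark the code: doing so actually brought two surprises at once.

As always, and as SuperComputing2017 HPC promotes, reproducibility matters in every aspect published in a technical paper or benchmark test. These results were collected on the online platform sponsored by Try-it Online. All interested enthusiasts are welcome to re-run the Chapel code and post performance details from their own localhost or cluster, so as to better document the hardware-dependent variation of the timings observed above.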

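The third surprise came from the fact that, although the [SEQ] `nloops`-ed code was severely penalized by the associated add-on overheads, a slightly reformulated problem showed a very different level of performance.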
+++++++++++++++++++++++++++++++++++++++++++++++ <TiO.IDE>.RUN
3.74166
[SEQ] norm_loop(): 0.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 0.0 [us] -- 3.74166
[PAR]: norm_reduce(): 5677.0 [us] -- 3.74166

3.74166
[SEQ] norm_loop(): 0.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 1.0 [us] -- 3.74166
[PAR]: norm_reduce(): 5818.0 [us] -- 3.74166

3.74166
[SEQ] norm_loop(): 1.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 2.0 [us] -- 3.74166
[PAR]: norm_reduce(): 4886.0 [us] -- 3.74166

+++++++++++++++++++++++++++++++++++++++++++++++ <TiO.IDE>.+CompilerFLAG( "--fast" ).RUN
3.74166
[SEQ] norm_loop(): 1.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 2.0 [us] -- 3.74166
[PAR]: norm_reduce(): 7769.0 [us] -- 3.74166

3.74166
[SEQ] norm_loop(): 0.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 0.0 [us] -- 3.74166
[PAR]: norm_reduce(): 9109.0 [us] -- 3.74166

3.74166
[SEQ] norm_loop(): 1.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 1.0 [us] -- 3.74166
[PAR]: norm_reduce(): 8807.0 [us] -- 3.74166

/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_SEQ: Timer;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_PAR: Timer;

// Note: the timers are never clear()-ed, so elapsed() reports the time
// accumulated over all start()/stop() pairs since each timer was created.

proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
    /* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
    /* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop();
    write( "[SEQ] norm_loop(): ",
           aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
    /* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
    /* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop();
    write( "[SEQ] norm_loop_param(): ",
           aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
    /* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.start();
    var tmp = ( + reduce x**2 );
    /* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.stop();
    write( "[PAR]: norm_reduce(): ",
           aStopWATCH_PAR.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

//.........................................................

var a = ( 1.0, 2.0, 3.0 );

// consistency check
writeln( norm_3tuple(     a ) );
writeln( norm_loop(       a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce(     a ) );

[LOOP] norm_3tuple(): 45829.0 [us] -- result = 4.30918e+06 @ 1000000 loops.
[LOOP] norm_3tuple(): 241680 [us] -- result = 4.30918e+07 @ 10000000 loops.
[LOOP] norm_3tuple(): 2387080 [us] -- result = 4.30918e+08 @ 100000000 loops.

[LOOP] norm_loop(): 72160.0 [us] -- result = 4.30918e+06 @ 1000000 loops.
[LOOP] norm_loop(): 755959 [us] -- result = 4.30918e+07 @ 10000000 loops.
[LOOP] norm_loop(): 7783740 [us] -- result = 4.30918e+08 @ 100000000 loops.

[LOOP] norm_loop_param(): 34102.0 [us] -- result = 4.30918e+06 @ 1000000 loops.
[LOOP] norm_loop_param(): 365510 [us] -- result = 4.30918e+07 @ 10000000 loops.
[LOOP] norm_loop_param(): 3480310 [us] -- result = 4.30918e+08 @ 100000000 loops.

--------1000--------{--fast}--------
[LOOP] norm_reduce(): 5851380 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 5884600 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6163690 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6029860 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6083730 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6132720 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6012620 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6379020 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 5923550 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6144660 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 8098380 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6215470 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5831670 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6124580 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6092740 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5811260 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5880400 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5898520 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6591110 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5876570 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6034180 [us] -- result = 4309.18 @ 1000 loops. [--fast]
--------2000--------{--fast}--------
[LOOP] norm_reduce(): 12434700 [us] -- result = 8618.36 @ 2000 loops.
--------3000--------{--fast}--------
[LOOP] norm_reduce(): 17807600 [us] -- result = 12927.5 @ 3000 loops.
--------4000--------{--fast}--------
[LOOP] norm_reduce(): 23844300 [us] -- result = 17236.7 @ 4000 loops.
--------5000--------{--fast}--------
[LOOP] norm_reduce(): 30557700 [us] -- result = 21545.9 @ 5000 loops.
[LOOP] norm_reduce(): 30523700 [us] -- result = 21545.9 @ 5000 loops.
[LOOP] norm_reduce(): 29404200 [us] -- result = 21545.9 @ 5000 loops.
[LOOP] norm_reduce(): 29268600 [us] -- result = 21545.9 @ 5000 loops. [--fast]
[LOOP] norm_reduce(): 29009500 [us] -- result = 21545.9 @ 5000 loops. [--fast]
[LOOP] norm_reduce(): 30388800 [us] -- result = 21545.9 @ 5000 loops. [--fast]
--------6000--------{--fast}--------
[LOOP] norm_reduce(): 37070600 [us] -- result = 25855.1 @ 6000 loops.
--------7000--------{--fast}--------
[LOOP] norm_reduce(): 42789200 [us] -- result = 30164.3 @ 7000 loops.
--------8000--------{--fast}--------
[LOOP] norm_reduce(): 50572700 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49944300 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49365600 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): ~60+ // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP] norm_reduce(): 50099900 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49445500 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49783800 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 48533400 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 48966600 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 47564700 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 47087400 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 47624300 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): ~60+ [--fast] // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP] norm_reduce(): ~60+ [--fast] // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP] norm_reduce(): 46887700 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46571800 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46794700 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46862600 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 47348700 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46669500 [us] -- result = 34473.4 @ 8000 loops. [--fast]
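Read as per-call costs, the TiO timings above give the following rough picture. This is illustrative arithmetic on the first 1000-loop `norm_reduce()` run and the 1E+8-loop `norm_3tuple()` run, not an additional measurement:

```latex
\begin{aligned}
t_{\text{reduce}} &\approx \frac{5.85\ \text{s}}{10^{3}\ \text{calls}} \approx 5.9\ \text{ms per call},\\
t_{\text{3tuple}} &\approx \frac{2.39\ \text{s}}{10^{8}\ \text{calls}} \approx 24\ \text{ns per call},\\
\frac{t_{\text{reduce}}}{t_{\text{3tuple}}} &\approx 2.5\times 10^{5}.
\end{aligned}
```

The per-call cost of `norm_reduce()` also stays close to 6 ms from 1000 up to 8000 loops, which is what one would expect if nearly all of the time is a fixed setup cost paid on every reduction.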

/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_LOOP: Timer;

config const nloops = 100000000;  // 1E+8

var res: atomic real;
    res.write( 0.0 );
//------------------------------------------------------------------// PRE-COMPUTE:
var A1: [1 .. nloops] real;       // pre-compute a tuple-element value
forall k in 1 .. nloops do        // pre-compute a tuple-element value
    A1[k] = (k % 5): real;        // pre-compute a tuple-element value to a ( k % 5 ), ex-post typecast to real

/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_LOOP.start();
forall i in 1 .. nloops do
{  // a[1] = ( i % 5 ): real;                                // pre-compute'd
   res.add( norm_reduce( ( A1[i], a[1], a[2] ) ) );          // atomic.add()
// res +=   norm_reduce( ( ( i % 5 ): real, a[1], a[2] ) );  // non-atomic
//:49: note: The shadow variable 'res' is constant due to forall intents in this loop
}
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_LOOP.stop();
write( "forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: ",
       aStopWATCH_LOOP.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
/*
 --------{-nloops-}--------{--fast}--------
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 7911.0 [us] -- result = 320.196 @ 100 loops.
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 8055.0 [us] -- result = 3201.96 @ 1000 loops.
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 8002.0 [us] -- result = 32019.6 @ 10000 loops.
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 80685.0 [us] -- result = 3.20196e+05 @ 100000 loops.
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 842948 [us] -- result = 3.20196e+06 @ 1000000 loops.
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 8005300 [us] -- result = 3.20196e+07 @ 10000000 loops.
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 40358900 [us] -- result = 1.60098e+08 @ 50000000 loops.
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 40671200 [us] -- result = 1.60098e+08 @ 50000000 loops.
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 2195000 [us] -- result = 1.60098e+08 @ 50000000 loops. [--fast]
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4518790 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 6178440 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4755940 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4405480 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4509170 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4736110 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4653610 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4397990 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
 forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4655240 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
*/