I am trying to improve my run time using Chapel for matrix multiplication


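I am trying to speed up my matrix multiplication. Are there other implementations I could try to make it faster? Here are my results so far: I tried a size of 8192, but it took more than 2 hours and my ssh connection timed out.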

Here is my implementation:

use Random, Time;
var t : Timer;
t.start();

config const size = 10;
var grid : [1..size, 1..size] real;
var grid2 : [1..size, 1..size] real;
var grid3 : [1..size, 1..size] real;

fillRandom(grid);
fillRandom(grid2);

//t.start();
forall i in 1..size {
    forall j in 1..size {
        forall k in 1..size {
            grid3[i,j] += grid[i,k] * grid2[k,j];
        }
    }
}
t.stop();
writeln("Done!:");
writeln(t.elapsed(),"seconds");
writeln("Size of matrix was:", size);
t.clear();

I am comparing the timings against an MPI implementation in C++. Is there a way to distribute my matrices across my two locales, similar to what MPI does?

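Nesting forall loops like this does not give the best performance in our current implementation. The algorithm will run faster if you iterate over a single 2D domain that defines the (i,j) iteration space. Looping serially over k avoids the data race on the updates to grid3[i,j]. For example: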

....
const D2 = {1..size, 1..size};
forall (i,j) in D2 do
  for k in 1..size do
    grid3[i,j] += grid[i,k] * grid2[k,j];
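The fragment above can be turned into a complete program along the following lines. This is only a sketch under stated assumptions: it targets a recent Chapel release (2.x), where the timer type is called stopwatch (the older release used in this thread calls it Timer); the names A, B, C, Cref and maxDiff are illustrative and not from the original post; and the explicit ref intent on C is added so the loop compiles on releases that do not infer it.

```chapel
use Random, Time;

config const size = 64;            // small default so the check below is quick

const D2 = {1..size, 1..size};     // the (i,j) iteration space
var A, B, C, Cref : [D2] real;     // C and Cref start out as 0.0

fillRandom(A);
fillRandom(B);

var t : stopwatch;                 // assumption: Chapel 2.x; older releases use Timer
t.start();                         // time only the multiplication, not the fills
forall (i,j) in D2 with (ref C) do // parallel over output elements
  for k in 1..size do              // serial over k: a single task updates C[i,j]
    C[i,j] += A[i,k] * B[k,j];
t.stop();

// Serial reference, used only to check the parallel result.
for (i,j) in D2 do
  for k in 1..size do
    Cref[i,j] += A[i,k] * B[k,j];

const maxDiff = max reduce abs(C - Cref);
writeln("size = ", size, ", elapsed = ", t.elapsed(), " s, max diff = ", maxDiff);
```

Assuming the file is saved as matmul.chpl (a hypothetical name), it can be built with chpl --fast matmul.chpl and run with ./matmul --size=1024. Because only one task ever writes a given C[i,j], the parallel result should agree with the serial reference.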

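To distribute the matrices, you can use the Block distribution (see, for instance, the example among our examples). When distributing, you will of course need to be mindful of the additional communication between locales.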
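A minimal sketch of what a Block-distributed version might look like follows. It assumes a recent Chapel release (2.x), where the distribution is exposed as blockDist and a distributed domain can be created with blockDist.createDomain; older releases spell this with dmapped Block(boundingBox=...). The array names A, B and C are illustrative.

```chapel
use BlockDist, Random;

config const size = 8;

// Block-distribute the (i,j) index space across all locales.
const D2 = blockDist.createDomain({1..size, 1..size});
var A, B, C : [D2] real;           // each locale stores one block of each array

fillRandom(A);
fillRandom(B);

// Each iteration runs on the locale that owns index (i,j). Reads of A[i,k]
// or B[k,j] that live on another locale turn into communication.
forall (i,j) in D2 with (ref C) do
  for k in 1..size do
    C[i,j] += A[i,k] * B[k,j];

// Report how many elements of C each locale owns.
for loc in Locales do
  on loc do
    writeln("locale ", here.id, " owns ", C.localSubdomain().size,
            " elements of C");
```

With a multi-locale build of Chapel, running the binary with -nl 2 spreads the arrays over two locales. The loop body is unchanged from the single-locale version, which is convenient but also hides the remote reads that make this naive algorithm communication-heavy.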

When measuring performance, make sure to compile with --fast.

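Ex-post remark: see the benchmarked performance edge of the approach Vass proposed, measured on a hosted ecosystem. It would be great to reproduce it on multi-locale silicon to show that it holds more generally at larger scales.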

Performance?
Benchmark it, always, with no exceptions and no excuses. Surprises are not rare, --ccflags -O3 included.

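That is what makes this so great. Many thanks to the Chapel team for developing and improving such a powerful computing tool for HPC over the past decade.

With all due love for true [PARALLEL] efforts, performance is always the result of both design practices and the underlying system hardware, never a "bonus" granted by a mere syntax constructor.

The single-locale results are as follows: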

TiO.run platform uses   1 numLocales,
               having   2 physical CPU-cores accessible (numPU-s)
                 with   2 maxTaskPar parallelism limit

For grid{1,2,3}[ 128, 128] the tested forall sum-product took       3.124 [us] incl. fillRandom()-ops
For grid{1,2,3}[ 128, 128] the tested forall sum-product took       2.183 [us] excl. fillRandom()-ops
For grid{1,2,3}[ 128, 128] the Vass-proposed sum-product took       1.925 [us] excl. fillRandom()-ops

For grid{1,2,3}[ 256, 256] the tested forall sum-product took      28.593 [us] incl. fillRandom()-ops
For grid{1,2,3}[ 256, 256] the tested forall sum-product took      25.254 [us] excl. fillRandom()-ops
For grid{1,2,3}[ 256, 256] the Vass-proposed sum-product took      21.493 [us] excl. fillRandom()-ops

For grid{1,2,3}[1024,1024] the tested forall sum-product took   2.658.560 [us] incl. fillRandom()-ops
For grid{1,2,3}[1024,1024] the tested forall sum-product took   2.604.783 [us] excl. fillRandom()-ops
For grid{1,2,3}[1024,1024] the Vass-proposed sum-product took   2.103.592 [us] excl. fillRandom()-ops

For grid{1,2,3}[2048,2048] the tested forall sum-product took  27.137.060 [us] incl. fillRandom()-ops
For grid{1,2,3}[2048,2048] the tested forall sum-product took  26.945.871 [us] excl. fillRandom()-ops
For grid{1,2,3}[2048,2048] the Vass-proposed sum-product took  25.351.754 [us] excl. fillRandom()-ops

For grid{1,2,3}[2176,2176] the tested forall sum-product took  45.561.399 [us] incl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the tested forall sum-product took  45.375.282 [us] excl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the Vass-proposed sum-product took  41.304.391 [us] excl. fillRandom()-ops

--fast --ccflags -O3

For grid{1,2,3}[2176,2176] the tested forall sum-product took  39.680.133 [us] incl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the tested forall sum-product took  39.494.035 [us] excl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the Vass-proposed sum-product took  44.611.009 [us] excl. fillRandom()-ops

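Results: on a single-locale device (a single virtualised processor with 2 cores, running time-shared public workloads that are all limited to < 60 [s]), the still sub-optimal code (Vass's approach goes further and is about 20% faster) produces, for a grid scale of 2048x2048, results about 4.17x faster than the Nvidia Jetson-hosted computation. This performance edge seems to grow further for larger memory layouts, because the growing scope of memory I/O and CPU L1/L2/L3 cache prefetching (which ought to follow ~O(n^3) scaling) will further favour CPU-based NUMA platforms.

It would be great to see the achievable performance, and how well it complies with ~O(n^3) scaling, on multi-locale devices and native Cray NUMA cluster platforms.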
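As a quick arithmetic check of the ~O(n^3) expectation, the Vass-proposed timings listed above (excl. fillRandom()-ops, without extra compiler flags) can be compared across a doubling of the matrix size. Only the ratios matter here, so the digit grouping used in the listing does not affect the result.

```latex
\[
\frac{T(2n)}{T(n)} \approx \frac{(2n)^3}{n^3} = 8,
\qquad
\frac{T(2048)}{T(1024)} = \frac{25\,351\,754}{2\,103\,592} \approx 12.05,
\qquad
\frac{T(256)}{T(128)} = \frac{21.493}{1.925} \approx 11.17
\]
```

Both observed ratios are above 8, so on this 2-core time-shared host the measured times grow somewhat faster than pure cubic scaling would predict. That may reflect memory and cache effects, or simply noise from the shared platform.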

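In Chapel, forall loops do not automatically distribute work or data across distinct locales (think: compute nodes or memories). Instead, a forall loop invokes the parallel iterator associated with the thing being iterated over.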

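So, if you are iterating over something that is local to a single locale, such as a range (like the uses of 1..size in your code) or a non-distributed domain or array (like grid in your code), all the tasks used to implement the parallel loop will execute locally on the original locale. In contrast, if you are iterating over a distributed domain or array, or invoking a distributed iterator (for example, one provided by a module), the tasks will be distributed across all the locales that the iterand targets.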

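As a result, any Chapel program that never refers to other locales, whether explicitly via on-clauses or implicitly via abstractions that wrap on-clauses (such as the distributed arrays and iterators mentioned above), will never use resources other than those of the initial locale.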
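The difference described above can be observed directly by recording here.id inside each loop. The following is a small sketch, assuming a recent Chapel release (2.x) with blockDist.createDomain; the names whereLocal, whereDist and DistD are illustrative.

```chapel
use BlockDist;

config const n = 8;

// Local range: every iteration runs on the locale where the loop started.
var whereLocal : [1..n] int;
forall i in 1..n with (ref whereLocal) do
  whereLocal[i] = here.id;

// Block-distributed domain: each iteration runs on the locale owning index i.
const DistD = blockDist.createDomain({1..n});
var whereDist : [DistD] int;
forall i in DistD with (ref whereDist) do
  whereDist[i] = here.id;

writeln("local range:        ", whereLocal);
writeln("distributed domain: ", whereDist);

// Explicit on-clause: the other way to reach a different locale.
for loc in Locales do
  on loc do
    writeln("hello from locale ", here.id, " of ", numLocales);
```

When launched on two locales (for example with -nl 2 on a multi-locale build), the first array should contain only the id of the locale the program started on, while the second should show ids of both locales. On a single locale both arrays hold the same id, which matches the point that nothing leaves the initial locale unless the program asks for it.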

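I also wanted to offer a side note on distributed algorithms: even if you were to update the program above to distribute the grid arrays and the forall loops across multiple locales, the triply nested loop approach is rarely the best matrix multiplication algorithm on distributed-memory systems, because it does not optimize well for locality. It would be better to look into matrix multiplication algorithms designed for distributed memory.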

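See also our distributions primer.

I do not agree with the comment that nesting forall loops is bad, but I agree that for shared-memory code, iterating over a 3D domain may be preferable. For distributed memory, I think you need to do more than just use a Block distribution as suggested here. Getting good performance will likely require a radical change to an algorithm designed for distributed memory.

After the original solo-D3-do-{} iterator over the domain was changed to the forall-in-D2-for(k)-do-{} tandem iterator, you may notice that performance on a single-locale device has dropped significantly for smaller matrix sizes. Execution times were measured for both use cases at --size={128 | 256 | 512 | 640}. Because of the known ~60 [s] limit on running Chapel tasks on the TiO.RUN public platform, larger sizes remain untestable, and the problem size cannot be scaled further, without help from @Brad or other Cray Chapel team members.

There should be an option to send a "heartbeat" over the ssh connection, so that benchmark runs lasting hours or more stay connected.

Also, if the resolution of the (monotonic) clock in use is known (it does not seem to be below [us]), the measured times could be printed in scientific notation, such as 8.97355E-1, or in a fixed format such as 20.6f.

The section timed between t.start() and t.stop() should exclude every operation unrelated to the computation (here, certainly both fillRandom() operations), so that the comparison is apples to apples for the problem under review (the scaling, not the random generation and RAM I/O during the fills).

@Brad I hope the error message below may be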

use Random, Time, IO.FormattedIO;
var t1 : Timer;
var t2 : Timer;

t1.start(); //----------------------------------------------------------------[1]

config const size = 2048;

var grid1 : [1..size, 1..size] real;
var grid2 : [1..size, 1..size] real;
var grid3 : [1..size, 1..size] real;

fillRandom(grid1);
fillRandom(grid2);

t2.start(); //====================================[2]

forall i in 1..size {
    forall j in 1..size {
        forall k in 1..size {
            grid3[i,j] += grid1[i,k] * grid2[k,j];
        }
    }
}

t2.stop(); //=====================================[2]
t1.stop(); //-----------------------------------------------------------------[1]
             writef( "For grid{1,2,3}[%4i,%4i] the tested forall sum-product took %12i [us] incl. fillRandom()-ops\n",
                      size,
                      size,
                      t1.elapsed( TimeUnits.microseconds )
                      );
             writef( "For grid{1,2,3}[%4i,%4i] the tested forall sum-product took %12i [us] excl. fillRandom()-ops\n",
                      size,
                      size,
                      t2.elapsed( TimeUnits.microseconds )
                      );
///////////////////////////////////////////////////////////////////////////////////
t1.clear();
t2.clear();

const D3 = {1..size, 1..size, 1..size};

t2.start(); //====================================[3]

forall (i,j,k) in D3 do
  grid3[i,j] += grid1[i,k] * grid2[k,j];

t2.stop(); //=====================================[3]
             writef( "For grid{1,2,3}[%4i,%4i] the Vass-proposed sum-product took %12i [us] excl. fillRandom()-ops\n",
                      size,
                      size,
                      t2.elapsed( TimeUnits.microseconds )
                      );
//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\
//       TiO.run platform uses   1 numLocales, having   2 physical CPU-cores accessible (numPU-s) with   2 maxTaskPar parallelism limit
writef( "TiO.run platform uses %3i numLocales, having %3i physical CPU-cores accessible (numPU-s) with %3i maxTaskPar parallelism limit\n",
                      numLocales,
                      here.numPUs( false, true ),
                      here.maxTaskPar
                      );