Performance: How to avoid conditionals in loops

performance optimization fortran

In this article, the author gives an example:

subroutine threshold(a, thresh, ic)
  real, dimension(:), intent(in) :: a
  real, intent(in) :: thresh
  integer, intent(out) :: ic
  real :: tt
  integer :: n

  ic = 0
  tt = 0.d0
  n = size(a)
  do j = 1, n
     tt = tt + a(j) * a(j)
     if (sqrt(tt) >= thresh) then
        ic = j
        return
     end if
  end do
end subroutine threshold

The author comments on this code as follows:

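An alternative approach, which would allow for many optimizations (loop unrolling, CPU pipelining, less time spent evaluating the conditional), would involve adding tt in blocks (e.g., blocks of size 128) and checking the conditional after each block. When the condition is satisfied, the last block can be repeated to determine the value of ic.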

What does this mean? Loop unrolling? CPU pipelining? Adding tt in blocks?

How can I optimize the code as the author suggests?

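If the loop is executed in chunks/blocks that fit into the CPU cache, you reduce the number of cache misses and, consequently, the number of cache lines retrieved from memory. This improves the performance of all loops that are limited by memory operations. If the corresponding block size is BLOCKSIZE, this is achieved with: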

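! Only complete blocks are processed here; the outer loop stops at the
! last multiple of BLOCKSIZE so that jj never runs past the end of a
do j = 1, (n/BLOCKSIZE)*BLOCKSIZE, BLOCKSIZE
   do jj = j, j+BLOCKSIZE-1
      tt = tt + a(jj) * a(jj)
   end do
end do

However, this leaves a remainder that is not processed in the main loop. To illustrate this, consider an array of length 1000. The first seven blocks (1-896) are covered by the loop, but the eighth block (897-1024) is not, because it would extend past the end of the array. A separate loop is therefore needed for the remaining elements: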

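! Start one past the last element covered by the complete blocks
do j = (n/BLOCKSIZE)*BLOCKSIZE + 1, n
   ! ...
end do

While there is little point in removing the conditional from the remainder loop, in the blocked main loop it can be moved to the outer loop. Since the inner loop then contains no branch, aggressive optimizations may become applicable.
However, this limits the "precision" of the determined position to one block. To get element-wise precision, the calculation for that block has to be repeated.
Here is the complete code: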

subroutine threshold_block(a, thresh, ic)
  implicit none
  real, dimension(:), intent(in) :: a
  real, intent(in) :: thresh
  integer, intent(out) :: ic
  real :: tt, tt_bak, thresh_sqr
  integer :: n, j, jj
  integer, parameter :: BLOCKSIZE = 128

  ic = 0
  tt = 0.d0
  thresh_sqr = thresh**2
  n = size(a)

  ! Perform the loop in chunks of BLOCKSIZE (complete blocks only)
  do j = 1, (n/BLOCKSIZE)*BLOCKSIZE, BLOCKSIZE
     tt_bak = tt
     do jj = j, j+BLOCKSIZE-1
        tt = tt + a(jj) * a(jj)
     end do
     ! Perform the check on the block level
     if (tt >= thresh_sqr) then
        ! If the threshold is reached, repeat the last block
        ! to determine the exact position
        tt = tt_bak
        do jj = j, j+BLOCKSIZE-1
           tt = tt + a(jj) * a(jj)
           if (tt >= thresh_sqr) then
              ic = jj
              return
           end if
        end do
     end if
  end do

  ! Remainder is treated element-wise, starting one past the last full block
  do j = (n/BLOCKSIZE)*BLOCKSIZE + 1, n
     tt = tt + a(j) * a(j)
     if (tt >= thresh_sqr) then
        ic = j
        return
     end if
  end do
end subroutine threshold_block

Note that today's compilers are very good at creating blocked loops in combination with other optimizations. In my experience, it is hard to get better performance by manually tweaking such simple loops.
Loop blocking is enabled in gfortran with the compiler option -floop-block.
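As a variation on the complete code above, here is a minimal sketch that sums each block with the sum intrinsic and clips the last block with min, so that no separate remainder loop is needed. The module name threshold_sketch and the subroutine name threshold_block_sum are hypothetical and not part of the original code. Like the code above, it compares against thresh**2, which matches the sqrt(tt) >= thresh test only for a non-negative thresh.

```fortran
module threshold_sketch
  implicit none
  private
  public :: threshold_block_sum
  integer, parameter :: BLOCKSIZE = 128   ! assumed block size, as in the text
contains
  subroutine threshold_block_sum(a, thresh, ic)
    real, dimension(:), intent(in) :: a
    real, intent(in) :: thresh
    integer, intent(out) :: ic
    real :: tt, tt_new, thresh_sqr
    integer :: n, j, jj, jend

    ic = 0
    tt = 0.0
    thresh_sqr = thresh**2
    n = size(a)

    do j = 1, n, BLOCKSIZE
       ! Clip the last block so it never runs past the end of the array
       jend = min(j + BLOCKSIZE - 1, n)
       ! Branch-free block sum; the compiler is free to vectorize it
       tt_new = tt + sum(a(j:jend)**2)
       if (tt_new >= thresh_sqr) then
          ! Threshold reached in this block: rescan it element by element
          do jj = j, jend
             tt = tt + a(jj) * a(jj)
             if (tt >= thresh_sqr) then
                ic = jj
                return
             end if
          end do
          ! sum() may round differently from the sequential loop;
          ! in that case fall back to the end of the block
          ic = jend
          return
       end if
       tt = tt_new
    end do
  end subroutine threshold_block_sum
end module threshold_sketch
```

For an array of 300 ones and thresh = 15.0, the first block sums to 128, the second block crosses 225, and the rescan of the second block returns ic = 225. Whether this is faster than the explicit loops depends on the compiler and the machine, so it should be measured rather than assumed.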

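Loop unrolling can be done manually, but should be left to the compiler. The idea is to execute the loop in blocks and, instead of using a second loop as shown above, to perform the operations by duplicating the code. Here is an example for the inner loop given above, unrolled by a factor of four: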

do jj = j, j+BLOCKSIZE-1, 4
   tt = tt + a(jj)   * a(jj)
   tt = tt + a(jj+1) * a(jj+1)
   tt = tt + a(jj+2) * a(jj+2)
   tt = tt + a(jj+3) * a(jj+3)
end do

Here, no remainder occurs if BLOCKSIZE is a multiple of 4. You could probably shave off a few operations in here ;-) The gfortran compiler option that enables this is -funroll-loops.
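One common way to take manual unrolling a step further is to give each unrolled statement its own partial sum, so that the four additions no longer depend on one another. This is not taken from the text above; it is a sketch under that assumption, and the names unroll_sketch and sum_squares_unrolled4 are hypothetical. It also handles array lengths that are not a multiple of four with a short remainder loop.

```fortran
module unroll_sketch
  implicit none
contains
  function sum_squares_unrolled4(a) result(tt)
    real, dimension(:), intent(in) :: a
    real :: tt
    real :: t1, t2, t3, t4
    integer :: n, nmain, j

    n = size(a)
    nmain = (n/4)*4          ! last index covered by complete groups of four
    t1 = 0.0; t2 = 0.0; t3 = 0.0; t4 = 0.0

    ! Unrolled by four with independent accumulators
    do j = 1, nmain, 4
       t1 = t1 + a(j)   * a(j)
       t2 = t2 + a(j+1) * a(j+1)
       t3 = t3 + a(j+2) * a(j+2)
       t4 = t4 + a(j+3) * a(j+3)
    end do
    tt = (t1 + t2) + (t3 + t4)

    ! Remainder (at most three elements) is treated element-wise
    do j = nmain + 1, n
       tt = tt + a(j) * a(j)
    end do
  end function sum_squares_unrolled4
end module unroll_sketch
```

Because the additions are reordered, the result can differ from the strictly sequential sum in the last bits. Inside threshold_block this would only matter for the block-level check; the element-wise rescan still decides the exact position.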

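As far as I know, CPU pipelining cannot be enforced manually in Fortran. This task is left to the compiler.
Pipelining sets up a pipeline of instructions. You feed the complete array into that pipeline and, after the wind-up phase, you get a result with every clock cycle. This drastically increases the throughput.
However, branches are difficult (impossible?) to handle in pipelines, and the array should be long enough to compensate for the time required to set up the pipeline and for the wind-up and wind-down phases.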
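Hmm, I think the first thing to do is to replace "sqrt(tt) >= thresh" with "tt >= thresh_sq" (where thresh_sq = thresh**2 is prepared before entering the loop)...

@roygvib That's smart, +1. I took the liberty of correcting the code so that it actually compiles ;-)

Another possible (simple) optimization to consider here is to pre-compute the squared values into an array of length n (in this example that would be a one-liner, a2 = a**2); your loop would then contain only the addition and the conditional. Of course, the effectiveness of this approach depends on many factors, such as the size of n and how likely the condition is to be met early in the loop.

Nice explanation. It should be emphasized that this kind of work should only be done after you have profiled the code and determined that the particular loop is a critical bottleneck. If you routinely write code like this, there are clear drawbacks in terms of readability/maintenance and the likelihood of introducing bugs.

@agentp I agree. Even then, the block size, the unrolling factor, and especially pipelining depend heavily on the machine the code is executed on. Unless you know exactly what you are doing, the compiler will probably do a better job than you.

@AlexanderVogt Very clear explanation! I learned a lot. But when I tested it, the modified version was only slightly faster than the original.

@AlexanderVogt For an array of 100000000 elements, I tried ifort /fast for maximum optimization: the non-blocked version takes 1.494 s and the blocked version 1.447 s. If we use ifort /Od to disable all optimizations, the non-blocked version takes 4.28 s and the blocked version 4.43 s, so without optimization the blocked version is actually slower than the non-blocked one.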
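The pre-computation suggested in one of the comments above could look roughly like the following sketch. The names precompute_sketch and threshold_presquared are hypothetical; the squares are stored in an automatic array a2, which for very large n may need a heap allocation instead.

```fortran
module precompute_sketch
  implicit none
contains
  subroutine threshold_presquared(a, thresh, ic)
    real, dimension(:), intent(in) :: a
    real, intent(in) :: thresh
    integer, intent(out) :: ic
    real, dimension(size(a)) :: a2   ! automatic array holding the squares
    real :: tt, thresh_sq
    integer :: j

    ic = 0
    tt = 0.0
    thresh_sq = thresh**2            ! assumes thresh >= 0
    a2 = a**2                        ! one branch-free pass over the data

    ! The loop now contains only the addition and the conditional
    do j = 1, size(a2)
       tt = tt + a2(j)
       if (tt >= thresh_sq) then
          ic = j
          return
       end if
    end do
  end subroutine threshold_presquared
end module precompute_sketch
```

Note that this squares all n elements even when the threshold is reached early, which is the trade-off the comment mentions.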