Optimizing a Fortran subroutine


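I have written a minimal implementation of the fast xoroshiro128plus generator in Fortran to replace the intrinsic random_number. This implementation is quite fast (4x faster than random_number) and the quality is good enough for my purposes; I do not use it in cryptographic applications.

My question is how to optimize this subroutine to get the last bit of performance out of the compiler; even a 10% improvement would be appreciated. The subroutine will be used in tight loops in long simulations. I am more interested in generating a single random number at a time than in generating large vectors or nD arrays at once.

Here is a test program to give you some context on how my subroutine is used: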

program test_xoroshiro128plus
   implicit none
   integer, parameter :: n = 10000
   real*8  :: A(n,n)
   integer :: i, j, t0, t1, count_rate, count_max

   call system_clock(t0, count_rate, count_max)
   do j = 1,n
      do i = 1,n
         call drand128(A(i,j))
      end do
   end do
   ! call drand128(A)  ! works also with 2D 
   call system_clock(t1)

   print *, "Time :", real(t1-t0)/count_rate
   print *, "Mean :", sum(A)/size(A), char(10), A(1:2,1:3)

 contains

   impure elemental subroutine drand128(r)
      real*8, intent(out) :: r
      integer*8 :: s0 = 113, s1 = 19937
      s1 = xor(s0,s1)
      s0 = xor(xor(ior(ishft(s0,55), ishft(s0,-9)),s1), ishft(s1,14))
      s1 = ior(ishft(s1,36), ishft(s1,-28))
      r = ishft(s0+s1, -1) / 9223372036854775808.d0
   end 

end program

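OK, here is my attempt. First, I made it a function: in the x64 ABI and similar ones, a function returns a floating-point value in a register, which is much faster than passing it back through an argument. Second, I replaced the final division with a multiplication, although the Intel compiler might do that for you.

Timings, Intel i7 6820, WSL, Ubuntu 18.04: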

before -   0.850000024
after  -   0.601000011

GNU Fortran 7.3.0, command line:

gfortran -std=gnu -O3 -ffast-math -mavx2 /mnt/c/Users/kkk/Documents/CPP/a.for

Code
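The two changes described in this answer can be sketched as follows. This is a minimal illustration and not the answerer's original listing: the function name drand128_fun is borrowed from the benchmark program further down, the seeds are those from the question, and the standard ieor is used in place of the non-standard xor. Because 2^63 is a power of two, multiplying by its reciprocal gives exactly the same value as dividing, which the loop checks.

```fortran
program sketch_function_form
   use iso_fortran_env, only: int64, real64
   implicit none
   integer :: i
   real(real64) :: r_sub, r_fun

   ! Both generators start from the same seeds, so they stay in step.
   do i = 1, 5
      call drand128(r_sub)
      r_fun = drand128_fun()
      print *, r_sub, r_fun
      if (r_sub /= r_fun) error stop "sequences differ"
   end do

contains

   ! Form from the question: result passed back through an argument,
   ! final scaling done with a division.
   subroutine drand128(r)
      real(real64), intent(out) :: r
      integer(int64), save :: s0 = 113, s1 = 19937
      s1 = ieor(s0, s1)
      s0 = ieor(ieor(ior(ishft(s0, 55), ishft(s0, -9)), s1), ishft(s1, 14))
      s1 = ior(ishft(s1, 36), ishft(s1, -28))
      r = ishft(s0 + s1, -1) / 9223372036854775808.0_real64
   end subroutine drand128

   ! Sketch of the answer's form: result returned as a function value,
   ! final scaling done by multiplying with the constant 2**(-63).
   function drand128_fun() result(r)
      real(real64) :: r
      real(real64), parameter :: c = 1.0_real64 / 9223372036854775808.0_real64
      integer(int64), save :: s0 = 113, s1 = 19937
      s1 = ieor(s0, s1)
      s0 = ieor(ieor(ior(ishft(s0, 55), ishft(s0, -9)), s1), ishft(s1, 14))
      s1 = ior(ishft(s1, 36), ishft(s1, -28))
      r = ishft(s0 + s1, -1) * c
   end function drand128_fun

end program sketch_function_form
```

Whether the function form is actually faster depends on the compiler version and on how it inlines the call, so it is worth timing both on the target setup.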

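Only now do I realize that you are asking about this particular PRNG. I am using it in Fortran myself.

My code in the link is slower than yours because it calls several subroutines, with the aim of being more generic. Let us try to condense the code I use into a single subroutine, though.

Let's compare the performance of your code, the optimized version by @SeverinPappadeux, and my optimized code, using Gfortran 4.8.5: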

> gfortran -cpp -O3 -mtune=native xoroshiro.f90 

 Time drand128 sub:   1.80900002    
 Time drand128 fun:   1.80900002    
 Time rng_uni:   1.32900000

The code is here. Remember to let the CPU spin up: the first iteration of the loop is garbage.

program test_xoroshiro128plus
   use iso_fortran_env       
   implicit none
   integer, parameter :: n = 30000
   real*8  :: A(n,n)
   real*4  :: B(n,n)
   integer :: i, j, k, t0, t1, count_rate, count_max       

   integer(int64) :: s1 = int(Z'1DADBEEFBAADD0D0', int64), s2 = int(Z'5BADD0D0DEADBEEF', int64)

!let the CPU spin-up                                           
do k = 1, 3                                           
   call system_clock(t0, count_rate, count_max)
   do j = 1,n
      do i = 1,n
         call drand128(A(i,j))
      end do
   end do
   ! call drand128(A)  ! works also with 2D 
   call system_clock(t1)

   print *, "Time drand128 sub:", real(t1-t0)/count_rate

   call system_clock(t0, count_rate, count_max)
   do j = 1,n
      do i = 1,n
         A(i,j) = drand128_fun()
      end do
   end do
   ! call drand128(A)  ! works also with 2D 
   call system_clock(t1)

   print *, "Time drand128 fun:", real(t1-t0)/count_rate


   call system_clock(t0, count_rate, count_max)
   do j = 1,n
      do i = 1,n
         call rng_uni(A(i,j))
      end do
   end do
   call system_clock(t1)

   print *, "Time rng_uni:", real(t1-t0)/count_rate
end do

   print *, "Mean :", sum(A)/size(A), char(10), A(1:2,1:3)

 contains

   impure elemental subroutine drand128(r)
      real*8, intent(out) :: r
      integer*8 :: s0 = 113, s1 = 19937
      s1 = xor(s0,s1)
      s0 = xor(xor(ior(ishft(s0,55), ishft(s0,-9)),s1), ishft(s1,14))
      s1 = ior(ishft(s1,36), ishft(s1,-28))
      r = ishft(s0+s1, -1) / 9223372036854775808.d0
   end 

   impure elemental real*8 function drand128_fun()
     real*8, parameter :: c = 1.0d0/9223372036854775808.d0
     integer*8 :: s0 = 113, s1 = 19937
     s1 = xor(s0,s1)
     s0 = xor(xor(ior(ishft(s0,55), ishft(s0,-9)),s1), ishft(s1,14))
     s1 = ior(ishft(s1,36), ishft(s1,-28))
     drand128_fun = ishft(s0+s1, -1) * c
  end

  impure elemental subroutine rng_uni(fn_val)
    real(real64), intent(inout) ::  fn_val
    integer(int64) :: ival

    ival = s1 + s2

    s2 = ieor(s2, s1)
    s1 = ieor( ieor(rotl(s1, 24), s2), shiftl(s2, 16))
    s2 = rotl(s2, 37)    

    ival  = ior(int(Z'3FF0000000000000',int64), shiftr(ival, 12))
    fn_val = transfer(ival, 1.0_real64) - 1;    
  end subroutine

  function rotl(x, k)
    integer(int64) :: rotl
    integer(int64) :: x
    integer :: k

    rotl = ior( shiftl(x, k), shiftr(x, 64-k))
  end function    

end program

The main difference should come from the faster and better way of converting the integers to reals.
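The conversion used in rng_uni can be looked at in isolation. The sketch below assumes IEEE 754 double precision for real64; the helper name bits_to_unit and the sample input are hypothetical and only serve the illustration. The top 52 bits of the integer become the mantissa of a number in [1, 2), and subtracting 1 maps it to [0, 1), so no integer-to-real conversion instruction and no division is needed.

```fortran
program sketch_bits_to_real
   use iso_fortran_env, only: int64, real64
   implicit none
   ! Bit pattern of 1.0_real64: sign 0, biased exponent 3FF (hex), mantissa 0.
   integer(int64), parameter :: one_bits = int(Z'3FF0000000000000', int64)
   integer(int64) :: raw
   real(real64)   :: u

   ! Arbitrary 64-bit sample standing in for the generator output.
   raw = int(Z'0123456789ABCDEF', int64)
   u = bits_to_unit(raw)
   print *, u
   if (u < 0.0_real64 .or. u >= 1.0_real64) error stop "out of range"

   ! All-zero bits map to 0; all-one bits map to the largest value below 1.
   if (bits_to_unit(0_int64) /= 0.0_real64) error stop "lower end"
   if (bits_to_unit(-1_int64) /= 1.0_real64 - epsilon(1.0_real64)) error stop "upper end"

contains

   ! Hypothetical helper: same steps as the last two lines of rng_uni.
   pure function bits_to_unit(bits) result(u)
      integer(int64), intent(in) :: bits
      real(real64) :: u
      integer(int64) :: pattern
      ! Keep the top 52 bits as mantissa and set the exponent of 1.0.
      pattern = ior(one_bits, shiftr(bits, 12))
      ! Reinterpret the bits as a real in [1, 2), then shift to [0, 1).
      u = transfer(pattern, 1.0_real64) - 1.0_real64
   end function bits_to_unit

end program sketch_bits_to_real
```

This variant yields 52 random mantissa bits per number, so the results are multiples of 2^-52.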

If you are bored, you can try to inline rotl() manually, but I trust the compiler here.
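For anyone who wants to try it, manual inlining just means writing the shift expression in place of the call. The sketch below checks that the inlined expression matches rotl for every shift from 1 to 63, and also compares against the standard intrinsic ishftc, which is a possible alternative not mentioned in the thread. Whether a given compiler turns any of these forms into a single rotate instruction is not assumed here and would have to be checked in the generated assembly.

```fortran
program sketch_rotl_inline
   use iso_fortran_env, only: int64
   implicit none
   integer(int64) :: x
   integer :: k

   x = int(Z'1DADBEEFBAADD0D0', int64)   ! seed value taken from the benchmark program
   do k = 1, 63
      ! Manually inlined form of rotl(x, k).
      if (rotl(x, k) /= ior(shiftl(x, k), shiftr(x, 64 - k))) error stop "inline mismatch"
      ! Standard circular shift over all 64 bits (positive shift rotates left).
      if (rotl(x, k) /= ishftc(x, k)) error stop "ishftc mismatch"
   end do
   print *, "all rotations agree"

contains

   ! Same helper as in the benchmark program.
   function rotl(x, k)
      integer(int64) :: rotl
      integer(int64), intent(in) :: x
      integer, intent(in) :: k
      rotl = ior(shiftl(x, k), shiftr(x, 64 - k))
   end function rotl

end program sketch_rotl_inline
```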

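Can you share your compiler and options?

I am afraid this question is too broad to answer here. Since you claim you already have decent performance, anything else we could say would have to be based on things you have not mentioned. For example, what architecture are you using, which compilers, and which compiler flags? What criteria do you have for the accuracy of any optimization-driven changes? And so on.

I use ifort most of the time on Windows 10; the machine has a decent Intel CPU that supports 64-bit operations. I have no specific constraints on accuracy, other than linking with MKL and sometimes enabling /Qparallel. Probably ifort /fast /Qmkl /Qpar main.f90 are the flags I normally use.

@boAmmar, I never said anything about truly random numbers. I only corrected your claim that random_number always returns the same sequence if random_seed is not called first. That is processor-dependent behavior.

No, the Fortran intrinsic does not use the Mersenne Twister. Each Fortran vendor uses whatever algorithm the vendor considers best. gfortran used MT a long time ago; I ripped it out because of its poor quality. gfortran then used 4 independent KISS generators (KISS as in Marsaglia's KISS PRNG). gfortran now uses one of Vigna's xorshift PRNGs.

FYI, gfortran now uses a PRNG that is closely related to this one: "The runtime library implements the xorshift 1024 random number generator (RNG). This generator has a period of 2^{1024}-1, and when using multiple threads up to 2^{512} threads can each generate 2^{512}"

I cannot see any difference in performance in my tests (see my answer). The two seem to perform exactly the same. Did you check the optimized assembly? It seems easy to optimize once inlined.

@VladimirF No, I did not check the assembly; I might do it over the weekend. Still, with the setup I described and the OP's original timing test, I clearly see a difference.

Make sure to run several loops as I did. The CPU core must first realize that full speed is necessary; only after that do speed comparisons make sense.

Could you describe clearly the platform and compiler (gfortran version) you are using? You may have a better compiler that inlines differently, etc.

@SeverinPappadeux I have nothing special or new: gfortran 4.8.5, the default in OpenSuse 42.3, and the CPU is an i7-3770. I also tried Intel Fortran 16.0.1: sub and fun take 1.809 s and rng_uni 1.176 s, repeatably. With gfortran 7 the original version optimizes to 1.465 s, while sub and fun both take 1.385 s.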
