Optimizing Perlin noise in Haskell


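(Dependencies for this program: vector --any and JuicyPixels >= 2. The full code is available separately.)

I tried to port Perlin noise to Haskell, but I'm not entirely sure that my method is correct. The main part should generalize nicely to higher and lower dimensions, but that is something for later: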

perlin3 :: (Ord a, Num a, RealFrac a, V.Unbox a) => Permutation -> (a, a, a) -> a
perlin3 p (!x', !y', !z')
  = let (!xX, !x) = actuallyProperFraction x'
        (!yY, !y) = actuallyProperFraction y'
        (!zZ, !z) = actuallyProperFraction z'

        !u = fade x
        !v = fade y
        !w = fade z

        !h = xX
        !a = next p h + yY
        !b = next p (h+1) + yY
        !aa = next p a + zZ
        !ab = next p (a+1) + zZ
        !ba = next p b + zZ
        !bb = next p (b+1) + zZ
        !aaa = next p aa
        !aab = next p (aa+1)
        !aba = next p ab
        !abb = next p (ab+1)
        !baa = next p ba
        !bab = next p (ba+1)
        !bba = next p bb
        !bbb = next p (bb+1)

    in
        lerp w
            (lerp v
                (lerp u
                    (grad aaa (x, y, z))
                    (grad baa (x-1, y, z)))
                (lerp u
                    (grad aba (x, y-1, z))
                    (grad bba (x-1, y-1, z))))
            (lerp v
                (lerp u
                    (grad aab (x, y, z-1))
                    (grad bab (x-1, y, z-1)))
                (lerp u
                    (grad abb (x, y-1, z-1))
                    (grad bbb (x-1, y-1, z-1))))

This is of course accompanied by a few helper functions used in perlin3, which I hope are as efficient as possible:

fade :: (Ord a, Num a) => a -> a
fade !t | 0 <= t, t <= 1 = t * t * t * (t * (t * 6 - 15) + 10)

lerp :: (Ord a, Num a) => a -> a -> a -> a
lerp !t !a !b | 0 <= t, t <= 1 = a + t * (b - a)

grad :: (Bits hash, Integral hash, Num a, V.Unbox a) => hash -> (a, a, a) -> a
grad !hash (!x, !y, !z) = dot3 (vks `V.unsafeIndex` fromIntegral (hash .&. 15)) (x, y, z)
  where
    vks = V.fromList
        [ (1,1,0), (-1,1,0), (1,-1,0), (-1,-1,0)
        , (1,0,1), (-1,0,1), (1,0,-1), (-1,0,-1)
        , (0,1,1), (0,-1,1), (0,1,-1), (0,-1,-1)
        , (1,1,0), (-1,1,0), (0,-1,1), (0,-1,-1)
        ]

dot3 :: Num a => (a, a, a) -> (a, a, a) -> a
dot3 (!x0, !y0, !z0) (!x1, !y1, !z1) = x0 * x1 + y0 * y1 + z0 * z1

-- Unlike `properFraction`, `actuallyProperFraction` rounds as intended.
actuallyProperFraction :: (RealFrac a, Integral b) => a -> (b, a)
actuallyProperFraction x
  = let (ipart, fpart) = properFraction x
        r = if x >= 0 then (ipart, fpart)
                      else (ipart-1, 1+fpart)
    in r
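
A side note on actuallyProperFraction: for a negative whole number such as -2 it returns (-3, 1.0), so the fractional part can reach 1 (the guards in fade and lerp still accept that value). Below is a minimal sketch of an alternative built on floor, which keeps the fractional part in [0, 1) up to floating-point rounding. The name floorFraction is made up for this illustration and is not part of the original program.

```haskell
-- Hypothetical alternative to actuallyProperFraction (not from the original code).
-- Splits x into its floor and the remaining fractional part.
floorFraction :: (RealFrac a, Integral b) => a -> (b, a)
floorFraction x =
  let i = floor x                 -- largest integer not greater than x
  in (i, x - fromIntegral i)      -- remainder is non-negative
```

For example, floorFraction (-1.25 :: Double) gives (-2, 0.75), and floorFraction (-2 :: Double) gives (-2, 0.0) rather than (-3, 1.0).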

All of this is tied together with JuicyPixels:

main = do
    [target] <- getArgs
    let image = P.generateImage pixelRenderer 512 512
    P.writePng target image
  where
    pixelRenderer, pixelRenderer' :: Int -> Int -> Word8
    pixelRenderer !x !y
        = floor $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
          (fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128

    -- This code is much more readable, but also much slower.
    pixelRenderer' x y
        = (\w -> floor $ ((w+1)/2 * 128)) -- w should be in [-1,+1]
        . perlin3 permutation
        . (\(x,y,z) -> ((x-256)/32, (y-256)/32, (z-256)/32))
        $ (fromIntegral x, fromIntegral y, 0 :: Double)


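My problem is that perlin3 seems very slow to me. When I profile the program, pixelRenderer also takes a lot of time, but I'll ignore that for now. I don't know how to optimize perlin3. I tried to hint GHC with bang patterns, which cut the execution time in half, so that's nice. Explicit specialization and inlining barely help with ghc -O. Is perlin3 supposed to be this slow?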

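UPDATE: An earlier version of this question mentioned a bug in my code. That problem has been resolved; it turns out my earlier version was in fact wrong. It implicitly rounded the integral part of a floating-point number to Word8 and then subtracted it from the floating-point number to get the fractional part. Since Word8 can only hold values between 0 and 255, this does not work properly for numbers outside that range, including negative numbers.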
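
The wrap-around described in the update can be reproduced in a few lines. This is a hypothetical reconstruction of that kind of bug, since the original buggy code is not shown in the question; the name buggySplit is invented here.

```haskell
import Data.Word (Word8)

-- Hypothetical reconstruction, not the author's original code.
-- The integral part is squeezed into a Word8, which wraps modulo 256.
buggySplit :: Double -> (Word8, Double)
buggySplit x =
  let i = fromIntegral (floor x :: Int) :: Word8  -- wraps outside 0..255
  in (i, x - fromIntegral i)                      -- "fraction" is wrong after a wrap

-- buggySplit 3.25   == (3, 0.25)      -- fine inside 0..255
-- buggySplit 300.25 == (44, 256.25)   -- 300 wraps to 44
-- buggySplit (-1.5) == (254, -255.5)  -- -2 wraps to 254
```

Inside 0..255 the result is correct; outside that range the supposed fractional part is no longer between 0 and 1.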
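
This code appears to be mostly computation-bound. It can be improved a little, but not by much unless there is a way to use fewer array lookups and less arithmetic.

There are two useful tools for measuring performance: profiling and code dumps. I added an SCC annotation to perlin3 so that it would show up in the profile. Then I compiled with ghc -O2 -fforce-recomp -ddump-simpl -prof -auto. The -ddump-simpl flag prints the simplified code.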
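
For readers who have not used SCC annotations before, here is a minimal sketch of what one looks like. The function sumSquares is purely illustrative and does not come from the program above; the pragma is ignored when the module is compiled without profiling.

```haskell
-- Illustrative function, not from the original program.
-- The SCC pragma names a cost centre; when the program is built with -prof
-- and run with +RTS -p, time spent in this expression is reported under it.
sumSquares :: Int -> Int
sumSquares n = {-# SCC "sumSquares" #-} sum [k * k | k <- [1 .. n]]
```

In the answer's case the annotation was placed on perlin3 so that it appears as its own entry in the profile.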
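
Profiling: On my computer, it takes 0.60 seconds to run the program, and according to the profile about 20% of the execution time (0.12 seconds) is spent in perlin3. Note that the precision of my profile information is about +/-3%.

Simplifier output: The simplifier produces fairly clean code. perlin3 gets inlined into pixelRenderer, so that is the part of the output you want to look at. Most of the code consists of unboxed array reads and unboxed arithmetic. To improve performance, we want to eliminate some of this arithmetic.

An easy change is to eliminate the run-time checks on SomeFraction (which does not appear in your question, but is part of the code that you uploaded). This reduces the program's execution time to 0.56 seconds.
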
-- someFraction t | 0 <= t, t < 1 = SomeFraction t
someFraction t = SomeFraction t

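The primop narrow8Word# is used to coerce an Int to a Word8; it shows up in the simplifier output for the array reads (the fragment is reproduced near the end of this page). We can get rid of this coercion by using Int instead of Word8 in the definition of next:
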
next :: Permutation -> Int -> Int
next (Permutation !v) !idx'
  = fromIntegral $ v `V.unsafeIndex` (fromIntegral idx' .&. 0xFF)

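This reduces the program's execution time to 0.54 seconds. Considering only the time spent in perlin3, the execution time has fallen (roughly) from 0.12 to 0.06 seconds. Although it is hard to measure where the rest of the time is going, it is most likely spread out among the remaining arithmetic and array accesses.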
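
The reason Int is safe in next is the .&. 0xFF mask: it keeps the index in 0..255 in the same way that Word8 arithmetic wraps by itself. A small sketch of that equivalence follows, using only base; the names indexWord8 and indexInt are invented for the illustration.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Index computed with Word8 arithmetic: h + 1 wraps around at 256 by itself.
indexWord8 :: Word8 -> Int
indexWord8 h = fromIntegral (h + 1)

-- Index computed with Int arithmetic: an explicit mask keeps it in 0..255.
indexInt :: Int -> Int
indexInt h = (h + 1) .&. 0xFF
```

For every h in 0..255 the two functions agree, and indexInt also stays in range for larger or negative inputs, so the narrowing step is not needed.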
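
On my machine, the reference code with heatsink's optimizations takes 0.19 seconds.

First, I moved from JuicyPixels to yarr and yarr-image-io, compiling with my favorite flags: -Odph -rtsopts -threaded -fno-liberate-case -funbox-strict-fields -fexpose-all-unfoldings -funfolding-keeness-factor1000 -fsimpl-tick-factor=500 -fllvm -optlo-O3. The yarr version of main is listed at the end of this page.

Second, I replaced the uses of the standard floor with doubleToByte (also listed at the end of this page). This is a known issue (search for "haskell floor performance"). Execution time drops to 52 ms (0.052 seconds), almost a factor of 3.

Finally, just for fun, I tried to compute the noise in parallel (dComputeP instead of dComputeS, and +RTS -N4 on the command line). The program took 36 ms, including an I/O constant of about 10 ms.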
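
How did you profile it? Profiling with -auto-all disables some optimizations in order to profile more precisely. With -auto-all, as opposed to -auto, I get a slowdown factor of 2.5.

I compiled with ghc -O -o /tmp/IPerlin -prof -rtsopts -auto-all -caf-all -fforce-recomp IPerlin.lhs and then ran it as /tmp/IPerlin +RTS -p -RTS /tmp/output.png. -auto is indeed a lot faster, but now the profiling report contains barely any information (no mention of perlin3). Also, I barely know what I should be looking for.

I think grad can be improved by using a different type for vks. The Unbox instance for tuples actually stores them as a tuple of arrays. If you create a triple type and an Unbox instance that stores the values contiguously, that should be an improvement. Making your triple strict would also simplify some of the other code.

So I guess I should focus on optimizing grad, dot3 and the permutation functions, if anything. Thanks for taking the time to look into it :) Note that the someFraction run-time check is only there to see whether my assumptions about certain values are (probably) correct, so in production code the check should indeed be removed. I'll look into it further tomorrow.

Although this does not optimize the Perlin noise function itself, it greatly reduces the total execution time: replace floor with doubleToByte (listed at the end of this page).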
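
The comment about the type used for vks can be taken one step further. The sketch below is not the Unbox instance suggested in the comments (writing one by hand is lengthy); it swaps in a different technique, a strict triple plus direct case analysis on the hash, which avoids the vector lookup altogether. The names V3 and gradCase are invented here, and whether this is actually faster than the table lookup would have to be measured.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Bits ((.&.))

-- Hypothetical strict triple; the fields are unpacked into the constructor.
data V3 = V3 {-# UNPACK #-} !Double {-# UNPACK #-} !Double {-# UNPACK #-} !Double

-- Dot product of the selected gradient vector with (x, y, z), written out
-- case by case. The 16 cases follow the order of the vks table in grad.
gradCase :: Int -> V3 -> Double
gradCase !hash (V3 x y z) =
  case hash .&. 15 of
    0  ->  x + y     -- ( 1, 1, 0)
    1  -> -x + y     -- (-1, 1, 0)
    2  ->  x - y     -- ( 1,-1, 0)
    3  -> -x - y     -- (-1,-1, 0)
    4  ->  x + z     -- ( 1, 0, 1)
    5  -> -x + z     -- (-1, 0, 1)
    6  ->  x - z     -- ( 1, 0,-1)
    7  -> -x - z     -- (-1, 0,-1)
    8  ->  y + z     -- ( 0, 1, 1)
    9  -> -y + z     -- ( 0,-1, 1)
    10 ->  y - z     -- ( 0, 1,-1)
    11 -> -y - z     -- ( 0,-1,-1)
    12 ->  x + y     -- ( 1, 1, 0)
    13 -> -x + y     -- (-1, 1, 0)
    14 -> -y + z     -- ( 0,-1, 1)
    _  -> -y - z     -- ( 0,-1,-1), hash .&. 15 == 15
```

Because every gradient component is 0, 1 or -1, each dot product reduces to one addition or subtraction, so no multiplications are needed.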

Fragment of the simplifier output for one of the array reads, showing the narrow8Word# coercion:

                 case GHC.Prim.indexWord8Array#
                        ipv3_s23a
                        (GHC.Prim.+#
                           ipv1_s21N
                           (GHC.Prim.word2Int#
                              (GHC.Prim.and#
                                 (GHC.Prim.narrow8Word#
                                    (GHC.Prim.plusWord# ipv5_s256 (__word 1)))
                                 (__word 255))))

The yarr version of main:

import Data.Yarr as Y
import Data.Yarr.IO.Image as Y
...

main = do
    [target] <- getArgs
    image <- dComputeS $ fromFunction (512, 512) (return . pixelRenderer)
    Y.writeImage target (Grey image)
  where
    pixelRenderer, pixelRenderer' :: Dim2 -> Word8
    pixelRenderer (y, x)
        = floor $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
          (fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128

    -- This code is much more readable, but also much slower.
    pixelRenderer' (y, x)
        = (\w -> floor $ ((w+1)/2 * 128)) -- w should be in [-1,+1]
        . perlin3 permutation
        . (\(x,y,z) -> ((x-256)/32, (y-256)/32, (z-256)/32))
        $ (fromIntegral x, fromIntegral y, 0 :: Double)

The replacement for floor:

doubleToByte :: Double -> Word8
doubleToByte f = fromIntegral (truncate f :: Int)
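
Swapping floor for truncate only preserves the result when the input is non-negative, which holds here as long as the noise value w stays in [-1, +1], so that (w+1)/2 * 128 lies in [0, 128]. A small sketch of the comparison follows; doubleToByte is repeated so that the block is self-contained, and floorToByte is a name invented for the reference version.

```haskell
import Data.Word (Word8)

-- Same definition as above, repeated so this block compiles on its own.
doubleToByte :: Double -> Word8
doubleToByte f = fromIntegral (truncate f :: Int)

-- Reference version using floor, for comparison only.
floorToByte :: Double -> Word8
floorToByte f = fromIntegral (floor f :: Int)

-- For f >= 0 both give the same byte. For negative non-integers they differ:
-- doubleToByte (-0.5) == 0, while floorToByte (-0.5) == 255 (-1 wraps to 255).
```

If the noise function could ever return values below -1, the two versions would produce different pixels, so the substitution relies on that range assumption.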