Why does -O2 have such a big impact on a simple L1 distance calculator in Haskell?

performance haskell optimization nearest-neighbor stream-fusion

I have implemented a simple L1 distance calculator in Haskell. Since I care about performance, I use unboxed vectors to store the images to compare.

calculateL1Distance :: LabeledImage -> LabeledImage -> Int
calculateL1Distance reference test = 
            let
              substractPixels :: Int -> Int -> Int
              substractPixels a b = abs $ a - b
              diff f = Vec.sum $ Vec.zipWith substractPixels (f reference) (f test)
            in
              diff pixels

As far as I know (I am new to Haskell), stream fusion should make this code run as a simple loop, so it should be fast. However, when compiled with

ghc -O -fforce-recomp -rtsopts -o test .\performance.hs

the program takes about 60 seconds:

 198,871,911,896 bytes allocated in the heap
   1,804,017,536 bytes copied during GC
     254,900,000 bytes maximum residency (14 sample(s))
       9,020,888 bytes maximum slop
             579 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     378010 colls,     0 par    2.312s   2.949s     0.0000s    0.0063s
  Gen  1        14 colls,     0 par    0.562s   0.755s     0.0539s    0.2118s

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time   58.297s  ( 64.380s elapsed)
  GC      time    2.875s  (  3.704s elapsed)
  EXIT    time    0.016s  (  0.088s elapsed)
  Total   time   61.188s  ( 68.176s elapsed)

  %GC     time       4.7%  (5.4% elapsed)

  Alloc rate    3,411,364,878 bytes per MUT second

  Productivity  95.3% of total user, 94.6% of total elapsed

However, performance improves significantly when compiling with

ghc -O2 -fforce-recomp -rtsopts -o test .\performance.hs

The run time drops to about 13 seconds:

   2,261,672,056 bytes allocated in the heap
   1,571,668,904 bytes copied during GC
     241,064,192 bytes maximum residency (12 sample(s))
       8,839,048 bytes maximum slop
             544 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2951 colls,     0 par    1.828s   1.927s     0.0007s    0.0059s
  Gen  1        12 colls,     0 par    0.516s   0.688s     0.0573s    0.2019s

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time   10.484s  ( 16.598s elapsed)
  GC      time    2.344s  (  2.615s elapsed)
  EXIT    time    0.000s  (  0.105s elapsed)
  Total   time   12.828s  ( 19.324s elapsed)

  %GC     time      18.3%  (13.5% elapsed)

  Alloc rate    215,718,348 bytes per MUT second

  Productivity  81.7% of total user, 86.4% of total elapsed

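The effect gets even stronger when using a larger part of the image set, because image loading then accounts for a smaller share of the run time. According to the HaskellWiki, there is hardly any practical difference between -O and -O2. Yet I observe a huge effect, and I wonder whether I am missing something. Is there an optimization that the compiler (GHC) performs at -O2 that I should be applying to my code myself? If so, what does it do? From what I have read, the main performance improvements come from stream fusion, and to me the function looks like one where stream fusion can be applied.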
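To make the "simple loop" expectation concrete, here is a minimal sketch of what a fully fused version should amount to: a strict accumulator loop with no intermediate vector. The names `l1Fused` and `l1Loop` are hypothetical and not part of the original program, and the sketch assumes the `vector` package is installed. It illustrates the goal of fusion; it does not claim to show the code GHC actually generates.

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.Vector.Unboxed as Vec

-- Combinator version, as in the question. It relies on stream fusion
-- to avoid materialising the zipped vector.
l1Fused :: Vec.Vector Int -> Vec.Vector Int -> Int
l1Fused xs ys = Vec.sum (Vec.zipWith (\a b -> abs (a - b)) xs ys)

-- Hand-written loop (hypothetical) that the fused version should
-- ideally be equivalent to: one pass, one strict accumulator.
l1Loop :: Vec.Vector Int -> Vec.Vector Int -> Int
l1Loop xs ys = go 0 0
  where
    -- zipWith stops at the shorter input, so the loop does the same.
    n = min (Vec.length xs) (Vec.length ys)
    go !acc !i
      | i >= n    = acc
      | otherwise =
          go (acc + abs (Vec.unsafeIndex xs i - Vec.unsafeIndex ys i)) (i + 1)

main :: IO ()
main = do
  let xs = Vec.fromList [10, 20, 30, 40]
      ys = Vec.fromList [13, 18, 30, 45]
  print (l1Fused xs ys)  -- 3 + 2 + 0 + 5
  print (l1Loop xs ys)   -- same sum, computed by the explicit loop
```

If both versions run at similar speed under a given optimization level, fusion is probably doing its job there; a large gap would suggest that some intermediate structure survives.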

For reference, here is the complete example of my test program:

import Data.List
import Data.Word
import qualified Data.ByteString as ByteStr
import qualified Data.ByteString.Char8 as ByteStrCh8
import qualified Data.Vector.Unboxed as Vec

data LabeledImage = LabeledImage {
       labelIdx :: Int
     , pixels :: Vec.Vector Int
} deriving (Eq)

extractLabeledImages :: ByteStr.ByteString -> [LabeledImage] -> [LabeledImage]
extractLabeledImages source images
      | ByteStr.length source >= imgLength =
                    let
                      (label,trailData) = ByteStr.splitAt labelBytes source
                      (rgbData,remainingData) = ByteStr.splitAt colorBytes trailData
                      numLabel = fromIntegral (ByteStr.head label)
                      pixelValues = Vec.generate (ByteStr.length rgbData) (fromIntegral . ByteStr.index rgbData)
                    in
                      extractLabeledImages remainingData (images ++ [LabeledImage numLabel pixelValues])
      | otherwise = images
      where
        labelBytes = 1
        colorBytes = 3072
        imgLength = labelBytes + colorBytes

calculateL1Distance :: LabeledImage -> LabeledImage -> Int
calculateL1Distance reference test = 
            let
              substractPixels :: Int -> Int -> Int
              substractPixels a b = abs $ a - b
              diff f = Vec.sum $ Vec.zipWith substractPixels (f reference) (f test)
            in
              diff pixels

main = do
       batch1Raw <- ByteStr.readFile "M:\\Documents\\StanfordCNN\\cifar10\\data_batch_1.bin"
       testBatchRaw <- ByteStr.readFile "M:\\Documents\\StanfordCNN\\cifar10\\test_batch.bin"

       let referenceImages = take 1000 $ extractLabeledImages batch1Raw []
       let testImages = take 1000 $ extractLabeledImages testBatchRaw []

       putStrLn "Created image sets. Starting tests."
       let results = [calculateL1Distance referenceImage testImage | referenceImage <- referenceImages, testImage <- testImages ]
       ByteStr.writeFile "M:\\Documents\\StanfordCNN\\results.txt" (ByteStrCh8.pack $ show results)

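If you run the -O version after running the -O2 version, does the performance stay fast? You may be seeing the difference between loading the files from disk and retrieving them from the operating system's page cache. I doubt that is the difference, though, since the allocation rate and the bytes allocated differ. The -O2 version is probably allocating fewer intermediate objects. If you want to see the difference, your best bet is to look at the compiled Core (described in the linked article).

@Cirdec The difference is due to -fspec-constr. Without the actual data to reproduce this it is hard to generate profiles, but the vector code that ends up being inlined needs -fspec-constr to get the last bit of performance. Also, OP, depending on your use case, you will want to {-# INLINE #-} your functions. Note that this already happens for calculateL1Distance with -O; it is only a remark for when you start splitting these functions into other modules.

@Zeta: I read up on -fspec-constr. This optimization absolutely makes sense. Thank you very much. Interesting read! As for the data, it can be found here: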
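The two suggestions from the comments can be sketched as follows. This is a minimal, self-contained example and not the original program: the sample data in `main` is made up, and the file is assumed to be compiled with plain `-O` so that the `OPTIONS_GHC` pragma adds only the SpecConstr pass that `-O2` would otherwise switch on. Whether this alone reproduces the full -O2 speedup on the real data set has not been measured here.

```haskell
{-# OPTIONS_GHC -fspec-constr #-}
-- Assumption: built with plain -O; the pragma above enables only the
-- SpecConstr pass for this module.

import qualified Data.Vector.Unboxed as Vec

data LabeledImage = LabeledImage
  { labelIdx :: Int
  , pixels   :: Vec.Vector Int
  }

-- The INLINE pragma matters mostly once this function lives in a
-- different module from its callers, as noted in the comments.
{-# INLINE calculateL1Distance #-}
calculateL1Distance :: LabeledImage -> LabeledImage -> Int
calculateL1Distance reference test =
  Vec.sum (Vec.zipWith (\a b -> abs (a - b)) (pixels reference) (pixels test))

main :: IO ()
main = do
  -- Made-up sample images, for illustration only.
  let a = LabeledImage 0 (Vec.fromList [1, 2, 3])
      b = LabeledImage 1 (Vec.fromList [4, 0, 3])
  print (calculateL1Distance a b)  -- 3 + 2 + 0
```

To compare what the optimizer produces in each configuration, the generated Core can be printed by adding `-ddump-simpl` to the `ghc` command line.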