Performance: optimizing a simple parser that is called many times


performance haskell


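I wrote a parser for a custom file format using attoparsec. The profiling report indicates that about 67% of the memory allocation happens in a function named tab, which also consumes the most time. The tab function is very simple: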

tab :: Parser Char
tab = char '\t'

The full profiling report looks like this:

       ASnapshotParser +RTS -p -h -RTS

    total time  =       37.88 secs   (37882 ticks @ 1000 us, 1 processor)
    total alloc = 54,255,105,384 bytes  (excludes profiling overheads)

COST CENTRE    MODULE                %time %alloc

tab            Main                   83.1   67.7
main           Main                    6.4    4.2
readTextDevice Data.Text.IO.Internal   5.5   24.0
snapshotParser Main                    4.7    4.0


                                                             individual     inherited
COST CENTRE        MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN               MAIN                     75           0    0.0    0.0   100.0  100.0
 CAF               Main                    149           0    0.0    0.0   100.0  100.0
  tab              Main                    156           1    0.0    0.0     0.0    0.0
  snapshotParser   Main                    153           1    0.0    0.0     0.0    0.0
  main             Main                    150           1    6.4    4.2   100.0  100.0
   doStuff         Main                    152     1000398    0.3    0.0    88.1   71.8
    snapshotParser Main                    154           0    4.7    4.0    87.7   71.7
     tab           Main                    157           0   83.1   67.7    83.1   67.7
   readTextDevice  Data.Text.IO.Internal   151       40145    5.5   24.0     5.5   24.0
 CAF               Data.Text.Array         142           0    0.0    0.0     0.0    0.0
 CAF               Data.Text.Internal      140           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Handle.FD        122           0    0.0    0.0     0.0    0.0
 CAF               GHC.Conc.Signal         103           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Encoding         101           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.FD               100           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Encoding.Iconv    89           0    0.0    0.0     0.0    0.0
  main             Main                    155           0    0.0    0.0     0.0    0.0

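How can I optimize it?

The file I am parsing is about 77MB.

After updating attoparsec to the latest version, the execution time dropped from 38 seconds to 16 seconds. That is a speedup of more than 50%, and memory consumption also dropped considerably. As @JohnL pointed out, the results differ greatly when profiling is enabled. When I tried to profile the program with the latest version of the attoparsec library, the whole run took about 64 seconds.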

tab is a scapegoat. If you define boo :: Parser (); boo = return () and insert a boo before every bind in the snapshotParser definition, the cost allocation will be something like:

 main             Main                    255           0   11.8   13.8   100.0  100.0
  doStuff         Main                    258     2097153    1.1    0.5    86.2   86.2
   snapshotParser Main                    260           0    0.4    0.1    85.1   85.7
    boo           Main                    262           0   71.0   73.2    84.8   85.5
     tab          Main                    265           0   13.8   12.3    13.8   12.3

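So, as John L suggested in the comments, the profiler appears to be shifting the blame for the allocation of the parse results, probably because of the extensive inlining of attoparsec code.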
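As a rough illustration of the boo experiment described above, here is a minimal sketch. It assumes a hypothetical three-field rowParser standing in for the sixteen-field snapshotParser, and assumes the program is built with profiling and automatic cost centres (for example `-prof -fprof-auto`), so that tab, boo and rowParser each get a cost centre of their own.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (Parser, char, decimal, double, parseOnly)

tab :: Parser Char
tab = char '\t'

-- A parser that consumes nothing and produces nothing.
boo :: Parser ()
boo = return ()

-- Hypothetical cut-down record parser: a boo is placed before every bind,
-- so whatever cost the profiler used to pin on tab now has another
-- candidate to land on.
rowParser :: Parser (Int, Int, Double)
rowParser = do
  boo
  a <- decimal
  boo
  _ <- tab
  boo
  b <- decimal
  boo
  _ <- tab
  boo
  c <- double
  return (a, b, c)

main :: IO ()
main = print (parseOnly rowParser "1\t2\t3.5")
```

If a parser that does no work ends up with most of the time and allocation in the report, that suggests the attribution is an artifact of how costs are assigned after inlining rather than a real cost of tab.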

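As for the performance problem, the key point is that when you parse a 77MB text file to build a list of one million elements, you want the file processing to be lazy, not strict. Once that is sorted out, decoupling the I/O from the parsing in doStuff and building the list of snapshots without an accumulator also helps. Here is a modified version of the program that takes this into account: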

{-# LANGUAGE BangPatterns #-}
module Main where

import Data.Maybe
import Data.Attoparsec.Text.Lazy
import Control.Applicative
import qualified Data.Text.Lazy.IO as TL
import Data.Text (Text)
import qualified Data.Text.Lazy as TL

buildStuff :: TL.Text -> [Snapshot]
buildStuff text = case maybeResult (parse endOfInput text) of
  Just _ -> []
  Nothing -> case parse snapshotParser text of
      Done !i !r -> r : buildStuff i
      Fail _ _ _ -> []

main :: IO ()
main = do
  text <- TL.readFile "./snap.dat"
  let ss = buildStuff text
  print $ listToMaybe ss
    >> Just (fromIntegral (length $ show ss) / fromIntegral (length ss))

newtype VehicleId = VehicleId Int deriving Show
newtype Time = Time Int deriving Show
newtype LinkID = LinkID Int deriving Show
newtype NodeID = NodeID Int deriving Show
newtype LaneID = LaneID Int deriving Show

tab :: Parser Char
tab = char '\t'

-- UNPACK pragmas. GHC 7.8 unboxes small strict fields automatically;
-- however, it seems we still need the pragmas while profiling. 
data Snapshot = Snapshot {
  vehicle :: {-# UNPACK #-} !VehicleId,
  time :: {-# UNPACK #-} !Time,
  link :: {-# UNPACK #-} !LinkID,
  node :: {-# UNPACK #-} !NodeID,
  lane :: {-# UNPACK #-} !LaneID,
  distance :: {-# UNPACK #-} !Double,
  velocity :: {-# UNPACK #-} !Double,
  vehtype :: {-# UNPACK #-} !Int,
  acceler :: {-# UNPACK #-} !Double,
  driver :: {-# UNPACK #-} !Int,
  passengers :: {-# UNPACK #-} !Int,
  easting :: {-# UNPACK #-} !Double,
  northing :: {-# UNPACK #-} !Double,
  elevation :: {-# UNPACK #-} !Double,
  azimuth :: {-# UNPACK #-} !Double,
  user :: {-# UNPACK #-} !Int
  } deriving (Show)

-- No need for bang patterns here.
snapshotParser :: Parser Snapshot
snapshotParser = do
  sveh <- decimal
  tab
  stime <- decimal
  tab
  slink <- decimal
  tab
  snode <- decimal
  tab
  slane <- decimal
  tab
  sdistance <- double
  tab
  svelocity <- double
  tab
  svehtype <- decimal
  tab
  sacceler <- double
  tab
  sdriver <- decimal
  tab
  spassengers <- decimal
  tab
  seasting <- double
  tab
  snorthing <- double
  tab
  selevation <- double
  tab
  sazimuth <- double
  tab
  suser <- decimal
  endOfLine <|> endOfInput
  return $ Snapshot
    (VehicleId sveh) (Time stime) (LinkID slink) (NodeID snode)
    (LaneID slane) sdistance svelocity svehtype sacceler sdriver
    spassengers seasting snorthing selevation sazimuth suser
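To make the "no accumulator" point concrete, here is a small sketch that uses only base. parseOne is a hypothetical stand-in for running a parser once on the remaining input (it reads one Int per line); it is not part of attoparsec. lazyBuild has the same shape as buildStuff above, while accBuild shows the accumulator style it replaces.

```haskell
-- Hypothetical stand-in for one parser run: consume one line and
-- return the parsed value together with the remaining input.
parseOne :: String -> Maybe (Int, String)
parseOne "" = Nothing
parseOne s = case reads line of
    [(n, "")] -> Just (n, drop 1 rest)
    _         -> Nothing
  where
    (line, rest) = break (== '\n') s

-- Guarded recursion: each element is handed to the consumer before the
-- rest of the input is parsed, so the list can be consumed lazily.
lazyBuild :: String -> [Int]
lazyBuild s = case parseOne s of
  Just (x, rest) -> x : lazyBuild rest
  Nothing        -> []

-- Accumulator style: nothing is available until the whole input has
-- been consumed, and the full list is held in memory.
accBuild :: String -> [Int]
accBuild = go []
  where
    go acc s = case parseOne s of
      Just (x, rest) -> go (x : acc) rest
      Nothing        -> reverse acc

main :: IO ()
main = do
  let endless = unlines (map show [1 :: Int ..])
  print (take 3 (lazyBuild endless))  -- works even on unbounded input
  print (accBuild "1\n2\n3\n")
```

Calling accBuild on the unbounded input would never return, which is the same reason an accumulator-based loop over a 77MB file has to keep everything in memory before the first result can be used.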

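You do have a lot of calls to tab in your code. Do failures to parse a record in the file happen often? You might be better off splitting each line into a list of Strings and then parsing each element into the corresponding field. That way all the tabs are handled up front. You could also consider looking for an existing CSV parser (there is probably one that supports specifying the delimiter), which is likely to be better optimized for a task like this.

The file's format matches exactly what the parser defines. I will use a CSV library and update the results here. However, I still feel that the memory consumption and the time taken are too high.

I have written a lot of parsers, but not in Haskell. Are you using recursive descent? Normally a recursive-descent parser should be IO-bound. To see whether it is, or why not, I use [...], in which very little is needed.

I wonder how reliable profiling results are for attoparsec parsers. Almost everything is inlined, and many optimizations do not happen when profiling is enabled. Does the profiled run take much longer than the unprofiled run?

@JohnL Yes, you are right. With the latest version it seems to matter a lot. But with the older version of attoparsec that I used originally, it did not seem to make much difference.

Thanks. What do you mean by "this version is as lazy as possible, as you can see by removing the division in main, or replacing it with last ss"? Did you remove the bang patterns in the parser because the record data declaration is unpacked and strict, or for some other reason?

By "as lazy as possible" I mean that, for example, if you only ask for the last element, it takes a small, constant amount of memory. As for the bang patterns in the parser, I removed them because they made performance worse. The strictness of the fields does make them redundant.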
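The first comment's suggestion of splitting each line on tabs and then parsing each field separately could look roughly like the following. This is only a sketch using base: Row is a hypothetical three-field stand-in for Snapshot, and String is used for brevity. For a 77MB file, the same idea would more plausibly be applied to Text (for example with Data.Text.splitOn).

```haskell
import Text.Read (readMaybe)

-- Split a line into fields on tab characters.
splitOnTab :: String -> [String]
splitOnTab s = case break (== '\t') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOnTab rest

-- Hypothetical cut-down record; the real Snapshot has sixteen fields.
data Row = Row
  { rowVehicle  :: Int
  , rowTime     :: Int
  , rowDistance :: Double
  } deriving (Show, Eq)

-- Parse one line: the tabs are dealt with once, then each field is read
-- on its own. Any malformed or missing field yields Nothing.
parseRow :: String -> Maybe Row
parseRow line = case splitOnTab line of
  [v, t, d] -> Row <$> readMaybe v <*> readMaybe t <*> readMaybe d
  _         -> Nothing

main :: IO ()
main = mapM_ (print . parseRow) ["7\t120\t3.5", "7\t120"]
```

Whether this is actually faster than the attoparsec version would have to be measured, ideally without profiling enabled, given how much profiling distorts the numbers here.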