Performance: How do I parse a large block of data into memory in Haskell?


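After some reflection, the whole problem boils down to something much more concise. I am looking for a Haskell data structure that

  • looks like a list
  • has O(1) lookup
  • has either O(1) element replacement or O(1) element append (or prepend; in that case I can reverse my index lookups). I can always write my later algorithms with one or the other in mind
  • has very little memory overhead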
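
One candidate that appears to meet the requirements listed above is a mutable unboxed vector from the `vector` package, frozen into an immutable one when construction is finished. The following is only a minimal sketch; the helper name `channelWithPixel` is hypothetical and not part of the original post.

```haskell
import Control.Monad.ST (runST)
import Data.Word (Word8)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

-- Hypothetical helper (not from the original post): build an n-element
-- channel, overwrite one slot in place, and return an immutable vector.
channelWithPixel :: Int -> Int -> Word8 -> V.Vector Word8
channelWithPixel n i x = runST $ do
    mv <- MV.replicate n 0   -- a single allocation, filled with zeros
    MV.write mv i x          -- O(1) in-place replacement (bounds-checked)
    V.freeze mv              -- one copy into an immutable vector
```

Lookup on the frozen result with `V.!` is O(1), and an unboxed `Vector Word8` stores one byte per element plus a small constant header.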

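I am trying to build an image file parser. The file format is the basic 8-bit color PPM file, though I intend to support 16-bit color files as well as PNG and JPEG files. The existing Netpbm library, despite a lot of unboxing annotations, actually consumes all available memory when trying to load the files I work with:

3-10 photographs, the smallest being 45MB and the largest 110MB.

Now, I can't understand the optimizations in the Netpbm code, so I decided to have a try at it myself. It's a simple file format.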
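
To illustrate how small the format is, here is a rough sketch of a header parser using attoparsec. It assumes binary ("P6") files with no `#` comment lines in the header; the `Header` record and the name `ppmHeader` are hypothetical and do not come from the original code.

```haskell
import Data.Attoparsec.ByteString.Char8
    (Parser, decimal, parseOnly, skipSpace, space, string)
import qualified Data.ByteString.Char8 as BC

-- Hypothetical record for illustration only.
data Header = Header { hWidth :: Int, hHeight :: Int, hMaxVal :: Int }
    deriving (Show, Eq)

-- Parses "P6 <width> <height> <maxval>" followed by exactly one
-- whitespace byte; the raster data starts right after that byte.
-- Assumption: header comments ('#' lines) are not handled here.
ppmHeader :: Parser Header
ppmHeader = do
    _ <- string (BC.pack "P6")
    skipSpace
    w <- decimal
    skipSpace
    h <- decimal
    skipSpace
    m <- decimal
    _ <- space
    return (Header w h m)
```

For example, `parseOnly ppmHeader` applied to a file's contents yields either an error message or the header, leaving the pixel bytes unconsumed.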

I started by deciding that, whatever the file format, I would store the final uncompressed image in this form:

import Data.Vector.Unboxed (Vector)
import Data.Word (Word8)
data PixelMap = RGB8 {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    }
I then wrote a parser that works on the three vectors like so:

{-# LANGUAGE BangPatterns, RecordWildCards #-}
import Data.Attoparsec.ByteString
import qualified Data.Vector.Unboxed as V
data Progress = Progress {
      addr      :: Int
    , size      :: Int
    , redC      :: Vector Word8
    , greenC    :: Vector Word8
    , blueC     :: Vector Word8
    }

parseColorBinary :: Progress -> Parser Progress
parseColorBinary progress@Progress{..}
    | addr == size = return progress
    | addr < size = do
        !redV <- anyWord8
        !greenV <- anyWord8
        !blueV <- anyWord8
        parseColorBinary progress { addr    = addr + 1
                                  , redC    = redC V.// [(addr, redV)]
                                  , greenC  = greenC V.// [(addr, greenV)]
                                  , blueC   = blueC V.// [(addr, blueV)] }

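I first thought that simply reading the whole chunk of bytestring and then unzipping the contents into unboxed vectors would be good enough. Indeed, the parsing code you posted would be rather bad even without the mysterious space leak: you copy the entirety of all three vectors on every single byte of the input! Talk about quadratic complexity.

So I wrote the following: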

chunksOf3 :: [a] -> [(a, a, a)]
chunksOf3 (a:b:c:xs) = (a, b, c) : chunksOf3 xs
chunksOf3 _          = []

parseRGB :: Int -> Atto.Parser (Vector Word8, Vector Word8, Vector Word8)
parseRGB size = do
    input <- Atto.take (size * 3)
    let (rs, gs, bs) = unzip3 $ chunksOf3 $ B.unpack input
    return (V.fromList rs, V.fromList gs, V.fromList bs)
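
The version above goes through `B.unpack`, `chunksOf3` and `unzip3`, so it may build intermediate lists and tuples for the entire input. As an alternative sketch that avoids lists altogether, each channel can be generated by indexing the `ByteString` directly. The name `splitRGB` is hypothetical, and the function works on an already-read `ByteString` rather than inside an attoparsec parser.

```haskell
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V

-- Hypothetical variant of parseRGB (not from the original post).
-- Returns Nothing when the input holds fewer than 3 * size bytes.
splitRGB
    :: Int
    -> B.ByteString
    -> Maybe (V.Vector Word8, V.Vector Word8, V.Vector Word8)
splitRGB size input
    | size < 0 || 3 * size > B.length input = Nothing
    | otherwise = Just (channel 0, channel 1, channel 2)
  where
    -- Byte k of pixel i sits at offset 3 * i + k in the interleaved input.
    channel k = V.generate size (\i -> B.index input (3 * i + k))
```

This makes three passes over the input, one per channel, but each pass writes straight into an unboxed vector of the final size.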
Here is a version that parses the file straight from disk without loading any intermediate data into memory:

import Control.Applicative
import Control.Monad (void)
import Data.Attoparsec.ByteString (anyWord8)
import Data.Attoparsec.ByteString.Char8 (decimal)
import qualified Data.Attoparsec.ByteString as Attoparsec
import Data.ByteString (ByteString)
import Data.Vector.Unboxed (Vector)
import Data.Word (Word8)
import Control.Foldl (FoldM(..), impurely, vector, premapM) -- Uses `foldl-1.0.3`
import qualified Pipes.ByteString
import Pipes.Parse
import Pipes.Attoparsec (parse, parsed)
import qualified System.IO as IO

data PixelMap = PixelMap {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    } deriving (Show)

-- Fold three vectors simultaneously, ensuring strictness and efficiency
rgbVectors
    :: FoldM IO (Word8, Word8, Word8) (Vector Word8, Vector Word8, Vector Word8)
rgbVectors =
    (,,) <$> premapM _1 vector <*> premapM _2 vector <*> premapM _3 vector
  where
    _1 (a, b, c) = a
    _2 (a, b, c) = b
    _3 (a, b, c) = c

triples
    :: Monad m
    => Producer ByteString m r
    -> Producer (Word8, Word8, Word8) m ()
triples p = void $ parsed ((,,) <$> anyWord8 <*> anyWord8 <*> anyWord8) p

-- I will probably ask Renzo to simplify the error handling for `parse`
-- This is a helper function to just return `Nothing`
parse'
    :: Monad m
    => Attoparsec.Parser r -> Parser ByteString m (Maybe r)
parse' parser = do
    x <- parse parser
    return $ case x of
        Just (Right r) -> Just r
        _              -> Nothing

parsePixelMap :: Producer ByteString IO r -> IO (Maybe PixelMap)
parsePixelMap p = do
    let parseWH = do
            mw <- parse' decimal
            mh <- parse' decimal
            return ((,) <$> mw <*> mh)
    (x, p') <- runStateT parseWH p
    case x of
        Nothing     -> return Nothing
        Just (w, h) -> do
            let size = w * h
                parser = impurely foldAllM rgbVectors
                source = triples (p' >-> Pipes.ByteString.take size)
            (rs, gs, bs) <- evalStateT parser source
            return $ Just (PixelMap w h rs gs bs)

main = IO.withFile "image.ppm" IO.ReadMode $ \handle -> do
    pixelMap <- parsePixelMap (Pipes.ByteString.fromHandle handle)
    print pixelMap

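Have you considered using pixelMap? If the operations you want to perform on the image are not IO-intensive, it might be worth using. Another option is to use one of the libraries specialized for large arrays, such as … or even …. They are written with high performance in mind, so they should also include a number of memory-efficiency optimizations.

What are the pixel dimensions of the 45MB image?

Your bitbucket package depends on another package called "alyra common", but the current versions do not match (>0.2.1 is required, but we have 0.2.0). Could you update your bitbucket?

@Mau No, not that big. 4767*3195*3 comes to 45MB. I'm pretty sure that in one instance the extra memory is the overhead of modifying a pure data structure, so all the instances of the constructor recur over and over. The memory-copying GC behaviour is mind-boggling.

I started with a vector because I assumed that structural sharing and all that would prevent copy-on-update operations. Technically, though, that makes no sense. All the memory overhead I get from plain lists is probably what is needed to allow structural sharing.

It doesn't seem to matter how many libraries I learn to use. Every time I get into a new one, I feel like a Haskell newbie all over again. Finally, why did you use the unsafe operations instead of the normal ones?

@SavanniD'Gerinel It's only because I have already established the correctness of the bounds through initialization, so I might as well skip the bounds checks. In most programming languages I would consider widespread use of unsafe indexing pretty bad, but in Haskell ST code it is very rare, and when it is used we really do care about speed, so I might as well go all out in those cases. Note that unsafeFreeze is even more justifiable, since we don't actually do anything with the vectors at that point apart from returning them.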
import Data.Vector.Unboxed (Vector)
import Data.ByteString (ByteString)

import qualified Data.Vector.Unboxed as V
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed.Mutable as MV

import Control.Monad.ST.Strict 
import Data.Word
import Control.Monad
import Control.DeepSeq

-- benchmarking stuff
import Criterion.Main (defaultMainWith, bench, whnfIO)
import Criterion.Config (defaultConfig, Config(..), ljust)

-- This is just the part that parses the three vectors for the colors.
-- Of course, you can embed this into an Attoparsec computation by taking 
-- the current input, feeding it to parseRGB, or you can just take the right 
-- sized chunk in the parser and omit the "Maybe" test from the code below. 
parseRGB :: Int -> ByteString -> Maybe (Vector Word8, Vector Word8, Vector Word8)
parseRGB size input 
    | 3* size > B.length input = Nothing
    | otherwise = Just $ runST $ do

        -- We are allocating three mutable vectors of size "size"
        -- This is usually a bit of pain for new users, because we have to
        -- specify the correct type somewhere, and it's not an exactly simple type.
        -- In the ST monad there is always an "s" type parameter that labels the
        -- state of the action. A type of "ST s something" is a bit similar to
        -- "IO something", except that the inner type often also contains "s" as
        -- parameter. The purpose of that "s" is to statically disallow mutable
        -- variables from escaping the ST action. 
        [r, g, b] <- replicateM 3 $ MV.new size :: ST s [MV.MVector s Word8]

        -- forM_ = flip mapM_
        -- In ST code forM_ is a nicer looking approximation of the usual
        -- imperative loop. 
        forM_ [0..size - 1] $ \i -> do
            let i' = 3 * i
            MV.unsafeWrite r i (B.index input $ i'    )
            MV.unsafeWrite g i (B.index input $ i' + 1)
            MV.unsafeWrite b i (B.index input $ i' + 2)

        -- freeze converts a mutable vector living in the ST monad into 
        -- a regular vector, which can be then returned from the action
        -- since its type no longer depends on that pesky "s".
        -- unsafeFreeze does the conversion in place without copying.
        -- This implies that the original mutable vector should not be used after
        -- unsafeFreezing. 
        [r, g, b] <- mapM V.unsafeFreeze [r, g, b]
        return (r, g, b)

-- I prepared a file with 3 * 15 million random bytes.
inputSize = 15000000
benchConf = defaultConfig {cfgSamples = ljust 10}

main = do
    defaultMainWith benchConf (return ()) $ [
        bench "parseRGB test" $ whnfIO $ do 
            input <- B.readFile "randomInp.dat" 
            force (parseRGB inputSize input) `seq` putStrLn "done"
        ]
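
On the question of unsafe versus normal operations raised in the comments: for comparison, here is a sketch of the same fill loop written with the bounds-checked `MV.write` and the copying `V.freeze`. The name `parseRGBChecked` is hypothetical. It should return the same vectors as the version above, at the cost of an index check per write and one extra copy per channel.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (runST)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

-- Hypothetical name; intended to have the same contract as parseRGB above.
parseRGBChecked
    :: Int
    -> B.ByteString
    -> Maybe (V.Vector Word8, V.Vector Word8, V.Vector Word8)
parseRGBChecked size input
    | size < 0 || 3 * size > B.length input = Nothing
    | otherwise = Just $ runST $ do
        r <- MV.new size
        g <- MV.new size
        b <- MV.new size
        forM_ [0 .. size - 1] $ \i -> do
            let i' = 3 * i
            MV.write r i (B.index input i')        -- bounds-checked write
            MV.write g i (B.index input (i' + 1))
            MV.write b i (B.index input (i' + 2))
        -- freeze copies each vector, so the mutable originals would
        -- remain safe to use afterwards (they are simply dropped here).
        (,,) <$> V.freeze r <*> V.freeze g <*> V.freeze b
```

Whether the unchecked variant is measurably faster depends on the workload, so it is worth benchmarking both before committing to the unsafe operations.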