Performance: How do I parse a large chunk of data into memory in Haskell?

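After some thought, the whole question boils down to something much more concise. I am looking for a Haskell data structure that


  • looks like a list
  • has O(1) lookup
  • has either O(1) element replacement or O(1) element append (or prepend; in that case I can reverse my index lookups). I can always write my later algorithms with one or the other in mind
  • has very little memory overhead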
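One structure that seems to match the requirements above is an unboxed mutable vector from the `vector` package, filled inside `ST` and frozen once at the end. The following is only a minimal sketch under that assumption; the helper name `fromUpdates` is hypothetical and not part of any library. Each `MV.write` is an O(1) in-place replacement, each `V.!` is an O(1) lookup, and a `Vector Word8` stores one byte per element.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV
import Data.Word (Word8)

-- | Hypothetical helper: start from n zero bytes and overwrite the given
-- (index, value) pairs in place. Later pairs win over earlier ones.
fromUpdates :: Int -> [(Int, Word8)] -> V.Vector Word8
fromUpdates n updates = runST $ do
    mv <- MV.replicate n 0
    -- O(1) per replacement; nothing is copied here
    forM_ updates $ \(i, x) -> MV.write mv i x
    -- Safe because mv is never touched again after freezing
    V.unsafeFreeze mv
```

Once frozen, the result behaves like any immutable vector, so `fromUpdates 4 [(1, 10), (3, 30)] V.! 3` is an ordinary constant-time index.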





import Data.Vector.Unboxed (Vector)
import Data.Word (Word8)

data PixelMap = RGB8 {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    }

import Data.Attoparsec.ByteString
import qualified Data.Vector.Unboxed as V

data Progress = Progress {
      addr      :: Int
    , size      :: Int
    , redC      :: Vector Word8
    , greenC    :: Vector Word8
    , blueC     :: Vector Word8
    }

-- Requires the RecordWildCards and BangPatterns extensions.
-- Note: each V.// below copies the whole vector, so every pixel costs O(size).
parseColorBinary :: Progress -> Parser Progress
parseColorBinary progress@Progress{..}
    | addr == size = return progress
    | addr < size = do
        !redV <- anyWord8
        !greenV <- anyWord8
        !blueV <- anyWord8
        parseColorBinary progress { addr    = addr + 1
                                  , redC    = redC V.// [(addr, redV)]
                                  , greenC  = greenC V.// [(addr, greenV)]
                                  , blueC   = blueC V.// [(addr, blueV)] }



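import qualified Data.Attoparsec.ByteString as Atto
import qualified Data.ByteString as B
import Data.Vector.Unboxed (Vector)
import qualified Data.Vector.Unboxed as V
import Data.Word (Word8)

-- Group a flat list into triples; a trailing partial group is dropped.
chunksOf3 :: [a] -> [(a, a, a)]
chunksOf3 (a:b:c:xs) = (a, b, c) : chunksOf3 xs
chunksOf3 _          = []

parseRGB :: Int -> Atto.Parser (Vector Word8, Vector Word8, Vector Word8)
parseRGB size = do
    input <- Atto.take (size * 3)
    let (rs, gs, bs) = unzip3 $ chunksOf3 $ B.unpack input
    return (V.fromList rs, V.fromList gs, V.fromList bs)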
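The list-based version above goes through `B.unpack`, `chunksOf3` and `unzip3`, which allocates intermediate lists for every byte. As a hedged alternative sketch (not the original author's code), the three channels can be built straight from the `ByteString` with `V.generate`, indexing every third byte. The name `parseRGBDirect` is made up for this illustration.

```haskell
import qualified Data.Attoparsec.ByteString as Atto
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
import Data.Word (Word8)

-- | Hypothetical variant of parseRGB that avoids intermediate lists.
-- Pixel i occupies bytes 3*i (red), 3*i + 1 (green) and 3*i + 2 (blue).
parseRGBDirect :: Int -> Atto.Parser (V.Vector Word8, V.Vector Word8, V.Vector Word8)
parseRGBDirect size = do
    -- Atto.take fails unless exactly size * 3 bytes are available
    input <- Atto.take (size * 3)
    let channel k = V.generate size (\i -> B.index input (3 * i + k))
        rs = channel 0
        gs = channel 1
        bs = channel 2
    -- Force all three vectors so the input chunk is not retained by thunks
    rs `seq` gs `seq` bs `seq` return (rs, gs, bs)
```

This reads the input three times, once per channel, but each pass is a tight loop over a strict `ByteString`, and the only allocations are the three result vectors.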


import Control.Applicative
import Control.Monad (void)
import Data.Attoparsec.ByteString (anyWord8)
import Data.Attoparsec.ByteString.Char8 (decimal)
import qualified Data.Attoparsec.ByteString as Attoparsec
import Data.ByteString (ByteString)
import Data.Vector.Unboxed (Vector)
import Data.Word (Word8)
import Control.Foldl (FoldM(..), impurely, vector, premapM) -- Uses `foldl-1.0.3`
import qualified Pipes.ByteString
import Pipes.Parse
import Pipes.Attoparsec (parse, parsed)
import qualified System.IO as IO

data PixelMap = PixelMap {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    } deriving (Show)

-- Fold three vectors simultaneously, ensuring strictness and efficiency
rgbVectors
    :: FoldM IO (Word8, Word8, Word8) (Vector Word8, Vector Word8, Vector Word8)
rgbVectors =
    (,,) <$> premapM _1 vector <*> premapM _2 vector <*> premapM _3 vector
  where
    _1 (a, b, c) = a
    _2 (a, b, c) = b
    _3 (a, b, c) = c

triples
    :: Monad m
    => Producer ByteString m r
    -> Producer (Word8, Word8, Word8) m ()
triples p = void $ parsed ((,,) <$> anyWord8 <*> anyWord8 <*> anyWord8) p

-- I will probably ask Renzo to simplify the error handling for `parse`
-- This is a helper function to just return `Nothing`
parse'
    :: Monad m
    => Attoparsec.Parser r -> Parser ByteString m (Maybe r)
parse' parser = do
    x <- parse parser
    return $ case x of
        Just (Right r) -> Just r
        _              -> Nothing

parsePixelMap :: Producer ByteString IO r -> IO (Maybe PixelMap)
parsePixelMap p = do
    let parseWH = do
            mw <- parse' decimal
            mh <- parse' decimal
            return ((,) <$> mw <*> mh)
    (x, p') <- runStateT parseWH p
    case x of
        Nothing     -> return Nothing
        Just (w, h) -> do
            let size = w * h
                parser = impurely foldAllM rgbVectors
                -- Each pixel is three bytes, so let 3 * size bytes through
                source = triples (p' >-> Pipes.ByteString.take (3 * size))
            (rs, gs, bs) <- evalStateT parser source
            return $ Just (PixelMap w h rs gs bs)

main = IO.withFile "image.ppm" IO.ReadMode $ \handle -> do
    pixelMap <- parsePixelMap (Pipes.ByteString.fromHandle handle)
    print pixelMap
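
Have you considered using pixelMap? If the operations you want to perform on the image are not IO-intensive, it may be worth using it. Another option is to use one of the libraries that specialize in large arrays. They are written with high performance as a goal, so they should also include plenty of memory-efficiency optimizations.

What are the pixel dimensions of the 45MB image?

Your bitbucket package depends on another package called "alyra common", but the current version does not match (>0.2.1 is required, but we have 0.2.0). Could you update your bitbucket?

@Mau No, it's not that big. 4767*3195*3 is 45MB. I'm fairly sure that, in one case, the extra memory is the overhead of modifying a pure data structure, so that every instance of the constructor gets recreated over and over. The memory-copying GC behavior is incredible. I started out with a Vector because I assumed that structure sharing and all that would prevent copy-on-update. But technically, that makes no sense. All of the memory overhead I get from a plain list is probably exactly what is needed to allow structure sharing.

It doesn't seem to matter how many libraries I learn to use. Every time I get into a new one, I feel like a Haskell novice all over again.

Finally, why do you use the unsafe operations rather than the normal ones?

@SavanniD'Gerinel That's just because I've already established that the bounds are correct through initialization, so I might as well skip the bounds checks. In most programming languages I'd consider widespread use of unsafe indexing to be very bad, but in Haskell ST code it is very rare, and when it is used we really do care about speed, so in those cases I may as well go all out. Note that,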
