Efficient streaming and manipulation of byte streams in Haskell


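While writing a deserializer for a large encoded binary file, I came across the various Haskell produce-transform-consume libraries. So far I am aware of four streaming libraries:

- conduit: widely used, with very careful resource management
- pipes: similar to `conduit` (there is a good discussion of the differences between `conduit` and `pipes`)
- Data.Binary.Get: offers useful functions such as `getWord32be`, but the streaming example is awkward to use
- io-streams: seems to be the easiest to use

Here is a stripped-down example of where things go wrong when I try to stream `Word32`s with conduit. A slightly more realistic example would first read a `Word32` that determines the blob length and then yield a lazy `ByteString` of that length (which is then deserialized further). But here I only try to extract `Word32`s from a binary file in streaming fashion: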

```haskell
module Main where

-- build-depends: bytestring, conduit, conduit-extra, resourcet, binary

import           Control.Monad.Trans.Resource (MonadResource, runResourceT)
import qualified Data.Binary.Get              as G
import qualified Data.ByteString              as BS
import qualified Data.ByteString.Char8        as C
import qualified Data.ByteString.Lazy         as BL
import           Data.Conduit
import qualified Data.Conduit.Binary          as CB
import qualified Data.Conduit.List            as CL
import           Data.Word                    (Word32)
import           System.Environment           (getArgs)

-- gets a Word32 from a ByteString.
getWord32 :: C.ByteString -> Word32
getWord32 bs = do
    G.runGet G.getWord32be $ BL.fromStrict bs

-- should read ByteStrings and yield Word32s
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = do
    mbs <- await
    case mbs of
        Just bs -> do
            case C.null bs of
                False -> do
                    yield $ getWord32 bs
                    leftover $ BS.drop 4 bs
                    transform
                True -> return ()
        Nothing -> return ()

main :: IO ()
main = do
    filename <- fmap (!!0) getArgs  -- should check length getArgs
    result <- runResourceT $ (CB.sourceFile filename) $$ transform =$ CL.consume
    print $ length result   -- is always 8188 for files larger than 32752 bytes
```

By "bad" I mean: demanding in time and space, and unable to handle decoding exceptions.

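Your immediate problem is caused by how you are using `leftover`. That function is used to "provide a single piece of leftover input to be consumed by the next component in the current monadic binding", so when you hand it `bs` before looping with `transform`, you are effectively throwing away the rest of the bytestring (i.e. whatever comes after `bs`). This matches the count you observe: 8188 words is exactly one 32752-byte chunk, after which the loop sees an empty `ByteString`, takes the `C.null` branch, and stops.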

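A correct solution based on your code would use `Data.Binary.Get` to replace your `yield`/`leftover` combination with something that fully consumes each chunk. A more pragmatic approach, though, is to use the binary-conduit package, which provides this in ready-made form (its source gives a good idea of what a "manual" implementation would look like).

One caveat is that this will throw a parse error if the total number of bytes is not a multiple of 4 (i.e. the last `Word32` is incomplete). In the unlikely case that this is not what you want, a lazy workaround is to simply apply `\bs -> C.take (4 * div (C.length bs) 4) bs` to the input bytestring.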
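As a rough illustration of what "fully consuming each chunk" could look like, here is a minimal sketch that uses only the incremental interface of `Data.Binary.Get` (`runGetIncremental`) on a plain list of strict chunks, without conduit. The name `word32sFromChunks` is made up for this example, and silently dropping an incomplete final word is an assumption about the desired behaviour, not something the answer prescribes.

```haskell
module Main where

import qualified Data.Binary.Get as G
import qualified Data.ByteString as BS
import           Data.Word       (Word32)

-- Hypothetical helper: decode big-endian Word32 values from strict chunks
-- whose boundaries need not be aligned to 4 bytes.
word32sFromChunks :: [BS.ByteString] -> [Word32]
word32sFromChunks = go fresh
  where
    -- A decoder for one word that has not seen any input yet.
    fresh :: G.Decoder Word32
    fresh = G.runGetIncremental G.getWord32be

    go :: G.Decoder Word32 -> [BS.ByteString] -> [Word32]
    -- A word is complete: emit it and push the unconsumed bytes back
    -- in front of the remaining chunks.
    go (G.Done rest _ w) chunks = w : go fresh (rest : chunks)
    -- Only reached at end of input: the final word is incomplete.
    -- Assumption: drop it silently.
    go (G.Fail _ _ _)    _      = []
    -- The decoder wants more input: feed the next non-empty chunk,
    -- or signal end of input when there is none left.
    go (G.Partial k)     chunks =
      case dropWhile BS.null chunks of
        []       -> go (k Nothing) []
        (c : cs) -> go (k (Just c)) cs

main :: IO ()
main = print (word32sFromChunks chunks)  -- prints [1,2]
  where
    chunks = map BS.pack [[0, 0], [0, 1, 0, 0, 0], [2, 9]]
```

Empty chunks are skipped rather than treated as end of input, so an empty remainder between two chunks does not terminate the loop.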
First, we break the incoming undifferentiated byte stream into little 4-byte chunks:

```haskell
chunksOfStrict :: (Monad m) => Int -> Producer ByteString m r -> Producer ByteString m r
chunksOfStrict n = folds mappend mempty id . view (Bytes.chunksOf n)
```

Then we map these to `Word32`s and (here) count them:

```haskell
main :: IO ()
main = do
   filename:_ <- getArgs
   IO.withFile filename IO.ReadMode $ \h -> do
     n <- P.length $ chunksOfStrict 4 (Bytes.fromHandle h) >-> P.map getWord32
     print n
```

The following program will then print the parses for the valid 4-byte sequences:

```haskell
main :: IO ()
main = do
   filename:_ <- getArgs
   IO.withFile filename IO.ReadMode $ \h -> do
     runEffect $ chunksOfStrict 4 (Bytes.fromHandle h)
                 >-> P.map getMaybeWord32
                 >-> P.concat  -- here `concat` eliminates maybes
                 >-> P.print
```

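This is again crude, since it uses `runGet` rather than `runGetOrFail`, but that is easily fixed. The standard pipes procedure would be to stop the stream transformation on a failed parse and return the unparsed byte stream.

If you expect the `Word32`s to stand for large numbers, so that you do not want to accumulate the corresponding byte stream as a lazy bytestring but would rather, say, write each one to a different file without accumulating, the program can easily be changed to do that. This would take a sophisticated use of conduit, but it is the preferred approach with pipes and streaming.

Here is a relatively straightforward solution that I would like to throw into the ring. It is a repeated use of `splitAt`, wrapped in a `State` monad, that offers the same interface as (a subset of) `Data.Binary.Get`. The resulting `[ByteString]` is obtained in `main` by a `whileJust` over `getBlob`.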
```haskell
module Main (main) where

import           Control.Monad.Loops
import           Control.Monad.State
import qualified Data.Binary.Get      as G (getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import           Data.Int             (Int64)
import           Data.Word            (Word32)
import           System.Environment   (getArgs)

-- this is going to mimic the Data.Binary.Get.Get Monad
type Get = State BL.ByteString

getWord32be :: Get (Maybe Word32)
getWord32be = state $ \bs -> do
    let (w, rest) = BL.splitAt 4 bs
    case BL.length w of
        4 -> (Just w', rest) where
            w' = G.runGet G.getWord32be w
        _ -> (Nothing, BL.empty)

getLazyByteString :: Int64 -> Get BL.ByteString
getLazyByteString n = state $ \bs -> BL.splitAt n bs

getBlob :: Get (Maybe BL.ByteString)
getBlob = do
    ml <- getWord32be
    case ml of
        Nothing -> return Nothing
        Just l -> do
            blob <- getLazyByteString (fromIntegral l :: Int64)
            return $ Just blob

runGet :: Get a -> BL.ByteString -> a
runGet g bs = fst $ runState g bs

main :: IO ()
main = do
    fname <- head <$> getArgs
    bs <- BL.readFile fname
    let ls = runGet loop bs where
        loop = whileJust getBlob return
    print $ length ls
```
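For comparison, a similar loop can be sketched with the real `Data.Binary.Get` and `runGetOrFail`, so that a truncated record shows up as a `Left` value instead of quietly yielding a short or missing blob. This is only a sketch: `blobs` is a hypothetical helper name, and the record layout (a big-endian `Word32` length followed by that many bytes) is taken from the question. Note that each blob is fully read before it is returned, so this does not help when a single blob is too large for memory.

```haskell
module Main where

import qualified Data.Binary.Get      as G
import qualified Data.ByteString.Lazy as BL

-- One record: a big-endian Word32 length followed by that many bytes.
getBlob :: G.Get BL.ByteString
getBlob = do
  len <- G.getWord32be
  G.getLazyByteString (fromIntegral len)

-- Hypothetical helper: decode records one after another until the input
-- is exhausted. A truncated final record ends the list with a Left
-- carrying the parser's error message instead of raising an exception.
blobs :: BL.ByteString -> [Either String BL.ByteString]
blobs input
  | BL.null input = []
  | otherwise =
      case G.runGetOrFail getBlob input of
        Left  (_, _, err)     -> [Left err]
        Right (rest, _, blob) -> Right blob : blobs rest

main :: IO ()
main = mapM_ print (blobs input)
  where
    -- Two complete records ("AB" and "C") followed by two stray bytes.
    input = BL.pack [0, 0, 0, 2, 65, 66, 0, 0, 0, 1, 67, 0, 0]
```

The result list is produced lazily, so a consumer such as `length` or `mapM_` can process records one at a time.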

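Isn't the demo program supposed to split the whole input into 4-byte chunks and produce `Word32`s?

Yes. The broader goal is to read `Word32`s and blobs (lazy `ByteString`s) of variable size.

Unrelated, but for simple argument parsing I would write `filename:_ <- getArgs`.

@danidiaz Indeed. Or `head <$> getArgs`.

Oh, I see, "...in the current monadic binding", that explains it. Thanks a lot. I plugged in the conversion as you suggested and it works, but it consumes a lot of memory (about 500 MB for a 35 MB file). The laziness seems to vanish when using `Data.Conduit.List`.

@mcmayer That is pretty much it. The rule of thumb is that, when using a streaming library, you should not collect the output in a list (as with `outputToList` in the bad io-streams example you added in your edit), because you really do not want that to happen if there is a lot of output. Instead, you should use a proper stream consumer (a "sink", in conduit parlance). Michael's answer illustrates the point using pipes (in his demo, the consumer is `P.print`); a conduit solution would be analogous.

Right, this is conceptually very close to using pipes-parse, with `StateT (Producer ByteString m x) m r` consuming the byte stream (Producer ByteStri…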
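To make the rule of thumb from the comments concrete without pulling in a streaming library, the following sketch consumes a lazily produced list of `Word32`s with a strict left fold instead of materialising the whole list first. It is an analogy in plain Haskell, not conduit or pipes code; `word32s` and `summarize` are names invented here.

```haskell
module Main where

import qualified Data.Binary.Get      as G
import qualified Data.ByteString.Lazy as BL
import           Data.List            (foldl')
import           Data.Word            (Word32, Word64)

-- Lazily split the input into complete 4-byte big-endian words;
-- an incomplete trailing word is dropped.
word32s :: BL.ByteString -> [Word32]
word32s bs
  | BL.length w < 4 = []
  | otherwise       = G.runGet G.getWord32be w : word32s rest
  where
    (w, rest) = BL.splitAt 4 bs

-- Count the words and add them up in a single pass. Both accumulators
-- are forced at every step, so no chain of thunks builds up and the
-- list cells can be collected as soon as they have been consumed.
summarize :: [Word32] -> (Int, Word64)
summarize = foldl' step (0, 0)
  where
    step (n, total) w =
      let n'     = n + 1
          total' = total + fromIntegral w
      in n' `seq` total' `seq` (n', total')

main :: IO ()
main = print (summarize (word32s input))  -- prints (2,3)
  where
    input = BL.pack [0, 0, 0, 1, 0, 0, 0, 2, 9]
```

With `BL.readFile` as the source, the same fold would read the file chunk by chunk, whereas `length` applied to a list that is also used elsewhere keeps every element alive.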
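The `getMaybeWord32` helper used in the program that prints the valid parses:

```haskell
-- needs runGetOrFail from Data.Binary.Get in addition to getWord32be
getMaybeWord32 :: ByteString -> Maybe Word32
getMaybeWord32 bs = case G.runGetOrFail G.getWord32be $ BL.fromStrict bs of
  Left _            -> Nothing
  Right (_, _, w32) -> Just w32
```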

The complete pipes program that reads a `Word32` length and then a lazy `ByteString` of that length:

```haskell
module Main (main) where
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Group (folds)
import qualified Pipes.ByteString as Bytes ( splitAt, fromHandle, chunksOf )
import Control.Lens ( view ) -- or Lens.Simple (view) -- or Lens.Micro ((.^))
import qualified System.IO as IO ( IOMode(ReadMode), withFile )
import qualified Data.Binary.Get as G ( runGet, getWord32be )
import Data.ByteString ( ByteString )
import qualified Data.ByteString.Lazy.Char8 as BL
import System.Environment ( getArgs )

-- Pull the first n bytes off the producer as a lazy ByteString and
-- return them together with the rest of the stream.
splitLazy :: (Monad m, Integral n) =>
   n -> Producer ByteString m r -> m (BL.ByteString, Producer ByteString m r)
splitLazy n bs = do
  (bss, rest) <- P.toListM' $ view (Bytes.splitAt n) bs
  return (BL.fromChunks bss, rest)

-- Repeatedly read a 4-byte length, then yield a blob of that length.
measureChunks :: Monad m => Producer ByteString m r -> Producer BL.ByteString m r
measureChunks bs = do
 (lbs, rest) <- lift $ splitLazy 4 bs
 if BL.length lbs /= 4
   then rest >-> P.drain -- in fact it will be empty
   else do
     let w32 = G.runGet G.getWord32be lbs
     -- continue from the bytes after the length field, not from `bs`
     (lbs', rest') <- lift $ splitLazy w32 rest
     yield lbs'
     measureChunks rest'

main :: IO ()
main = do
  filename:_ <- getArgs
  IO.withFile filename IO.ReadMode $ \h -> do
     runEffect $ measureChunks (Bytes.fromHandle h) >-> P.print
```
