How to save a tree data structure to a binary file in Haskell


haskell functional-programming


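I am trying to save a simple (but quite big) tree structure to a binary file using Haskell. The structure looks something like this:

-- For simplicity assume each Node has only 4 children
data Tree = Node [Tree] | Leaf [Int]

Here is how I need the data to look on disk:

Each node starts with four 32-bit offsets to its children, and the children follow.

I don't care much about the leaves; assume a leaf is just n consecutive 32-bit numbers.

For practical purposes I will need node labels or some other additional data, but right now I don't care much about that either.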
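To make the layout concrete, the sizes and offsets can be computed with two small pure functions. This is only a sketch: the names `size` and `childOffsets` are made up for illustration, and it assumes that offsets are measured in bytes from the start of the node, which the description above does not state explicitly.

```haskell
data Tree = Node [Tree] | Leaf [Int]

-- | Serialized size in bytes: a leaf is n 32-bit numbers,
-- a node is one 32-bit offset per child followed by the children.
size :: Tree -> Int
size (Leaf ns) = 4 * length ns
size (Node ts) = 4 * length ts + sum (map size ts)

-- | Offset of each child, assumed to be relative to the start of the node.
-- The first child starts right after the offset table.
childOffsets :: [Tree] -> [Int]
childOffsets ts = init (scanl (+) (4 * length ts) (map size ts))
```

For four children holding 2, 1, 0 and 1 numbers, the offset table takes 16 bytes, so the children start at bytes 16, 24, 28 and 28. Calling `size` naively at every level revisits each subtree once per ancestor, which is one reason the answers below cache or return sizes instead.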

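It appears to me that a Haskeller's first choice for writing binary files is the Data.Binary.Put library. But with that I have a problem: when I am about to write a Node to the file, I need to know my current offset and the size of each child in order to write down the child offsets.

This is not something Data.Binary.Put provides, so I thought this must be a perfect application of monad transformers. But even though it sounds cool and functional, so far I have not been successful with this approach.

I asked two other questions that I thought would help me solve the problem. Each time I received very good answers that helped me progress further, but unfortunately I am still unable to solve the problem as a whole.

What I have so far still leaks too much memory to be practical.

I would love to have a solution that uses such a functional approach, but I would be grateful for any other solution as well.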

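There are two basic approaches I would consider. If the entire serialized structure fits easily in memory, you can serialize each node to a lazy bytestring and use the length of each one to calculate the offsets relative to the current position.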

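import qualified Data.ByteString.Lazy as L
import Data.Binary.Put (runPut, putWord32be)

serializeTree :: Tree -> L.ByteString
serializeTree (Leaf nums)     = runPut (mapM_ (putWord32be . fromIntegral) nums)
serializeTree (Node subtrees) = mconcat (header : childBs)
 where
  childBs = map serializeTree subtrees
  -- offsets are relative to the start of this node; the header takes 4 bytes per child
  offsets = scanl (\acc bs -> acc + L.length bs) (fromIntegral (4 * length subtrees)) childBs
  header  = runPut (mapM_ (putWord32be . fromIntegral) (init offsets))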

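The other option is, after serializing a node, to go back and rewrite the offset fields with the appropriate data. This may be the only option if the tree is large, but I don't know of a serialization library that supports it. It would involve working in IO and seeking to the correct positions.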
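A minimal sketch of that seek-back approach, using only `System.IO` and the `bytestring` package. The helper names `putW32` and `writeTree` are hypothetical, leaves are assumed to hold `Word32` values, and offsets are assumed to be relative to the start of the node; none of this comes from an existing serialization library.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Builder (toLazyByteString, word32BE)
import Data.Word (Word32)
import System.IO (Handle, IOMode (WriteMode), SeekMode (AbsoluteSeek),
                  hSeek, hTell, withBinaryFile)

-- Assumption: leaves hold Word32 values rather than Int.
data Tree = Node [Tree] | Leaf [Word32]

-- | Write one big-endian 32-bit word at the current position.
putW32 :: Handle -> Word32 -> IO ()
putW32 h = BL.hPut h . toLazyByteString . word32BE

-- | Write a tree at the current position of the handle.
writeTree :: Handle -> Tree -> IO ()
writeTree h (Leaf ns) = mapM_ (putW32 h) ns
writeTree h (Node ts) = do
  start <- hTell h
  -- Reserve the offset table with placeholder zeros.
  mapM_ (const (putW32 h 0)) ts
  -- Write each child and remember where it started, relative to this node.
  offsets <- mapM (\t -> do pos <- hTell h
                            writeTree h t
                            return (pos - start)) ts
  end <- hTell h
  -- Go back, patch the offset table, then return to the end of the node.
  hSeek h AbsoluteSeek start
  mapM_ (putW32 h . fromIntegral) offsets
  hSeek h AbsoluteSeek end
```

It could be driven with something like `withBinaryFile "tree.bin" WriteMode (\h -> writeTree h tree)`. Only the path from the root to the current node is held by the writer, so each subtree can be garbage collected once it has been written, at the cost of two seeks per node.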

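I think what you want is an explicit two-pass solution. The first pass converts your tree into a size-annotated tree. This pass forces the tree, but it needs no monadic machinery at all and can be done by tying the knot. The second pass runs in the plain old Put monad and, given that the size annotations have already been computed, should be very straightforward.

Here is an implementation of the two-pass solution proposed by sclv.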

import qualified Data.ByteString.Lazy as L
import Data.Binary.Put
import Data.Word
import Data.List (foldl')

data Tree = Node [Tree] | Leaf [Word32] deriving Show

makeTree 0 = Leaf $ replicate 100 0xdeadbeef
makeTree n = Node $ replicate 4 $ makeTree $ n-1

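SizeTree mirrors the original Tree. It contains no data, but at each node it stores the serialized size of the corresponding subtree of Tree.
We need to keep SizeTree in memory, so it is worth making it more compact (for example, by replacing the Ints with unboxed words).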
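The definition of SizeTree itself does not appear in the text. A definition consistent with how `putTree` and `mkSizeTree` use it (the constructors `SNode` and `SLeaf` and the accessor `sz`) would look like the following; the field name `children` is an assumption made for this sketch.

```haskell
-- | Size-annotated shadow of Tree: every node carries the number of
-- bytes its subtree occupies once serialized.
data SizeTree
  = SNode { sz :: Int, children :: [SizeTree] }
  | SLeaf { sz :: Int }
  deriving Show
```

Because `sz` is a field of both constructors, it works as a total accessor on any SizeTree.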

With SizeTree in memory, the original Tree can be serialized in a streaming fashion.

putTree :: Tree -> SizeTree -> Put
putTree (Node xs) (SNode _ ys) = do
  putWord8 $ fromIntegral $ length xs          -- number of children
  mapM_ (putWord32be . fromIntegral . sz) ys   -- sizes of children
  sequence_ [putTree x y | (x,y) <- zip xs ys] -- children data
putTree (Leaf xs) _ = do
  putWord8 0                                   -- zero means 'leaf'
  putWord32be $ fromIntegral $ length xs       -- data length
  mapM_ putWord32be xs                         -- leaf data


mkSizeTree :: Tree -> SizeTree
mkSizeTree (Leaf xs) = SLeaf (1 + 4 + 4 * length xs)
mkSizeTree (Node xs) = SNode (1 + 4 * length xs + sum' (map sz ys)) ys
  where
    ys = map mkSizeTree xs
    sum' = foldl' (+) 0
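To check the format written by `putTree`, the data can be read back with `Data.Binary.Get`. The following is a sketch only; `getTree` and `readTree` are hypothetical names, and the reader simply skips the size table instead of using it for random access.

```haskell
import Control.Monad (replicateM)
import Data.Binary.Get (Get, getWord32be, getWord8, runGet)
import qualified Data.ByteString.Lazy as L
import Data.Word (Word32)

data Tree = Node [Tree] | Leaf [Word32] deriving (Show, Eq)

-- | Parse one tree in the format produced by putTree:
-- a child count byte (0 means leaf), then either
-- leaf length + leaf data, or child sizes + children.
getTree :: Get Tree
getTree = do
  n <- getWord8
  if n == 0
    then do
      len <- getWord32be
      Leaf <$> replicateM (fromIntegral len) getWord32be
    else do
      -- The sizes are read and discarded; a random-access reader
      -- would use them to jump straight to a particular child.
      _sizes <- replicateM (fromIntegral n) getWord32be
      Node <$> replicateM (fromIntegral n) getTree

-- | Decode a whole file's worth of bytes.
readTree :: L.ByteString -> Tree
readTree = runGet getTree
```

Two limits of the format show up when writing the reader: a Node with no children is written with a count of 0 and so reads back as a leaf, and the single count byte caps a node at 255 children.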

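Here is an implementation using Data.Binary.Builder, which is part of the "binary" package. I haven't profiled it properly, but according to "top" it immediately allocates 108 MB and then holds on to that for the rest of the execution.

Note that I haven't tried reading the data back, so there may be latent errors in my size and offset calculations.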

-- Paste this into TreeBinary.hs, and compile with
--    ghc -O2 --make TreeBinary.hs -o TreeBinary

module Main where


import qualified Data.ByteString.Lazy as BL
import qualified Data.Binary.Builder as B

import Data.List (init)
import Data.Monoid
import Data.Word


-- -------------------------------------------------------------------
-- Test data.

data Tree = Node [Tree] | Leaf [Word32] deriving Show

-- Approximate size in memory (ignoring laziness) I think is:
-- 101 * 4^9 * sizeof(Int) + 1/3 * 4^9 * sizeof(Node)

-- This version uses [Word32] instead of [Int] to avoid having to write
-- a builder for Int.  This is an example of lazy programming instead
-- of lazy evaluation. 

makeTree :: Tree
makeTree = makeTree1 9
  where makeTree1 0 = Leaf [0..100]
        makeTree1 n = Node [ makeTree1 $ n - 1
                           , makeTree1 $ n - 1
                           , makeTree1 $ n - 1
                           , makeTree1 $ n - 1 ]

-- --------------------------------------------------------------------
-- The actual serialisation code.


-- | Given a tree, return a builder for it and its estimated length in bytes.
serialiseTree :: Tree -> (B.Builder, Word32)
serialiseTree (Leaf ns) = (mconcat (B.singleton 2 : map B.putWord32be ns), fromIntegral $ 4 * length ns + 1)
serialiseTree (Node ts) = (mconcat (B.singleton 1 : map B.putWord32be offsets ++ branches), 
                           baseLength + sum subLengths)
   where
      (branches, subLengths) = unzip $ map serialiseTree ts
      baseLength = fromIntegral $ 1 + 4 * length ts
      offsets = init $ scanl (+) baseLength subLengths


main = do
   putStrLn $ "Length = " ++ show (snd $ serialiseTree makeTree)
   BL.writeFile "test.bin" $ B.toLazyByteString $ fst $ serialiseTree makeTree

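How big is the tree, and how big do you expect the file you are creating to be? The answer determines whether you can use any kind of Put-like structure at all, or whether you need something that does a single traversal but modifies already-written parts of the structure…

Binary serialization usually needs to know the size of the data to be written (for example, lists are prefixed with their length). Can you live with a text serialization (probably bigger files)? Failing that, you could play tricks by writing intermediate files and stitching them together (horrible, but possible). Also, the input in your test code is synthetic; if your real data isn't, you will probably already have it in memory, so a normal binary serialization won't force anything that isn't already on the heap.

@sclv, the link "what I've got so far" above points to an excerpt from a bigger program I have been working on for a while. In the original program I read a binary file with a similar structure, transform it (mainly so that no node has too many children), and then want to save it back. The source files are between 50MB and 200MB, so I expect the target files to be of similar size. @stephen tetley, unfortunately the format has to stay as it is (there are hard requirements on it). I have about 4GB of memory on my development machine and I don't mind spending it on the data, but I think something beyond my understanding is eating far more memory than is needed.

Is the tree already in memory? Or is it computed lazily on demand? If the latter, your "leak" may simply be the creation of the whole tree.

This would work, but I think my solution would be more memory-efficient. My approach allows each subtree to be collected after it has been serialized, and the bytestring serialization should be much smaller than the actual tree.

@John -- you may be right. Your solution is effectively one-pass, but not fully streaming.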

The two-pass solution (putTree and mkSizeTree) is driven by the following code:

serialize mkTree size = runPut $ putTree (mkTree size) treeSize
  where
    treeSize = mkSizeTree $ mkTree size

main = L.writeFile "dump.bin" $ serialize makeTree 10
