How to save a tree data structure to a binary file in Haskell
I'm trying to save a simple (but quite big) tree structure to a binary file using Haskell. The structure looks something like this:
-- For simplicity, assume each Node has only 4 children
data Tree = Node [Tree] | Leaf [Int]
And here is how I need the data to look on disk:
Each node starts with four 32-bit offsets to its children, and the children then follow.
I don't care much about the leaves; let's say a leaf is just n consecutive 32-bit numbers.
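I would like a solution that uses this kind of functional approach, but I would appreciate any other solution as well.

There are two basic approaches I would consider. If the entire serialized structure fits easily in memory, you can serialize each node to a lazy ByteString and use the length of each one to calculate its offset relative to the current position.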
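-- putInt32 is not a library function; this helper (an assumption) writes any integral value as a big-endian 32-bit word.
putInt32 :: Integral a => a -> Put
putInt32 = putWord32be . fromIntegral

serializeTree :: Tree -> L.ByteString
serializeTree (Leaf nums) = runPut (mapM_ putInt32 nums)
serializeTree (Node subtrees) = mconcat $ header : childBs
  where
    childBs = map serializeTree subtrees
    -- The header holds one 4-byte offset per child, so the first child starts 4 * n bytes into the node.
    offsets = scanl (\acc bs -> acc + L.length bs) (fromIntegral $ 4 * length subtrees) childBs
    header = runPut (mapM_ putInt32 $ init offsets)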
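To make the offset arithmetic concrete, here is a minimal sketch that computes only the offsets from a list of child sizes. It assumes that offsets are counted in bytes from the start of the node and that each offset occupies 4 bytes; `childOffsets` is a hypothetical name, not part of the answer.

```haskell
-- | Byte offset of each child, measured from the start of its parent node,
-- given the serialized size in bytes of every child.
-- Assumption: the node header is one 4-byte offset per child.
childOffsets :: [Int] -> [Int]
childOffsets sizes = init (scanl (+) headerBytes sizes)
  where
    headerBytes = 4 * length sizes
```

For four children of 4, 8, 4, and 16 bytes the header is 16 bytes long, so the children start at offsets 16, 20, 28, and 32.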
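Another option is to go back after serializing a node and rewrite the offset fields with the appropriate data. This may be the only option if the tree is large, but I don't know of any serialization library that supports it. It would involve working in IO and seeking to the correct locations.

I think what you want is an explicit two-pass solution. The first pass converts your tree into a size-annotated tree. This pass forces the tree, but it can be done without any monadic machinery at all, by tying the knot. The second pass runs in the plain old Put monad and, given that the size annotations have already been calculated, should be very straightforward.

Here is an implementation of the two-pass solution proposed by sclv.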
import qualified Data.ByteString.Lazy as L
import Data.Binary.Put
import Data.Word
import Data.List (foldl')
data Tree = Node [Tree] | Leaf [Word32] deriving Show
makeTree 0 = Leaf $ replicate 100 0xdeadbeef
makeTree n = Node $ replicate 4 $ makeTree $ n-1
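SizeTree mimics the original Tree: it does not contain the data, but at each node it stores the size of the corresponding child in the Tree. We need to hold the SizeTree in memory, so it is worth making it more compact (for example, by replacing Ints with unboxed words).
With the SizeTree in memory, the original Tree can be serialized in a streaming fashion.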
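The SizeTree declaration itself does not appear in the listing. A definition consistent with how `SNode`, `SLeaf`, and `sz` are used in the code that follows might look like this; the field name `children` and the exact layout are assumptions reconstructed from those uses.

```haskell
-- Reconstructed from the uses of SNode, SLeaf and sz; not shown in the listing.
data SizeTree
  = SNode { sz :: Int, children :: [SizeTree] } -- serialized size of the node, then its children's sizes
  | SLeaf { sz :: Int }                         -- serialized size of the leaf
  deriving Show
```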
putTree :: Tree -> SizeTree -> Put
putTree (Node xs) (SNode _ ys) = do
  putWord8 $ fromIntegral $ length xs          -- number of children
  mapM_ (putWord32be . fromIntegral . sz) ys   -- sizes of children
  sequence_ [putTree x y | (x,y) <- zip xs ys] -- children data
putTree (Leaf xs) _ = do
  putWord8 0                                   -- zero means 'leaf'
  putWord32be $ fromIntegral $ length xs       -- data length
  mapM_ putWord32be xs                         -- leaf data
mkSizeTree :: Tree -> SizeTree
mkSizeTree (Leaf xs) = SLeaf (1 + 4 + 4 * length xs)
mkSizeTree (Node xs) = SNode (1 + 4 * length xs + sum' (map sz ys)) ys
  where
    ys = map mkSizeTree xs
    sum' = foldl' (+) 0
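One way to check that the size annotations line up is to read the format back. The following is a sketch under stated assumptions, not part of the answer: `getTree`, `getChild`, and `encodeTree` are hypothetical names, and `encodeTree` is a simple in-memory encoder included only so the example compiles on its own. It writes the same layout as putTree: a child count byte, the child sizes, then the children, with a count of zero marking a leaf.

```haskell
import qualified Data.ByteString.Lazy as L
import Control.Monad (replicateM)
import Data.Binary.Get (Get, runGet, getWord8, getWord32be, skip)
import Data.Binary.Put (runPut, putWord8, putWord32be, putLazyByteString)
import Data.Word (Word32)

data Tree = Node [Tree] | Leaf [Word32] deriving (Show, Eq)

-- | Decode a whole tree; the stored child sizes are read and ignored.
getTree :: Get Tree
getTree = do
  n <- getWord8
  if n == 0
    then do
      len <- getWord32be
      Leaf <$> replicateM (fromIntegral len) getWord32be
    else do
      _sizes <- replicateM (fromIntegral n) getWord32be
      Node <$> replicateM (fromIntegral n) getTree

-- | Decode only child k of the root, using the stored sizes to skip its
-- earlier siblings. Assumes the root is a Node with more than k children.
getChild :: Int -> Get Tree
getChild k = do
  n <- getWord8
  sizes <- replicateM (fromIntegral n) getWord32be
  skip (fromIntegral (sum (take k sizes)))
  getTree

-- | In-memory encoder producing the same layout (for testing only).
encodeTree :: Tree -> L.ByteString
encodeTree (Leaf xs) = runPut $ do
  putWord8 0
  putWord32be (fromIntegral (length xs))
  mapM_ putWord32be xs
encodeTree (Node ts) = runPut $ do
    putWord8 (fromIntegral (length ts))
    mapM_ (putWord32be . fromIntegral . L.length) kids
    mapM_ putLazyByteString kids
  where
    kids = map encodeTree ts
```

Because a zero count byte marks a leaf, a Node with no children cannot be represented in this layout, and the single count byte limits a node to 255 children.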
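Here is an implementation using Builder, which is part of the "binary" package. I haven't profiled it properly, but according to "top" it immediately allocates 108 MB and then holds on to that for the rest of the execution.
Note that I haven't tried reading the data back, so there may be latent errors in my size and offset calculations.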
-- Paste this into TreeBinary.hs, and compile with
-- ghc -O2 --make TreeBinary.hs -o TreeBinary
module Main where
import qualified Data.ByteString.Lazy as BL
import qualified Data.Binary.Builder as B
import Data.List (init)
import Data.Monoid
import Data.Word
-- -------------------------------------------------------------------
-- Test data.
data Tree = Node [Tree] | Leaf [Word32] deriving Show
-- Approximate size in memory (ignoring laziness) I think is:
-- 101 * 4^9 * sizeof(Int) + 1/3 * 4^9 * sizeof(Node)
-- This version uses [Word32] instead of [Int] to avoid having to write
-- a builder for Int. This is an example of lazy programming instead
-- of lazy evaluation.
makeTree :: Tree
makeTree = makeTree1 9
  where
    makeTree1 0 = Leaf [0..100]
    makeTree1 n = Node [ makeTree1 $ n - 1
                       , makeTree1 $ n - 1
                       , makeTree1 $ n - 1
                       , makeTree1 $ n - 1 ]
-- --------------------------------------------------------------------
-- The actual serialisation code.
-- | Given a tree, return a builder for it and its estimated length in bytes.
serialiseTree :: Tree -> (B.Builder, Word32)
serialiseTree (Leaf ns) = (mconcat (B.singleton 2 : map B.putWord32be ns), fromIntegral $ 4 * length ns + 1)
serialiseTree (Node ts) = (mconcat (B.singleton 1 : map B.putWord32be offsets ++ branches),
                           baseLength + sum subLengths)
  where
    (branches, subLengths) = unzip $ map serialiseTree ts
    baseLength = fromIntegral $ 1 + 4 * length ts
    offsets = init $ scanl (+) baseLength subLengths

main = do
  putStrLn $ "Length = " ++ show (snd $ serialiseTree makeTree)
  BL.writeFile "test.bin" $ B.toLazyByteString $ fst $ serialiseTree makeTree
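Since the author notes that the size and offset calculations were never verified, a small check can compare the predicted length with the bytes the Builder actually produces. This is only a sketch: `lengthAgrees` and `sampleTree` are hypothetical names, and serialiseTree is repeated here so the example compiles on its own.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Binary.Builder as B
import Data.Word (Word32)

data Tree = Node [Tree] | Leaf [Word32]

-- Same scheme as serialiseTree in the listing, repeated so this compiles alone.
serialiseTree :: Tree -> (B.Builder, Word32)
serialiseTree (Leaf ns) =
  (mconcat (B.singleton 2 : map B.putWord32be ns), fromIntegral (4 * length ns + 1))
serialiseTree (Node ts) =
  (mconcat (B.singleton 1 : map B.putWord32be offsets ++ branches), baseLength + sum subLengths)
  where
    (branches, subLengths) = unzip (map serialiseTree ts)
    baseLength = fromIntegral (1 + 4 * length ts)
    offsets = init (scanl (+) baseLength subLengths)

-- | True when the predicted length equals the number of bytes actually produced.
lengthAgrees :: Tree -> Bool
lengthAgrees t = BL.length (B.toLazyByteString builder) == fromIntegral predicted
  where
    (builder, predicted) = serialiseTree t

-- | A tiny hypothetical test tree: two leaves under one node.
sampleTree :: Tree
sampleTree = Node [Leaf [1, 2], Leaf [3]]
```

This confirms the total length only. Note also that this layout stores neither the number of children nor the leaf length, so a reader has to know both from elsewhere, such as the fixed four children per node in the question.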
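How big is the tree, and how big do you imagine the file you are creating will be? The answer determines whether you can use any sort of Put-style structure, or whether you need something that makes a single traversal but modifies parts of the structure that have already been written...

Binary serialization generally needs to know the size of the data to be written (for example, lists are prefixed with their length). Can you live with a text serialization (possibly a larger file)? Failing that, you could play tricks by writing intermediate files and stitching them together (horrible, but possible). Also, the input in your test code is synthetic. If your real data isn't synthetic, you may already have it in memory, so normal binary serialization won't force anything that isn't already in the heap.

@sclv, the link above, "What I've got so far", points to an excerpt from a bigger program I have been working on for a while. In the original program I read a binary file with a similar structure, transform it (mainly so that each node does not have too many children), and then want to save it back. The source files are between 50 MB and 200 MB, so I guess the target files should be of a similar size.

@stephen tetley, unfortunately the format has to stay as it is (there are some hard requirements on it). I have about 4 GB of memory on my development machine and I don't mind spending it on data, but I think something beyond my understanding is eating far more memory than is needed.

Is the tree already in memory, or is it computed lazily on demand? If the latter, your "leak" could be the creation of the whole tree.

This will work, but I think my solution will be more memory-efficient. That is because my approach allows each subtree to be collected after it has been serialized, and the ByteString serialization should be much smaller than the actual tree.

@John, you may be right. Your solution is effectively one-pass, but not fully streaming.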
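-- Driver for the two-pass implementation: mkSizeTree computes the sizes, then putTree writes the tree.
serialize mkTree size = runPut $ putTree (mkTree size) treeSize
  where
    treeSize = mkSizeTree $ mkTree size
main = L.writeFile "dump.bin" $ serialize makeTree 10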
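The thread gives no code for the seek-based alternative mentioned in the first answer, where placeholder offsets are written first and patched once the children's positions are known. A minimal sketch of that idea follows. It assumes offsets are big-endian 32-bit byte counts measured from the start of each node; `hPutWord32`, `writeTree`, and `writeTreeFile` are hypothetical names.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Binary.Put (runPut, putWord32be)
import Data.Word (Word32)
import System.IO (Handle, IOMode (WriteMode), SeekMode (AbsoluteSeek), hSeek, hTell, withBinaryFile)

data Tree = Node [Tree] | Leaf [Word32]

-- | Write one big-endian 32-bit word at the handle's current position.
hPutWord32 :: Handle -> Word32 -> IO ()
hPutWord32 h = L.hPut h . runPut . putWord32be

-- | Write a tree in a single traversal, patching each node's offset table
-- after its children have been written.
writeTree :: Handle -> Tree -> IO ()
writeTree h (Leaf ws) = mapM_ (hPutWord32 h) ws
writeTree h (Node ts) = do
  start <- hTell h
  mapM_ (\_ -> hPutWord32 h 0) ts       -- reserve the offset table with placeholders
  offsets <- mapM (writeChild start) ts -- write children, recording where each begins
  end <- hTell h
  hSeek h AbsoluteSeek start            -- go back and fill in the real offsets
  mapM_ (hPutWord32 h) offsets
  hSeek h AbsoluteSeek end              -- resume appending after this node
  where
    writeChild start t = do
      pos <- hTell h
      writeTree h t
      return (fromIntegral (pos - start))

-- | Open a file in binary mode and write the tree to it.
writeTreeFile :: FilePath -> Tree -> IO ()
writeTreeFile path t = withBinaryFile path WriteMode (\h -> writeTree h t)
```

This never holds the serialized output in memory, at the cost of two seeks per node and many small writes. Batching the leaf words into a single write would likely matter for files in the 50 MB to 200 MB range discussed in the comments.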