Haskell: Improving memory usage during serialization (Data.Binary)
Tags: haskell, memory, serialization, memory-leaks


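I'm still new to Haskell and learn something new every day. My problem is that memory usage is too high when I serialize with the Data.Binary library. Maybe I'm just using the library the wrong way, but I can't figure it out.

The actual idea is that I read binary data from disk, add new data, and then write everything back to disk. Here is the code: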

module Main
  where

import Data.Binary
import System.Environment
import Data.List (foldl')

data DualNo = DualNo Int Int deriving (Show)

instance Data.Binary.Binary DualNo where
  put (DualNo a b) = do
    put a
    put b
  get = do
    a <- get
    b <- get
    return (DualNo a b)

-- read DualNo from HDD
readData :: FilePath -> IO [DualNo]
readData filename = do
  no <- decodeFile filename :: IO [DualNo]
  return no

-- write DualNo to HDD
writeData  :: [DualNo] -> String -> IO ()
writeData no filename = encodeFile filename (no :: [DualNo])

writeEmptyDataToDisk :: String -> IO ()
writeEmptyDataToDisk filename = writeData [] filename

-- feed the list with a new dataset (new entries are prepended, in reverse order)
feedWithInputData :: [DualNo] -> [(Int, Int)] -> [DualNo]
feedWithInputData existData newData = foldl' func existData newData
  where
    func dataset (a,b) = DualNo a b : dataset

main :: IO ()
main = do
  [newInputData, toPutIntoExistingData] <- System.Environment.getArgs
  if toPutIntoExistingData == "empty"
    then writeEmptyDataToDisk "myData.dat"
    else return ()
  loadedData <- readData "myData.dat"
  newData <- return (case newInputData of
                        "dataset1" -> feedWithInputData loadedData dataset1
                        "dataset2" -> feedWithInputData loadedData dataset2
                        _          -> feedWithInputData loadedData dataset3)
  writeData newData "myData.dat"

dataset1 = zip [1..100000]    [2,4..200000]
dataset2 = zip [5,10..500000] [3,6..300000]
dataset3 = zip [4,8..400000]  [6,12..600000]
Looking at the prof file:

    Tue Apr 12 18:11 2016 Time and Allocation Profiling Report  (Final)

       Main +RTS -p -sstderr -RTS dataset1 empty

    total time  =        0.06 secs   (60 ticks @ 1000 us, 1 processor)
    total alloc = 102,613,008 bytes  (excludes profiling overheads)

COST CENTRE            MODULE    %time %alloc

put                    Main       48.3   53.0
writeData              Main       30.0   18.8
dataset1               Main       13.3   23.4
feedWithInputData      Main        6.7    0.0
feedWithInputData.func Main        1.7    4.7


                                                                    individual     inherited
COST CENTRE               MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN                      MAIN                     68           0    0.0    0.0   100.0  100.0
 main                     Main                    137           0    0.0    0.0    86.7   76.6
  feedWithInputData       Main                    150           1    6.7    0.0     8.3    4.7
   feedWithInputData.func Main                    154      100000    1.7    4.7     1.7    4.7
  writeData               Main                    148           1   30.0   18.8    78.3   71.8
   put                    Main                    155      100000   48.3   53.0    48.3   53.0
  readData                Main                    147           0    0.0    0.1     0.0    0.1
  writeEmptyDataToDisk    Main                    142           0    0.0    0.0     0.0    0.1
   writeData              Main                    143           0    0.0    0.1     0.0    0.1
 CAF:main1                Main                    133           0    0.0    0.0     0.0    0.0
  main                    Main                    136           1    0.0    0.0     0.0    0.0
 CAF:main2                Main                    132           0    0.0    0.0     0.0    0.0
  main                    Main                    139           0    0.0    0.0     0.0    0.0
   writeEmptyDataToDisk   Main                    140           1    0.0    0.0     0.0    0.0
    writeData             Main                    141           1    0.0    0.0     0.0    0.0
 CAF:main7                Main                    131           0    0.0    0.0     0.0    0.0
  main                    Main                    145           0    0.0    0.0     0.0    0.0
   readData               Main                    146           1    0.0    0.0     0.0    0.0
 CAF:dataset1             Main                    123           0    0.0    0.0     5.0    7.8
  dataset1                Main                    151           1    5.0    7.8     5.0    7.8
 CAF:dataset4             Main                    122           0    0.0    0.0     5.0    7.8
  dataset1                Main                    153           0    5.0    7.8     5.0    7.8
 CAF:dataset5             Main                    121           0    0.0    0.0     3.3    7.8
  dataset1                Main                    152           0    3.3    7.8     3.3    7.8
 CAF:main4                Main                    116           0    0.0    0.0     0.0    0.0
  main                    Main                    138           0    0.0    0.0     0.0    0.0
 CAF:main6                Main                    115           0    0.0    0.0     0.0    0.0
  main                    Main                    149           0    0.0    0.0     0.0    0.0
 CAF:main3                Main                    113           0    0.0    0.0     0.0    0.0
  main                    Main                    144           0    0.0    0.0     0.0    0.0
 CAF                      GHC.Conc.Signal         107           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Encoding         103           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Encoding.Iconv   101           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Handle.FD         94           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.FD                86           0    0.0    0.0     0.0    0.0
Now I add more data:

$ ./Main dataset2 myData.dat +RTS -p -sstderr
     343,601,008 bytes allocated in the heap
     175,650,728 bytes copied during GC
      34,113,936 bytes maximum residency (8 sample(s))
         971,896 bytes maximum slop
              78 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       640 colls,     0 par    0.082s   0.083s     0.0001s    0.0017s
  Gen  1         8 colls,     0 par    0.140s   0.141s     0.0176s    0.0484s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.138s  (  0.139s elapsed)
  GC      time    0.221s  (  0.224s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.006s  (  0.006s elapsed)
  Total   time    0.370s  (  0.370s elapsed)

  %GC     time      59.8%  (60.5% elapsed)

  Alloc rate    2,485,518,518 bytes per MUT second

  Productivity  39.9% of total user, 39.8% of total elapsed
Looking at the new prof file:

    Tue Apr 12 18:15 2016 Time and Allocation Profiling Report  (Final)

       Main +RTS -p -sstderr -RTS dataset2 myData.dat

    total time  =        0.14 secs   (139 ticks @ 1000 us, 1 processor)
    total alloc = 213,866,232 bytes  (excludes profiling overheads)

COST CENTRE            MODULE    %time %alloc

put                    Main       41.0   50.9
writeData              Main       25.9   18.0
get                    Main       25.2   16.8
dataset2               Main        4.3   11.2
readData               Main        1.4    0.8
feedWithInputData.func Main        1.4    2.2


                                                                    individual     inherited
COST CENTRE               MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN                      MAIN                     68           0    0.0    0.0   100.0  100.0
 main                     Main                    137           0    0.0    0.0    95.7   88.8
  feedWithInputData       Main                    148           1    0.7    0.0     2.2    2.2
   feedWithInputData.func Main                    152      100000    1.4    2.2     1.4    2.2
  writeData               Main                    145           1   25.9   18.0    66.9   68.9
   put                    Main                    153      200000   41.0   50.9    41.0   50.9
  readData                Main                    141           0    1.4    0.8    26.6   17.6
   get                    Main                    144           0   25.2   16.8    25.2   16.8
 CAF:main1                Main                    133           0    0.0    0.0     0.0    0.0
  main                    Main                    136           1    0.0    0.0     0.0    0.0
 CAF:main7                Main                    131           0    0.0    0.0     0.0    0.0
  main                    Main                    139           0    0.0    0.0     0.0    0.0
   readData               Main                    140           1    0.0    0.0     0.0    0.0
 CAF:dataset2             Main                    126           0    0.0    0.0     0.7    3.7
  dataset2                Main                    149           1    0.7    3.7     0.7    3.7
 CAF:dataset6             Main                    125           0    0.0    0.0     2.2    3.7
  dataset2                Main                    151           0    2.2    3.7     2.2    3.7
 CAF:dataset7             Main                    124           0    0.0    0.0     1.4    3.7
  dataset2                Main                    150           0    1.4    3.7     1.4    3.7
 CAF:$fBinaryDualNo1      Main                    120           0    0.0    0.0     0.0    0.0
  get                     Main                    143           1    0.0    0.0     0.0    0.0
 CAF:main4                Main                    116           0    0.0    0.0     0.0    0.0
  main                    Main                    138           0    0.0    0.0     0.0    0.0
 CAF:main6                Main                    115           0    0.0    0.0     0.0    0.0
  main                    Main                    146           0    0.0    0.0     0.0    0.0
 CAF:main5                Main                    114           0    0.0    0.0     0.0    0.0
  main                    Main                    147           0    0.0    0.0     0.0    0.0
 CAF:main3                Main                    113           0    0.0    0.0     0.0    0.0
  main                    Main                    142           0    0.0    0.0     0.0    0.0
 CAF                      GHC.Conc.Signal         107           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Encoding         103           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Encoding.Iconv   101           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Handle.FD         94           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.FD                86           0    0.0    0.0     0.0    0.0
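The more often I add new data, the higher the memory usage gets. Obviously I need more memory for a larger data set. But is there no better solution to this problem (such as writing the data back to disk gradually)?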
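One way to write data back gradually is to append encoded records to the file instead of decoding and re-encoding the whole list. The following is only a minimal sketch under stated assumptions: it changes the file format (records are written back to back, without the length prefix that the list instance of Binary emits), and `appendRecords`, `decodeAll` and the file name `myData.stream` are hypothetical names that do not appear in the original program.

```haskell
module Main where

import Data.Binary (Binary (..), encode)
import Data.Binary.Get (runGetOrFail)
import qualified Data.ByteString.Lazy as BL

data DualNo = DualNo Int Int deriving (Show, Eq)

instance Binary DualNo where
  put (DualNo a b) = put a >> put b
  get = DualNo <$> get <*> get

-- Append records to the end of the file without reading the existing
-- contents. Hypothetical helper: records are stored back to back.
appendRecords :: FilePath -> [DualNo] -> IO ()
appendRecords path records = BL.appendFile path (foldMap encode records)

-- Lazily decode records until the input is exhausted.
-- A decoding error simply ends the list in this sketch.
decodeAll :: BL.ByteString -> [DualNo]
decodeAll bytes
  | BL.null bytes = []
  | otherwise =
      case runGetOrFail get bytes of
        Left _ -> []
        Right (rest, _, record) -> record : decodeAll rest

main :: IO ()
main = do
  let path = "myData.stream" -- hypothetical file name
  BL.writeFile path BL.empty
  appendRecords path [DualNo a b | (a, b) <- zip [1 .. 3] [2, 4 .. 6]]
  appendRecords path [DualNo 5 3]
  records <- decodeAll <$> BL.readFile path
  print (length records)
```

With this layout, adding a data set costs memory proportional to the new records only, because the existing file is never loaded. The trade-off is that the file is no longer readable with `decodeFile` as a `[DualNo]`.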

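Edit: Actually, the thing that bothers me most is the following observation:

  • I run the program for the first time and add new data to the existing (empty) file on disk. The file saved on disk is 1.53 MB. But (looking at the first prof file) the program allocated more than 102 MB. More than 50% of that was allocated by the put function from the Data.Binary package.
  • I run the program again and add new data to the existing (non-empty) file on disk. The file saved on disk is 3.05 MB. But (looking at the second prof file) the program allocated more than 213 MB. More than 66% of that was allocated by the put and get functions together.
  • => Conclusion: in the first example, the memory the program needs is 102/1.53 ≈ 67 times the space the binary file takes on disk. In the second example it is 213/3.05 ≈ 70 times.

    Question: Is the Data.Binary package so efficient at serialization (and so awesome) that it shrinks the required space to this extent? Equivalent question:
    Do I really need that much more memory to load the data in the program than the same data takes in the binary file on disk?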
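As a rough cross-check of the sizes involved, the sketch below compares the encoded size of 100000 pairs with an estimate of what the same data occupies as a fully evaluated `[DualNo]` on the heap. It rests on assumptions: a 64-bit GHC, lazy (boxed) `Int` fields, and the usual closure layout of one header word plus one word per field. `wordSize` and `bytesPerElement` are illustrative names, and a list of `(Int, Int)` is used as a stand-in because the `put` in the question writes the same two `Int`s per element.

```haskell
module Main where

import Data.Binary (encode)
import qualified Data.ByteString.Lazy as BL

-- Bytes per machine word; assumes a 64-bit GHC.
wordSize :: Int
wordSize = 8

-- Estimated heap bytes for one list element holding a `DualNo Int Int`
-- with boxed fields: cons cell (3 words) + DualNo constructor (3 words)
-- + two boxed Ints (2 words each).
bytesPerElement :: Int
bytesPerElement = (3 + 3 + 2 * 2) * wordSize

main :: IO ()
main = do
  let n = 100000
      -- Same byte layout as the question's [DualNo]: an 8-byte length
      -- prefix followed by two 8-byte Ints per element.
      onDisk = BL.length (encode (zip [1 .. n] [2, 4 .. 2 * n] :: [(Int, Int)]))
  putStrLn ("encoded bytes:        " ++ show onDisk)
  putStrLn ("estimated heap bytes: " ++ show (n * bytesPerElement))
```

Under these assumptions the live data is only a small multiple of the file size. The estimate covers the evaluated list alone; it ignores thunks, encoder buffers and the copies a copying garbage collector makes, and it says nothing about total allocation, which also counts all short-lived garbage.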

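    If you don't need to hold the whole file in memory at once, you could look into streaming. This might help: [link]

    Larger data set => more memory? Not if you stream the data, and I think that is exactly why binary uses lazy ByteStrings. Check the laziness in your code (where you force or share values). Why does feedWithInputData reverse your input list? That won't stream.

    Thanks for the streaming hint, but I don't think streaming is an option. The code here is actually a simplified example of the project I'm working on. Here I use a simple data structure like: data DualNo = DualNo Int.

    102 MB is the memory the program allocated, not the memory it occupies (the heap peak). Mostly, some data may have to be copied and moved during GC. According to the heap profile, the program's heap peak (dataset1 empty) is about 10 MB. For how to generate a heap profile of your program, see [link].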
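The difference the last comment points out, between bytes allocated over the whole run and bytes live at the peak, can also be read from inside a program. This is a minimal sketch using `GHC.Stats` from base; it assumes GHC 8.2 or later and that the program is started with `+RTS -T`. The `workload` is a hypothetical stand-in, not the code from the question.

```haskell
module Main where

import Data.List (foldl')
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)
import System.Mem (performGC)

-- Stand-in workload: without optimisation this allocates many
-- short-lived list cells but keeps almost nothing alive.
workload :: Int
workload = foldl' (+) 0 [1 .. 5000000]

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "Run with +RTS -T to enable RTS statistics."
    else do
      print workload
      performGC -- force a major GC so the residency figures are up to date
      stats <- getRTSStats
      putStrLn ("allocated_bytes: " ++ show (allocated_bytes stats))
      putStrLn ("max_live_bytes:  " ++ show (max_live_bytes stats))
```

`allocated_bytes` corresponds to the "bytes allocated in the heap" line of the `-sstderr` output, and `max_live_bytes` to "maximum residency". Only the latter describes how much memory the data actually occupies at once.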