
Multithreading: Haskell/GHC per-thread memory cost


I am trying to understand just how expensive a (green) thread in Haskell (GHC 7.10.1 on OS X 10.10.5) really is. I am aware that, compared to a real OS thread, it is supposed to be very cheap, both in memory usage and in CPU.

Right, so I started writing a super simple program that forks n (green) threads (using the excellent async library) and then just lets every thread sleep for m seconds.

It's as simple as it gets:

$ cat PerTheadMem.hs 
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (mapConcurrently)
import System.Environment (getArgs)

main = do
    args <- getArgs
    let (numThreads, sleep) = case args of
                                numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                _ -> error "wrong args"
    -- fork one thread per list element, let each sleep, and wait for all of them
    mapConcurrently (\_ -> threadDelay (sleep*1000*1000)) [1..numThreads]
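(The question doesn't show the compile line for this first, non-threaded run; judging from the threaded invocation and Edit 1 below, it was presumably the same command minus -threaded, e.g.:)

$ ghc -rtsopts -O3 -prof -auto-all -caf-all PerTheadMem.hs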
This forks 100k threads, waits 10 s in each of them, and then prints us some information:

$ time ./PerTheadMem 100000 10 +RTS -sstderr
340,942,368 bytes allocated in the heap
880,767,000 bytes copied during GC
164,702,328 bytes maximum residency (11 sample(s))
21,736,080 bytes maximum slop
350 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed)  Avg pause  Max pause
Gen  0       648 colls,     0 par    0.373s   0.415s     0.0006s    0.0223s
Gen  1        11 colls,     0 par    0.298s   0.431s     0.0392s    0.1535s

INIT    time    0.000s  (  0.000s elapsed)
MUT     time   79.062s  ( 92.803s elapsed)
GC      time    0.670s  (  0.846s elapsed)
RP      time    0.000s  (  0.000s elapsed)
PROF    time    0.000s  (  0.000s elapsed)
EXIT    time    0.065s  (  0.091s elapsed)
Total   time   79.798s  ( 93.740s elapsed)

%GC     time       0.8%  (0.9% elapsed)

Alloc rate    4,312,344 bytes per MUT second

Productivity  99.2% of total user, 84.4% of total elapsed


real    1m33.757s
user    1m19.799s
sys 0m2.260s
It took relatively long (1m33.757s), fair enough, given that each thread only waits for 10 s, but we've built it non-threaded so far. In total we used 350 MB, which isn't too bad; that's 3.5 kB per thread. Given the initial thread stack size (the -ki RTS option, 1 kB by default), that is in the expected ballpark.

Right, but now let's compile it in threaded mode and see if we can get faster:

$ ghc -rtsopts -O3 -prof -auto-all -caf-all -threaded PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr
3,996,165,664 bytes allocated in the heap
2,294,502,968 bytes copied during GC
3,443,038,400 bytes maximum residency (20 sample(s))
14,842,600 bytes maximum slop
3657 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed)  Avg pause  Max pause
Gen  0      6435 colls,     0 par    0.860s   1.022s     0.0002s    0.0028s
Gen  1        20 colls,     0 par    2.206s   2.740s     0.1370s    0.3874s

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.000s  (  0.001s elapsed)
MUT     time    0.879s  (  8.534s elapsed)
GC      time    3.066s  (  3.762s elapsed)
RP      time    0.000s  (  0.000s elapsed)
PROF    time    0.000s  (  0.000s elapsed)
EXIT    time    0.074s  (  0.247s elapsed)
Total   time    4.021s  ( 12.545s elapsed)

Alloc rate    4,544,893,364 bytes per MUT second

Productivity  23.7% of total user, 7.6% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

real    0m12.565s
user    0m4.021s
sys 0m1.154s
Wow, much faster, only 12 s now, way better. From Activity Monitor I saw that it used roughly 4 OS threads for the 100k green threads, which makes sense.
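As an aside, a program can report from the inside how many capabilities the threaded RTS was given. A minimal sketch (getNumCapabilities reflects the -N setting; the RTS decides on the actual number of OS worker threads itself, like the ~4 observed here):

import Control.Concurrent (getNumCapabilities, rtsSupportsBoundThreads)

main :: IO ()
main = do
    putStrLn ("threaded RTS: " ++ show rtsSupportsBoundThreads)
    caps <- getNumCapabilities
    putStrLn ("capabilities (-N): " ++ show caps)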

But: 3657 MB of total memory! That's 10x more than the non-threaded version used.
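Spelled out per thread, that is:

350 MB / 100,000 threads ≈ 3.5 kB per thread (non-threaded)
3657 MB / 100,000 threads ≈ 36.6 kB per thread (threaded)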

So far I had not done any profiling with -prof or -hy or the like. To investigate a bit more, I did some heap profiling (-hy) in separate runs. Memory usage didn't change in either case, and the heap profiling graphs look interestingly different (left: non-threaded, right: threaded; the images are not preserved in this copy), but I cannot find the reason for the 10x difference.

Diffing the profiling output (the .prof files), I could not find any real difference either.
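(For context, a heap-profile run of this kind would look roughly like the following; the exact invocation is my assumption, the question doesn't show it:)

$ ghc -rtsopts -prof -auto-all -caf-all PerTheadMem.hs
$ ./PerTheadMem 100000 10 +RTS -hy
$ hp2ps -c PerTheadMem.hp    # render the .hp profile to PostScript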

So my question is: where does the 10x difference in memory usage come from?

Edit: Just to mention it: the same difference applies when the program isn't even compiled with profiling support. So running time ./PerTheadMem 100000 10 +RTS -sstderr built with ghc -rtsopts -threaded -fforce-recomp PerTheadMem.hs uses 3559 MB, and built with ghc -rtsopts -fforce-recomp PerTheadMem.hs it's 395 MB.

Edit 2: The same happens on Linux (GHC 7.10.2 on Linux 3.13.0-32-generic #57 Ubuntu SMP, x86_64): non-threaded 460 MB in 1m28.538s, threaded 3483 MB in 12.604s. /usr/bin/time -v reports Maximum resident set size (kbytes): 413684 and Maximum resident set size (kbytes): 1645384 respectively.

Edit 3: I also changed the program to use forkIO directly:

import Control.Concurrent (threadDelay, forkIO)
import Control.Concurrent.MVar
import Control.Monad (mapM_)
import System.Environment (getArgs)

main = do
    args <- getArgs
    let (numThreads, sleep) = case args of
                                numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                _ -> error "wrong args"
    mvar <- newEmptyMVar
    -- each thread sleeps, then signals completion on the shared MVar
    mapM_ (\_ -> forkIO $ threadDelay (sleep*1000*1000) >> putMVar mvar ())
          [1..numThreads]
    -- take one () per thread, i.e. wait for all of them
    mapM_ (\_ -> takeMVar mvar) [1..numThreads]

And it doesn't change anything: non-threaded: 152 MB, threaded: 3308 MB.

IMHO, the culprit is threadDelay. threadDelay uses a lot of memory. Here is a program equivalent to yours that behaves better with respect to memory. It ensures that all the threads are running concurrently by giving them a long-running computation instead:

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import System.Environment (getArgs)

uBound, lBound :: Integer
uBound = 38
lBound = 34

doSomething :: Integer -> Integer
doSomething 0 = 1
doSomething 1 = 1
doSomething n | n < uBound && n > 0 = let
                  a = doSomething (n-1)
                  b = doSomething (n-2)
                in a `seq` b `seq` (a + b)   -- force both branches to avoid thunk build-up
              | otherwise = doSomething (n `mod` uBound)

e :: Chan Integer -> Int -> IO ()
e chan i =
    do
        let y = doSomething . fromIntegral $ lBound + (fromIntegral i `mod` (uBound - lBound))
        y `seq` writeChan chan y   -- evaluate in this thread, then report the result

main =
    do
        args <- getArgs
        let (numThreads, sleep) = case args of
                                    numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                    _ -> error "wrong args"
            dld = (sleep*1000*1000)   -- kept for interface parity; unused here
        chan <- newChan
        mapM_ (\i -> forkIO $ e chan i) [1..numThreads]
        putStrLn "All threads created"
        mapM_ (\_ -> readChan chan >>= putStrLn . show) [1..numThreads]
        putStrLn "All read"
Maximum residency is around 1.5 kB per thread. I toyed a bit with the number of threads and the run length of the computation: since threads start doing work immediately after forkIO, creating 100,000 of them actually takes a very long time, but the results held for 1,000 threads.

Here is another program where threadDelay has been "factored out"; this one doesn't use any CPU and can be executed easily with 100,000 threads:

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import System.Environment (getArgs)

-- each thread blocks on its own "start" MVar, then signals its "end" MVar
e :: MVar () -> MVar () -> IO ()
e start end =
    do
        takeMVar start
        putMVar end ()

main = 
    do
        args <- getArgs
        let (numThreads, sleep) = case args of
                                    numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                    _ -> error "wrong args"
        starts <- mapM (const newEmptyMVar ) [1..numThreads]
        ends <- mapM (const newEmptyMVar ) [1..numThreads]
        mapM_ (\ (start,end) -> forkIO $ e start end) (zip starts ends)
        mapM_ (\ start -> putMVar start () ) starts
        putStrLn "All threads created"
        threadDelay (sleep * 1000 * 1000)
        mapM_ (\ end -> takeMVar end ) ends
        putStrLn "All done"
On my i5 it takes less than one second to create the 100,000 threads and put the "start" MVars. Peak residency is around 778 bytes per thread (77,844,160 bytes of maximum residency across 100,000 threads, see the stats at the end), not bad at all!


Checking threadDelay's implementation, we see that it is effectively different in the threaded and the non-threaded case (the answer pointed at the relevant base/GHC sources; those links were not preserved in this copy).

One of the code paths looks innocent enough. But older versions of base contained an arcane spelling of (memory) doom for those who invoke threadDelay.

Whether there is still an issue is hard to tell. However, one can always hope that a "real life" concurrent program won't need to have too many threads waiting on threadDelay at the same time. I, for one, will keep an eye on my usage of threadDelay from now on.
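If threadDelay itself is the culprit, a further workaround (my sketch, not from the original answer; whether it avoids the residency blow-up on GHC 7.10 would need measuring) is to let all threads block on one shared timer via registerDelay and STM, instead of issuing 100,000 individual delays:

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM (atomically, check, readTVar)
import Control.Concurrent.STM.TVar (registerDelay)
import Control.Monad (replicateM_)

main :: IO ()
main = do
    let numThreads = 100000
    -- one timer for everybody: the TVar flips to True after 10 s
    -- (registerDelay requires the threaded RTS)
    timeout <- registerDelay (10 * 1000 * 1000)
    done <- newEmptyMVar
    replicateM_ numThreads $ forkIO $ do
        atomically (readTVar timeout >>= check)  -- block until the timer fires
        putMVar done ()
    replicateM_ numThreads (takeMVar done)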

I wonder how much overhead the profiling adds. Under Linux you can persuade time to output memory statistics. What happens if you compile without profiling and ask the OS for memory statistics?

@MathematicalOrchid I ran it four times in total, twice without profiling (1x threaded / 1x non-threaded) and twice with. The -sstderr output didn't change; the pictures are from the latter two runs. I also checked the memory usage in Activity Monitor and couldn't see a big difference between with and without profiling.

OK, it was worth a try. I'm out of ideas now. :-}

@MathematicalOrchid by the way, the same happens when I don't even compile with profiling support (no -prof -auto-all -caf-all).

Can you check which closures are taking up all the memory? I don't think your code does anything suspicious, but a large memory footprint is often the result of non-strict evaluation... I would try removing mapConcurrently to see what happens; spawning the threads with mapM_ and forkIO isn't that much work...

Wow! I can confirm: I just changed the program to use MVars and the new numbers are 221 MB non-threaded and 282 MB threaded. Never thought threadDelay could be the problem. Thanks a lot!
For reference, here are the stats for the first program (the CPU-bound one), run with 200 threads:

 $ ghc -rtsopts -O -threaded  test.hs
 $ ./test 200 10 +RTS -sstderr -N4

 133,541,985,480 bytes allocated in the heap
     176,531,576 bytes copied during GC
         356,384 bytes maximum residency (16 sample(s))
          94,256 bytes maximum slop
               4 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     64246 colls, 64246 par    1.185s   0.901s     0.0000s    0.0274s
  Gen  1        16 colls,    15 par    0.004s   0.002s     0.0001s    0.0002s

  Parallel GC work balance: 65.96% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time   63.747s  ( 16.333s elapsed)
  GC      time    1.189s  (  0.903s elapsed)
  EXIT    time    0.001s  (  0.000s elapsed)
  Total   time   64.938s  ( 17.239s elapsed)

  Alloc rate    2,094,861,384 bytes per MUT second

  Productivity  98.2% of total user, 369.8% of total elapsed

gc_alloc_block_sync: 98548
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2
And here are the stats for the second program (the MVar one, identical to the listing above), run with 100,000 threads:
     129,270,632 bytes allocated in the heap
     404,154,872 bytes copied during GC
      77,844,160 bytes maximum residency (10 sample(s))
      10,929,688 bytes maximum slop
             165 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       128 colls,   128 par    0.178s   0.079s     0.0006s    0.0152s
  Gen  1        10 colls,     9 par    0.367s   0.137s     0.0137s    0.0325s

  Parallel GC work balance: 50.09% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.189s  ( 10.094s elapsed)
  GC      time    0.545s  (  0.217s elapsed)
  EXIT    time    0.001s  (  0.002s elapsed)
  Total   time    0.735s  ( 10.313s elapsed)

  Alloc rate    685,509,460 bytes per MUT second

  Productivity  25.9% of total user, 1.8% of total elapsed