Performance discrepancy between compiled accelerate code run from ghci and from the shell
Question

Hello, I am using the accelerate library to create an application that lets users interactively call image-processing functions, which is why I extended ghci using the GHC API. The problem is that when the compiled executable is run from the shell, the computation completes in under 100 ms (a bit under 80 on average), while running the same compiled code from ghci takes more than 100 ms (a bit over 140 on average):
$ ghc -O2 Main.hs -o main -threaded
[1 of 1] Compiling Main ( Main.hs, Main.o )
Linking main ...
$ ./main
Array (Z) [1000001.0]
0.092906s
$ ghc -O2 Main.hs -c -dynamic
$ ghci Main
ghci> main
Array (Z) [1000001.0]
0.258224s
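The Main.hs being benchmarked is only linked from the question, not shown. As a rough, base-only stand-in (no accelerate involved, so the timings will not match), the shape of the benchmark is: force a fold over a million-element input and print the result and the wall-clock time, which is what produces the `Array (Z) [1000001.0]` / `0.092906s` pairs above. Everything in this sketch is hypothetical except the printed value:

```haskell
import Control.Exception (evaluate)
import Data.List (foldl')
import Data.Time.Clock (diffUTCTime, getCurrentTime)

main :: IO ()
main = do
  start <- getCurrentTime
  -- Stand-in for the accelerate fold over the input vector: summing
  -- 1000001 ones yields the 1000001.0 that the real program prints.
  r <- evaluate (foldl' (+) 0 (replicate 1000001 (1 :: Double)))
  print r
  -- Wall-clock time, printed in the same style as the logs (e.g. "0.092906s").
  end <- getCurrentTime
  print (diffUTCTime end start)
```

Timing with wall-clock `getCurrentTime` around an `evaluate` matches how the logs above report TOTAL time, as opposed to CPU time.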
Sample code + execution log:

Explanation
First of all: the tests were run after the CUDA kernels had already been compiled (kernel compilation itself adds about 2 seconds, but that is not what is being measured here).

When the compiled executable is run from the shell, the computation completes in under 10 ms. (The first and second shell runs were passed different arguments, to make sure the data was not cached anywhere.)

When I try to run the same code from ghci and process the input data, the computation takes more than 100 ms. I understand that interpreted code is slower than compiled code, but I load the same compiled code into the ghci session and call the same top-level binding (packedFunction). I have typed it explicitly to make sure it is specialized (with the same result as using the SPECIALIZE pragma).

However, if I run the main function in ghci (even when changing the input data between successive calls with :set args), the computation also takes less than 10 ms.

Main.hs is compiled with ghc Main.hs -O2 -dynamic -threaded.

I would like to know where the overhead comes from. Does anyone have a suggestion as to why this happens?
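The SPECIALIZE pragma mentioned above, applied to a hypothetical polymorphic binding (packedFunction itself is not shown in the question), looks like this. The pragma asks GHC to compile a monomorphic copy of the function so that call sites at that type avoid passing a type-class dictionary at runtime:

```haskell
module Main where

-- Hypothetical polymorphic function standing in for packedFunction,
-- which is not shown in the question.
scaleAll :: Num a => a -> [a] -> [a]
scaleAll k = map (k *)

-- Ask GHC to generate a monomorphic Double copy of scaleAll, so calls
-- at this type do not pass a Num dictionary at runtime.
{-# SPECIALIZE scaleAll :: Double -> [Double] -> [Double] #-}

main :: IO ()
main = print (scaleAll 2 [1, 2, 3 :: Double])  -- prints [2.0,4.0,6.0]
```

Note that with -O2 this specialization often happens automatically within a module; the pragma mainly matters across module boundaries, which may be why adding it made no difference here.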
A simplified version of the example was posted by:

But when I precompile it and run it in the interpreter, it takes 0.25 s:
$ ghc -O2 Main.hs -c -dynamic
$ ghci Main
ghci> main
Array (Z) [1000001.0]
0.258224s
I looked into accelerate and accelerate-cuda, and inserted some debug code to measure times both under ghci and in the compiled, optimized version. The results are below; you can see the call trace and the execution times.

ghci run:
$ ghc -O2 -dynamic -c -threaded Main.hs && ghci
GHCi, version 7.8.3: http://www.haskell.org/ghc/ :? for help
…
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Ok, modules loaded: Main.
Prelude Main> Loading package transformers-0.3.0.0 ... linking ... done.
…
Loading package array-0.5.0.0 ... linking ... done.
(...)
Loading package accelerate-cuda-0.15.0.0 ... linking ... done.
>>>>> run
>>>>> runAsyncIn.execute
>>>>> runAsyncIn.seq ctx
<<<<< runAsyncIn.seq ctx: 4.1609e-2 CPU 0.041493s TOTAL
>>>>> runAsyncIn.seq a
<<<<< runAsyncIn.seq a: 1.0e-6 CPU 0.000001s TOTAL
>>>>> runAsyncIn.seq acc
>>>>> convertAccWith True
<<<<< convertAccWith: 0.0 CPU 0.000017s TOTAL
<<<<< runAsyncIn.seq acc: 2.68e-4 CPU 0.000219s TOTAL
>>>>> evalCUDA
>>>>> push
<<<<< push: 0.0 CPU 0.000002s TOTAL
>>>>> evalStateT
>>>>> runAsyncIn.compileAcc
>>>>> compileOpenAcc
>>>>> compileOpenAcc.traveuseAcc.Alet
>>>>> compileOpenAcc.traveuseAcc.Use
>>>>> compileOpenAcc.traveuseAcc.use3
>>>>> compileOpenAcc.traveuseAcc.use1
<<<<< compileOpenAcc.traveuseAcc.use1: 0.0 CPU 0.000001s TOTAL
>>>>> compileOpenAcc.traveuseAcc.use2
>>>>> compileOpenAcc.traveuseAcc.seq arr
<<<<< compileOpenAcc.traveuseAcc.seq arr: 0.105716 CPU 0.105501s TOTAL
>>>>> useArrayAsync
<<<<< useArrayAsync: 1.234e-3 CPU 0.001505s TOTAL
<<<<< compileOpenAcc.traveuseAcc.use2: 0.108012 CPU 0.108015s TOTAL
<<<<< compileOpenAcc.traveuseAcc.use3: 0.108539 CPU 0.108663s TOTAL
<<<<< compileOpenAcc.traveuseAcc.Use: 0.109375 CPU 0.109005s TOTAL
>>>>> compileOpenAcc.traveuseAcc.Fold1
>>>>> compileOpenAcc.traveuseAcc.Avar
<<<<< compileOpenAcc.traveuseAcc.Avar: 0.0 CPU 0.000001s TOTAL
>>>>> compileOpenAcc.traveuseAcc.Avar
<<<<< compileOpenAcc.traveuseAcc.Avar: 0.0 CPU 0s TOTAL
>>>>> compileOpenAcc.traveuseAcc.Avar
<<<<< compileOpenAcc.traveuseAcc.Avar: 0.0 CPU 0.000001s TOTAL
>>>>> compileOpenAcc.traveuseAcc.Avar
<<<<< compileOpenAcc.traveuseAcc.Avar: 0.0 CPU 0s TOTAL
<<<<< compileOpenAcc.traveuseAcc.Fold1: 2.059e-3 CPU 0.002384s TOTAL
<<<<< compileOpenAcc.traveuseAcc.Alet: 0.111434 CPU 0.112034s TOTAL
<<<<< compileOpenAcc: 0.11197 CPU 0.112615s TOTAL
<<<<< runAsyncIn.compileAcc: 0.11197 CPU 0.112833s TOTAL
>>>>> runAsyncIn.dumpStats
<<<<< runAsyncIn.dumpStats: 2.0e-6 CPU 0.000001s TOTAL
>>>>> runAsyncIn.executeAcc
>>>>> executeAcc
<<<<< executeAcc: 8.96e-4 CPU 0.00049s TOTAL
<<<<< runAsyncIn.executeAcc: 9.36e-4 CPU 0.0007s TOTAL
>>>>> runAsyncIn.collect
<<<<< runAsyncIn.collect: 0.0 CPU 0.000027s TOTAL
<<<<< evalStateT: 0.114156 CPU 0.115327s TOTAL
>>>>> pop
<<<<< pop: 0.0 CPU 0.000002s TOTAL
>>>>> performGC
<<<<< performGC: 5.7246e-2 CPU 0.057814s TOTAL
<<<<< evalCUDA: 0.17295 CPU 0.173943s TOTAL
<<<<< runAsyncIn.execute: 0.215475 CPU 0.216563s TOTAL
<<<<< run: 0.215523 CPU 0.216771s TOTAL
Array (Z) [1000001.0]
0.217148s
Prelude Main> Leaving GHCi.
$ ghc -O2 -threaded Main.hs && ./Main
[1 of 1] Compiling Main ( Main.hs, Main.o )
Linking Main ...
>>>>> run
>>>>> runAsyncIn.execute
>>>>> runAsyncIn.seq ctx
<<<<< runAsyncIn.seq ctx: 4.0639e-2 CPU 0.041498s TOTAL
>>>>> runAsyncIn.seq a
<<<<< runAsyncIn.seq a: 1.0e-6 CPU 0.000001s TOTAL
>>>>> runAsyncIn.seq acc
>>>>> convertAccWith True
<<<<< convertAccWith: 1.2e-5 CPU 0.000005s TOTAL
<<<<< runAsyncIn.seq acc: 1.15e-4 CPU 0.000061s TOTAL
>>>>> evalCUDA
>>>>> push
<<<<< push: 2.0e-6 CPU 0.000002s TOTAL
>>>>> evalStateT
>>>>> runAsyncIn.compileAcc
>>>>> compileOpenAcc
>>>>> compileOpenAcc.traveuseAcc.Alet
>>>>> compileOpenAcc.traveuseAcc.Use
>>>>> compileOpenAcc.traveuseAcc.use3
>>>>> compileOpenAcc.traveuseAcc.use1
<<<<< compileOpenAcc.traveuseAcc.use1: 0.0 CPU 0.000001s TOTAL
>>>>> compileOpenAcc.traveuseAcc.use2
>>>>> compileOpenAcc.traveuseAcc.seq arr
<<<<< compileOpenAcc.traveuseAcc.seq arr: 3.6651e-2 CPU 0.03712s TOTAL
>>>>> useArrayAsync
<<<<< useArrayAsync: 1.427e-3 CPU 0.001427s TOTAL
<<<<< compileOpenAcc.traveuseAcc.use2: 3.8776e-2 CPU 0.039152s TOTAL
<<<<< compileOpenAcc.traveuseAcc.use3: 3.8794e-2 CPU 0.039207s TOTAL
<<<<< compileOpenAcc.traveuseAcc.Use: 3.8808e-2 CPU 0.03923s TOTAL
>>>>> compileOpenAcc.traveuseAcc.Fold1
>>>>> compileOpenAcc.traveuseAcc.Avar
<<<<< compileOpenAcc.traveuseAcc.Avar: 2.0e-6 CPU 0.000001s TOTAL
>>>>> compileOpenAcc.traveuseAcc.Avar
<<<<< compileOpenAcc.traveuseAcc.Avar: 2.0e-6 CPU 0.000001s TOTAL
>>>>> compileOpenAcc.traveuseAcc.Avar
<<<<< compileOpenAcc.traveuseAcc.Avar: 0.0 CPU 0.000001s TOTAL
>>>>> compileOpenAcc.traveuseAcc.Avar
<<<<< compileOpenAcc.traveuseAcc.Avar: 0.0 CPU 0.000001s TOTAL
<<<<< compileOpenAcc.traveuseAcc.Fold1: 1.342e-3 CPU 0.001284s TOTAL
<<<<< compileOpenAcc.traveuseAcc.Alet: 4.0197e-2 CPU 0.040578s TOTAL
<<<<< compileOpenAcc: 4.0248e-2 CPU 0.040895s TOTAL
<<<<< runAsyncIn.compileAcc: 4.0834e-2 CPU 0.04103s TOTAL
>>>>> runAsyncIn.dumpStats
<<<<< runAsyncIn.dumpStats: 0.0 CPU 0s TOTAL
>>>>> runAsyncIn.executeAcc
>>>>> executeAcc
<<<<< executeAcc: 2.87e-4 CPU 0.000403s TOTAL
<<<<< runAsyncIn.executeAcc: 2.87e-4 CPU 0.000488s TOTAL
>>>>> runAsyncIn.collect
<<<<< runAsyncIn.collect: 9.2e-5 CPU 0.000049s TOTAL
<<<<< evalStateT: 4.1213e-2 CPU 0.041739s TOTAL
>>>>> pop
<<<<< pop: 0.0 CPU 0.000002s TOTAL
>>>>> performGC
<<<<< performGC: 9.41e-4 CPU 0.000861s TOTAL
<<<<< evalCUDA: 4.3308e-2 CPU 0.042893s TOTAL
<<<<< runAsyncIn.execute: 8.5154e-2 CPU 0.084815s TOTAL
<<<<< run: 8.5372e-2 CPU 0.085035s TOTAL
Array (Z) [1000001.0]
0.085169s
Results:
- evaluating the vector takes 0.121653s under ghci and 0.035162s in the compiled version
- … takes … under ghci and 0.00031s in the compiled version
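The single largest ghci-vs-compiled gap in the traces above is the performGC step (0.057814s under ghci vs 0.000861s compiled). Its cost can be measured in isolation with a small program like the following sketch (base's System.Mem plus the time package; the numbers will of course vary with machine and RTS settings):

```haskell
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Mem (performGC)

main :: IO ()
main = do
  -- Allocate some garbage so the collection has real work to do.
  let xs = [1 .. 1000000] :: [Double]
  print (sum xs)  -- prints 5.000005e11
  t0 <- getCurrentTime
  performGC       -- a major GC, the same call timed in the traces above
  t1 <- getCurrentTime
  print (diffUTCTime t1 t0)
```

Running this both compiled and inside ghci would show how much of the gap is GC behavior under the interpreter rather than accelerate itself.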
This may be a separate question, but perhaps someone knows: can we tune the garbage collector so that it works faster under ghci?

Can you turn on profiling and get a report?

Do you mean …? I will try it as soon as I can. I have injected some time-measuring code into Data.Array.Accelerate.CUDA.run, and I noticed that when the accelerate library is loaded into ghci, run executes about 3x slower than when it is called from the executable. I tried adding the following pragmas, with no effect: {-# SPECIALIZE run :: Acc (Array DIM2 Double) -> (Array DIM2 Double) #-} {-# SPECIALIZE run :: Acc (Array DIM2 Float) -> (Array DIM2 Float) #-}. Can we somehow optimize this run function for ghci? I have updated the gist and added the profiling report (main.prof).

@ChristianConkle, I tried your test here and injected this rewrite rule into the accelerate-cuda library. Result: isOptimized = True, but the code is still slow.