Approximate cost to access various caches and main memory?


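Can anyone tell me the approximate time (in nanoseconds) to access the L1, L2, and L3 caches, as well as main memory, on Intel i7 processors?


While this is not specifically a programming question, knowing these kinds of speed details is necessary for some low-latency programming challenges.

Numbers everyone should know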

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|
From: originally by Peter Norvig

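This is for the i7 and Xeon range of processors. I should stress that the Intel document has what you need and more (for example, see page 22 for some timings and cycles).

In addition, there are some details on clock cycles and the like. The second link gave the following numbers: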

Core i7 Xeon 5500 Series Data Source Latency (approximate)               [Pg. 22]

local  L1 CACHE hit,                              ~4 cycles (   2.1 -  1.2 ns )
local  L2 CACHE hit,                             ~10 cycles (   5.3 -  3.0 ns )
local  L3 CACHE hit, line unshared               ~40 cycles (  21.4 - 12.0 ns )
local  L3 CACHE hit, shared line in another core ~65 cycles (  34.8 - 19.5 ns )
local  L3 CACHE hit, modified in another core    ~75 cycles (  40.2 - 22.5 ns )

remote L3 CACHE (Ref: Fig.1 [Pg. 5])        ~100-300 cycles ( 160.7 - 30.0 ns )

local  DRAM                                                   ~60 ns
remote DRAM                                                  ~100 ns
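EDIT2:

Most important is the notice under the cited table, which says:

NOTE: These values are rough approximations. They depend on core and uncore frequencies, memory speeds, BIOS settings, number of DIMMs, etc. Your mileage may vary.

EDIT: I should stress that, besides the timing/cycle information, the Intel document above covers many more (extremely) useful details of the i7 and Xeon range of processors (from a performance point of view).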
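The nanosecond ranges in the Intel latency table above appear to be the cycle counts divided by two core clock frequencies. The following is a minimal sketch of that conversion. The two frequencies (about 1.87 GHz and 3.33 GHz) are an assumption inferred from the table's own ranges, not values stated in the answer, and `cycles_to_ns` is a hypothetical helper name.

```python
def cycles_to_ns(cycles: float, core_ghz: float) -> float:
    """Convert a latency given in core clock cycles to nanoseconds."""
    return cycles / core_ghz


# ASSUMPTION: clock frequencies inferred from the table's ns ranges,
# not taken from the Intel document itself.
SLOW_GHZ = 1.8667
FAST_GHZ = 3.3333

# Approximate cycle counts copied from the table above.
LATENCY_CYCLES = {
    "L1 hit": 4,
    "L2 hit": 10,
    "L3 hit, line unshared": 40,
    "L3 hit, shared line in another core": 65,
    "L3 hit, modified in another core": 75,
}

for name, cycles in LATENCY_CYCLES.items():
    slow = cycles_to_ns(cycles, SLOW_GHZ)
    fast = cycles_to_ns(cycles, FAST_GHZ)
    print(f"{name:38s} ~{cycles:3d} cycles ({slow:5.1f} - {fast:4.1f} ns)")
```

Under these assumed frequencies the results agree with the table to within rounding, which suggests the same cycle count costs nearly twice as many nanoseconds on the slowest part as on the fastest one.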

Cost to access various memories in a pretty page
Summary
  • Values have decreased but have stabilized since 2005

            1 ns        L1 cache
            3 ns        Branch mispredict
            4 ns        L2 cache
           17 ns        Mutex lock/unlock
          100 ns        Main memory (RAM)
        2 000 ns (2µs)  1KB Zippy-compress
    
  • Still some improvements, predicted for 2020

       16 000 ns (16µs) SSD random read (olibre's note: should be less)
      500 000 ns (½ms)  Round trip in datacenter
    2 000 000 ns (2ms)  HDD random read (seek)
    
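  • See also other sources
    • What Every Programmer Should Know About Memory, by Ulrich Drepper (2007)
      An old but still excellent in-depth explanation of how memory hardware and software interact.
      • (114 pages)
      • Seven posts on LWN, plus comments
    • A post on codinghorror.com based on a book
    • Click each listed processor to see its L1/L2/L3/RAM/... latencies (for example, L1 = 1 ns, L2 = 3 ns, L3 = 10 ns, RAM = 67 ns, branch misprediction = 4 ns)
    See also: for further understanding, I recommend the excellent presentation (June 2014).

    French speakers may appreciate an article that compares a processor with a developer, both waiting for the information they need to continue working.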

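    For the sake of a 2020 review of the predictions for 2025: over roughly the last 44 years of integrated-circuit technology, classical (non-quantum) processors have evolved, both literally and physically. The last decade has shown that the classical process has come close to some hurdles that have no achievable physical path forward.

    Number of logical cores: can and may grow, but not more than O(n^2~3)
    Frequency [MHz]: has already hit a physics-based ceiling that is hard, if not impossible, to circumvent
    Transistor count: can and may grow, but less than O(n^2~3) (power, noise, clock)
    Power [W]: can grow, but the problems of power distribution and heat dissipation will increase
    Single-thread performance: may grow, benefiting directly from large cache footprints and faster, wider memory I/O, and indirectly from less frequent system-forced context switching, because more cores are available to spread other threads/processes across

    (Credits go to Leonardo Suriano and Karl Rupp)

    For the sake of a 2015 review of the predictions for 2020, and to compare the CPU and GPU latency landscapes: it is not easy to compare even the simplest CPU/cache/DRAM lineups (even in a uniform memory access model). DRAM speed is one factor that determines latency; the other is loaded latency (a saturated system), which dominates and is what enterprise applications will experience far more often than an idle, fully unloaded system.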

                        +----------------------------------- 5,6,7,8,9,..12,15,16 
                        |                               +--- 1066,1333,..2800..3300
                        v                               v
    First  word = ( ( CAS latency * 2 ) + ( 1 - 1 ) ) / Data Rate  
    Fourth word = ( ( CAS latency * 2 ) + ( 4 - 1 ) ) / Data Rate
    Eighth word = ( ( CAS latency * 2 ) + ( 8 - 1 ) ) / Data Rate
                                            ^----------------------- 7x .. difference
    ******************************** 
    So:
    ===
    
    resulting DDR3-side latencies are between _____________
                                              3.03 ns    ^
                                                         |
                                             36.58 ns ___v_ based on DDR3 HW facts
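    The formula above can be evaluated directly. The following is a minimal sketch, assuming the data rate is given in mega-transfers per second (so the quotient is in microseconds and is scaled to nanoseconds); `ddr3_word_latency_ns` is a hypothetical helper name. The two parameter pairs are simply the extremes listed in the diagram and need not correspond to real modules.

```python
def ddr3_word_latency_ns(cas_latency: int, data_rate_mts: float, word_index: int) -> float:
    """Latency until the N-th word of a burst arrives, per the formula above.

    cas_latency   : CAS latency in memory clock cycles (5 .. 16 in the diagram)
    data_rate_mts : data rate in mega-transfers per second (1066 .. 3300)
    word_index    : 1 for the first word, 4 for the fourth, 8 for the eighth
    """
    transfers = cas_latency * 2 + (word_index - 1)
    # transfers / (MT/s) yields microseconds; multiply by 1000 for nanoseconds.
    return transfers / data_rate_mts * 1000.0


# Extremes quoted in the diagram: lowest CAS with highest data rate (first word)
# versus highest CAS with lowest data rate (eighth word).
best = ddr3_word_latency_ns(cas_latency=5, data_rate_mts=3300, word_index=1)
worst = ddr3_word_latency_ns(cas_latency=16, data_rate_mts=1066, word_index=8)
print(f"best case:  {best:.2f} ns")
print(f"worst case: {worst:.2f} ns")
```

    Within rounding, the two results reproduce the 3.03 ns and 36.58 ns bounds shown above.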
    

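    GPU engines have received a lot of technical marketing, while deep internal dependencies are the key to understanding both the real strengths and the real weaknesses these architectures show in practice (typically very different from the expectations raised by aggressive marketing).


    My apologies for the "bigger picture", but latency masking is also fundamentally limited by the on-chip smREG/L1/L2 capacities and hit/miss rates.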
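    The diagram below quotes latencies in GPU clocks together with the number of warps needed to mask them. The following is a rough sketch of that arithmetic. The 1147 MHz clock comes from the diagram; the `issues_per_clock` parameter and both helper names are assumptions introduced here only to reproduce the quoted warp counts, not a model of any real scheduler.

```python
def clocks_to_ns(gpu_clocks: float, clock_mhz: float = 1147.0) -> float:
    """Convert a latency in GPU clocks to nanoseconds at the given clock rate."""
    return gpu_clocks * 1000.0 / clock_mhz


def warps_to_hide(latency_clocks: int, issues_per_clock: int = 1) -> int:
    """Warps that must be ready so that one can issue on every slot
    while another waits out `latency_clocks`.

    ASSUMPTION: a simple occupancy estimate (latency times issue rate).
    """
    return latency_clocks * issues_per_clock


# Register dependency on CC-2.x: about 22 clocks, 22 warps in the diagram.
print(warps_to_hide(22), "warps,", round(clocks_to_ns(22), 1), "ns")

# Spill to local memory: 400 to 800 clocks, quoted as +350 to +700 ns.
for clocks in (400, 800):
    print(clocks, "clocks ->", round(clocks_to_ns(clocks)), "ns,",
          warps_to_hide(clocks), "warps")

# CC-3.0: about 11 clocks but 44 warps in the diagram, which fits an
# assumed 4 issue slots per clock.
print(warps_to_hide(11, issues_per_clock=4), "warps")
```

    The point of the estimate is that the warp counts needed to mask a spill to local memory quickly exceed what the on-chip register and cache capacities can keep resident.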

        |.pci............GPU.|
        |                    | FERMI [GPU-CLK] ~ 0.9 [ns] but THE I/O LATENCIES                                                                  PAR -- ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!]
        |                                                                                                                                               ^^^^^^^^|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [!!]
        |                                                       smREGs________________________________________ penalty +400 ~ +800 [GPU_CLKs] latency ( maskable by 400~800 WARPs ) on <Compile-time>-designed spillover(s) to locMEM__
        |                                                                                                              +350 ~ +700 [ns] @1147 MHz FERMI ^^^^^^^^
        |                                                                                                                          |                    ^^^^^^^^
        |                                                                                                                       +5 [ns] @ 200 MHz FPGA. . . . . . Xilinx/Zync Z7020/FPGA massive-parallel streamline-computing mode ev. PicoBlazer softCPU
        |                                                                                                                          |                    ^^^^^^^^
        |                                                                                                                   ~  +20 [ns] @1147 MHz FERMI ^^^^^^^^
        |                                                             SM-REGISTERs/thread: max  63 for CC-2.x -with only about +22 [GPU_CLKs] latency ( maskable by 22-WARPs ) to hide on [REGISTER DEPENDENCY] when arithmetic result is to be served from previous [INSTR] [G]:10.4, Page-46
        |                                                                                  max  63 for CC-3.0 -          about +11 [GPU_CLKs] latency ( maskable by 44-WARPs ) [B]:5.2.3, Page-73
        |                                                                                  max 128 for CC-1.x                                    PAR -- ||||||||~~~|
        |                                                                                  max 255 for CC-3.5                                    PAR -- ||||||||||||||||||~~~~~~|
        |
        |                                                       smREGs___BW                                 ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE <<  -Xptxas -v          || nvcc -maxrregcount ( w|w/o spillover(s) )
        |                                                                with about 8.0  TB/s BW            [C:Pg.46]
        |                                                                           1.3  TB/s BW shaMEM___  4B * 32banks * 15 SMs * half 1.4GHz = 1.3 TB/s only on FERMI
        |                                                                           0.1  TB/s BW gloMEM___
        |         ________________________________________________________________________________________________________________________________________________________________________________________________________________________
        +========|   DEVICE:3 PERSISTENT                          gloMEM___
        |       _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
        +======|   DEVICE:2 PERSISTENT                          gloMEM___
        |     _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
        +====|   DEVICE:1 PERSISTENT                          gloMEM___
        |   _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
        +==|   DEVICE:0 PERSISTENT                          gloMEM_____________________________________________________________________+440 [GPU_CLKs]_________________________________________________________________________|_GB|
        !  |                                                         |\                                                                +                                                                                           |
        o  |                                                texMEM___|_\___________________________________texMEM______________________+_______________________________________________________________________________________|_MB|
           |                                                         |\ \                                 |\                           +                                               |\                                          |
           |                                              texL2cache_| \ \                               .| \_ _ _ _ _ _ _ _texL2cache +370 [GPU_CLKs] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | \                                   256_KB|
           |                                                         |  \ \                               |  \                         +                                 |\            ^  \                                        |
           |                                                         |   \ \                              |   \                        +                                 | \           ^   \                                       |
           |                                                         |    \ \                             |    \                       +                                 |  \          ^    \                                      |
           |                                              texL1cache_|     \ \                           .|     \_ _ _ _ _ _texL1cache +260 [GPU_CLKs] _ _ _ _ _ _ _ _ _ |   \_ _ _ _ _^     \                                 5_KB|
           |                                                         |      \ \                           |      \                     +                         ^\      ^    \        ^\     \                                    |
           |                                     shaMEM + conL3cache_|       \ \                          |       \ _ _ _ _ conL3cache +220 [GPU_CLKs]           ^ \     ^     \       ^ \     \                              32_KB|
           |                                                         |        \ \                         |        \       ^\          +                         ^  \    ^      \      ^  \     \                                  |
           |                                                         |         \ \                        |         \      ^ \         +                         ^   \   ^       \     ^   \     \                                 |
           |                                   ______________________|__________\_\_______________________|__________\_____^__\________+__________________________________________\_________\_____\________________________________|
           |                  +220 [GPU-CLKs]_|           |_ _ _  ___|\          \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ _+220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache
           | L2-on-re-use-only +80 [GPU-CLKs]_| 64 KB  L2_|_ _ _   __|\\          \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache
           | L1-on-re-use-only +40 [GPU-CLKs]_|  8 KB  L1_|_ _ _    _|\\\          \_\__________________________________\________\_____+ 40 [GPU_CLKs]_____________________________________________________________________________|
           | L1-on-re-use-only + 8 [GPU-CLKs]_|  2 KB  L1_|__________|\\\\__________\_\__________________________________\________\____+  8 [GPU_CLKs]_________________________________________________________conL1cache      2_KB|
           |     on-chip|smREG +22 [GPU-CLKs]_|           |t[0_______^:~~~~~~~~~~~~~~~~\:________]
           |CC-  MAX    |_|_|_|_|_|_|_|_|_|_|_|           |t[1_______^                  :________]
           |2.x   63    |_|_|_|_|_|_|_|_|_|_|_|           |t[2_______^                  :________] 
           |1.x  128    |_|_|_|_|_|_|_|_|_|_|_|           |t[3_______^                  :________]
           |3.5  255 REGISTERs|_|_|_|_|_|_|_|_|           |t[4_______^                  :________]
           |         per|_|_|_|_|_|_|_|_|_|_|_|           |t[5_______^                  :________]
           |         Thread_|_|_|_|_|_|_|_|_|_|           |t[6_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[7_______^     1stHalf-WARP :________]______________
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ 8_______^:~~~~~~~~~~~~~~~~~:________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ 9_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ A_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ B_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ C_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ D_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ E_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|       W0..|t[ F_______^____________WARP__:________]_____________
           |            |_|_|_|_|_|_|_|_|_|_|_|         ..............             
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[0_______^:~~~~~~~~~~~~~~~\:________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[1_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[2_______^                 :________] 
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[3_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[4_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[5_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[6_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[7_______^    1stHalf-WARP :________]______________
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ 9_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ A_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ B_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ C_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ D_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ E_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|       W1..............|t[ F_______^___________WARP__:________]_____________
           |            |_|_|_|_|_|_|_|_|_|_|_|         ....................................................
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[0_______^:~~~~~~~~~~~~~~~\:________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[1_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[2_______^                 :________] 
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[3_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[4_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[5_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[6_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[7_______^    1stHalf-WARP :________]______________
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ 9_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ A_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ B_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ C_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ D_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ E_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|tBlock Wn....................................................|t[ F_______^___________WARP__:________]_____________
           |
           |                   ________________          °°°°°°°°°°°°°°°°°°°°°°°°°°~~~~~~~~~~°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
           |                  /                \   CC-2.0|||||||||||||||||||||||||| ~masked  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
           |                 /                  \  1.hW  ^|^|^|^|^|^|^|^|^|^|^|^|^| <wait>-s ^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|
           |                /                    \ 2.hW  |^|^|^|^|^|^|^|^|^|^|^|^|^          |^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^
           |_______________/                      \______I|I|I|I|I|I|I|I|I|I|I|I|I|~~~~~~~~~~I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|
           |~~~~~~~~~~~~~~/ SM:0.warpScheduler    /~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~~~~~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I
           |              \          |           //
           |               \         RR-mode    //
           |                \    GREEDY-mode   //
           |                 \________________//
           |                   \______________/SM:0__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:1__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:2__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:3__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:4__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:5__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:6__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:7__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:8__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:9__________________________________________________________________________________
           |                                ..|SM:A      |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:B      |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:C      |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:D      |t[ F_______^___________WARP__:________]_______
           |                                  |_______________________________________________________________________________________
           */
    
       1 ns _________ LETS SETUP A TIME/DISTANCE SCALE FIRST:
              °      ^
              |\     |a 1 ft-distance a photon travels in vacuum ( less in dark-fibre )
              | \    |
              |  \   |
            __|___\__v____________________________________________________
              |    |
              |<-->|  a 1 ns TimeDOMAIN "distance", before a photon arrived
              |    |
              ^    v 
        DATA  |    |DATA
        RQST'd|    |RECV'd ( DATA XFER/FETCH latency )
    
      25 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor REGISTER access
      35 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor    L1-onHit-[--8kB]CACHE
    
      70 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor SHARED-MEM access
    
     230 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor texL1-onHit-[--5kB]CACHE
     320 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor texL2-onHit-[256kB]CACHE
    
     350 ns
     700 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor GLOBAL-MEM access
     - - - - -
    
        +====================| + 11-12 [usec] XFER-LATENCY-up   HostToDevice    ~~~ same as Intel X48 / nForce 790i
        |   |||||||||||||||||| + 10-11 [usec] XFER-LATENCY-down DeviceToHost
        |   |||||||||||||||||| ~  5.5 GB/sec XFER-BW-up                         ~~~ same as DDR2/DDR3 throughput
        |   |||||||||||||||||| ~  5.2 GB/sec XFER-BW-down @8192 KB TEST-LOAD      ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3]
        |                       
        |              Host-side
        |                                                        cudaHostRegister(   void *ptr, size_t size, unsigned int flags )
        |                                                                                                                 | +-------------- cudaHostRegisterPortable -- marks memory as PINNED MEMORY for all CUDA Contexts, not just the one, current, when the allocation was performed
        |                        ___HostAllocWriteCombined_MEM / cudaHostFree()                                           +---------------- cudaHostRegisterMapped   -- maps  memory allocation into the CUDA address space ( the Device pointer can be obtained by a call to cudaHostGetDevicePointer( void **pDevice, void *pHost, unsigned int flags=0 ); )
        |                        ___HostRegisterPORTABLE___MEM / cudaHostUnregister( void *ptr )
        |   ||||||||||||||||||
        |   ||||||||||||||||||
        |   | PCIe-2.0 ( 4x) | ~ 4 GB/s over  4-Lanes ( PORT #2  )
        |   | PCIe-2.0 ( 8x) | ~ 8 GB/s over  8-Lanes
        |   | PCIe-2.0 (16x) | ~16 GB/s over 16-Lanes ( mode 16x ), totals for both directions
        |
        |   + PCIe-3.0 25-port 97-lanes non-blocking SwitchFabric ... +over copper/fiber
        |                                                                       ~~~ The latest PCIe specification, Gen 3, runs at 8Gbps per serial lane, enabling a 48-lane switch to handle a whopping 96 GBytes/sec. of full duplex peer to peer traffic. [I:]
        |
        | ~810 [ns]    + InRam-"Network" / many-to-many parallel CPU/Memory "message" passing with less than 810 ns latency any-to-any
        |
        |   ||||||||||||||||||
        |   ||||||||||||||||||
        +====================|
        |.pci............HOST|
    
        |.pci............GPU.|
        |                    | FERMI [GPU-CLK] ~ 0.9 [ns] but THE I/O LATENCIES                                                                  PAR -- ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!]
        |                                                                                                                                               ^^^^^^^^|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [!!]
        |                                                       smREGs________________________________________ penalty +400 ~ +800 [GPU_CLKs] latency ( maskable by 400~800 WARPs ) on <Compile-time>-designed spillover(s) to locMEM__
        |                                                                                                              +350 ~ +700 [ns] @1147 MHz FERMI ^^^^^^^^
        |                                                                                                                          |                    ^^^^^^^^
        |                                                                                                                       +5 [ns] @ 200 MHz FPGA. . . . . . Xilinx/Zync Z7020/FPGA massive-parallel streamline-computing mode ev. PicoBlazer softCPU
        |                                                                                                                          |                    ^^^^^^^^
        |                                                                                                                   ~  +20 [ns] @1147 MHz FERMI ^^^^^^^^
        |                                                             SM-REGISTERs/thread: max  63 for CC-2.x -with only about +22 [GPU_CLKs] latency ( maskable by 22-WARPs ) to hide on [REGISTER DEPENDENCY] when arithmetic result is to be served from previous [INSTR] [G]:10.4, Page-46
        |                                                                                  max  63 for CC-3.0 -          about +11 [GPU_CLKs] latency ( maskable by 44-WARPs ) [B]:5.2.3, Page-73
        |                                                                                  max 128 for CC-1.x                                    PAR -- ||||||||~~~|
        |                                                                                  max 255 for CC-3.5                                    PAR -- ||||||||||||||||||~~~~~~|
        |
        |                                                       smREGs___BW                                 ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE <<  -Xptxas -v          || nvcc -maxrregcount ( w|w/o spillover(s) )
        |                                                                with about 8.0  TB/s BW            [C:Pg.46]
        |                                                                           1.3  TB/s BW shaMEM___  4B * 32banks * 15 SMs * half 1.4GHz = 1.3 TB/s only on FERMI
        |                                                                           0.1  TB/s BW gloMEM___
        |         ________________________________________________________________________________________________________________________________________________________________________________________________________________________
        +========|   DEVICE:3 PERSISTENT                          gloMEM___
        |       _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
        +======|   DEVICE:2 PERSISTENT                          gloMEM___
        |     _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
        +====|   DEVICE:1 PERSISTENT                          gloMEM___
        |   _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
        +==|   DEVICE:0 PERSISTENT                          gloMEM_____________________________________________________________________+440 [GPU_CLKs]_________________________________________________________________________|_GB|
        !  |                                                         |\                                                                +                                                                                           |
        o  |                                                texMEM___|_\___________________________________texMEM______________________+_______________________________________________________________________________________|_MB|
           |                                                         |\ \                                 |\                           +                                               |\                                          |
           |                                              texL2cache_| \ \                               .| \_ _ _ _ _ _ _ _texL2cache +370 [GPU_CLKs] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | \                                   256_KB|
           |                                                         |  \ \                               |  \                         +                                 |\            ^  \                                        |
           |                                                         |   \ \                              |   \                        +                                 | \           ^   \                                       |
           |                                                         |    \ \                             |    \                       +                                 |  \          ^    \                                      |
           |                                              texL1cache_|     \ \                           .|     \_ _ _ _ _ _texL1cache +260 [GPU_CLKs] _ _ _ _ _ _ _ _ _ |   \_ _ _ _ _^     \                                 5_KB|
           |                                                         |      \ \                           |      \                     +                         ^\      ^    \        ^\     \                                    |
           |                                     shaMEM + conL3cache_|       \ \                          |       \ _ _ _ _ conL3cache +220 [GPU_CLKs]           ^ \     ^     \       ^ \     \                              32_KB|
           |                                                         |        \ \                         |        \       ^\          +                         ^  \    ^      \      ^  \     \                                  |
           |                                                         |         \ \                        |         \      ^ \         +                         ^   \   ^       \     ^   \     \                                 |
           |                                   ______________________|__________\_\_______________________|__________\_____^__\________+__________________________________________\_________\_____\________________________________|
           |                  +220 [GPU_CLKs]_|           |_ _ _  ___|\          \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ _+220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache
           | L2-on-re-use-only +80 [GPU_CLKs]_| 64 KB  L2_|_ _ _   __|\\          \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache
           | L1-on-re-use-only +40 [GPU_CLKs]_|  8 KB  L1_|_ _ _    _|\\\          \_\__________________________________\________\_____+ 40 [GPU_CLKs]_____________________________________________________________________________|
           | L1-on-re-use-only + 8 [GPU_CLKs]_|  2 KB  L1_|__________|\\\\__________\_\__________________________________\________\____+  8 [GPU_CLKs]_________________________________________________________conL1cache      2_KB|
           |     on-chip|smREG +22 [GPU_CLKs]_|           |t[0_______^:~~~~~~~~~~~~~~~~\:________]
           |CC-  MAX    |_|_|_|_|_|_|_|_|_|_|_|           |t[1_______^                  :________]
           |2.x   63    |_|_|_|_|_|_|_|_|_|_|_|           |t[2_______^                  :________] 
           |1.x  128    |_|_|_|_|_|_|_|_|_|_|_|           |t[3_______^                  :________]
           |3.5  255 REGISTERs|_|_|_|_|_|_|_|_|           |t[4_______^                  :________]
           |         per|_|_|_|_|_|_|_|_|_|_|_|           |t[5_______^                  :________]
           |         Thread_|_|_|_|_|_|_|_|_|_|           |t[6_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[7_______^     1stHalf-WARP :________]______________
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ 8_______^:~~~~~~~~~~~~~~~~~:________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ 9_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ A_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ B_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ C_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ D_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ E_______^                  :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|       W0..|t[ F_______^____________WARP__:________]_____________
           |            |_|_|_|_|_|_|_|_|_|_|_|         ..............             
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[0_______^:~~~~~~~~~~~~~~~\:________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[1_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[2_______^                 :________] 
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[3_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[4_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[5_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[6_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[7_______^    1stHalf-WARP :________]______________
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ 9_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ A_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ B_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ C_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ D_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ E_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|       W1..............|t[ F_______^___________WARP__:________]_____________
           |            |_|_|_|_|_|_|_|_|_|_|_|         ....................................................
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[0_______^:~~~~~~~~~~~~~~~\:________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[1_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[2_______^                 :________] 
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[3_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[4_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[5_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[6_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[7_______^    1stHalf-WARP :________]______________
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ 9_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ A_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ B_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ C_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ D_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|          ...................................................|t[ E_______^                 :________]
           |            |_|_|_|_|_|_|_|_|_|_|_|tBlock Wn....................................................|t[ F_______^___________WARP__:________]_____________
           |
           |                   ________________          °°°°°°°°°°°°°°°°°°°°°°°°°°~~~~~~~~~~°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
           |                  /                \   CC-2.0|||||||||||||||||||||||||| ~masked  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
           |                 /                  \  1.hW  ^|^|^|^|^|^|^|^|^|^|^|^|^| <wait>-s ^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|
           |                /                    \ 2.hW  |^|^|^|^|^|^|^|^|^|^|^|^|^          |^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^
           |_______________/                      \______I|I|I|I|I|I|I|I|I|I|I|I|I|~~~~~~~~~~I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|
           |~~~~~~~~~~~~~~/ SM:0.warpScheduler    /~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~~~~~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I
           |              \          |           //
           |               \         RR-mode    //
           |                \    GREEDY-mode   //
           |                 \________________//
           |                   \______________/SM:0__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:1__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:2__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:3__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:4__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:5__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:6__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:7__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:8__________________________________________________________________________________
           |                                  |           |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:9__________________________________________________________________________________
           |                                ..|SM:A      |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:B      |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:C      |t[ F_______^___________WARP__:________]_______
           |                                ..|SM:D      |t[ F_______^___________WARP__:________]_______
           |                                  |_______________________________________________________________________________________
           */
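
The diagram quotes latencies in [GPU_CLKs], which only become nanoseconds once a clock rate is fixed. As a rough sketch of that arithmetic (Python is used here just as a calculator; the helper names `clks_to_ns` and `shared_mem_bandwidth_tb_s` are made up for illustration, the tier labels are one reading of the diagram, and the 1147 MHz and 0.7 GHz clocks are simply the values quoted in it), the conversion and the shaMEM bandwidth estimate could be reproduced like this:

```python
# Illustrative only: all figures are copied from the diagram above
# (Fermi-class GPU); helper names and tier labels are assumptions.

GPU_CLOCK_MHZ = 1147  # clock quoted next to the "+20 [ns]" register note

# Latency tiers as read off the diagram, in GPU clock cycles.
LATENCY_CLKS = {
    "conL1cache (2 KB)": 8,
    "smREG dependency": 22,
    "L1 on re-use (8 KB)": 40,
    "L2 on re-use (64 KB)": 80,
    "shaMEM + conL3cache": 220,
    "texL1cache": 260,
    "texL2cache": 370,
    "gloMEM": 440,
}


def clks_to_ns(clks, clock_mhz=GPU_CLOCK_MHZ):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return clks * 1000.0 / clock_mhz


def shared_mem_bandwidth_tb_s(bytes_per_bank=4, banks=32, sms=15, clock_ghz=0.7):
    """Aggregate shared-memory bandwidth in TB/s.

    Mirrors the diagram's own formula:
    4 B * 32 banks * 15 SMs * half of 1.4 GHz.
    """
    return bytes_per_bank * banks * sms * clock_ghz / 1000.0


if __name__ == "__main__":
    for tier, clks in sorted(LATENCY_CLKS.items(), key=lambda kv: kv[1]):
        print(f"{tier:<24} {clks:>4} clks  ~{clks_to_ns(clks):6.1f} ns")
    print(f"shaMEM bandwidth ~{shared_mem_bandwidth_tb_s():.2f} TB/s")
```

At 1147 MHz the 22-cycle register dependency works out to a little over 19 ns, which the diagram rounds to about 20 ns. On a different GPU both the clock and the cycle counts change, so these numbers should be treated as Fermi-specific.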