Memory: How can I obtain/calculate my GPU's memory latency without measuring it?


The cudaGetDeviceProperties() API call does not seem to tell us much about global-memory latency (not even a typical value, or a min/max pair).

Edit: When I say latency, I actually mean the various latencies of reads that have to be served from main device memory, under different circumstances. So, taking the different cases, it is actually 6 numbers: {TLB L1 hit, TLB L2 hit, TLB miss} x L1 data cache {on, off}.

Q1: Is there any way to obtain these numbers other than measuring them myself?
Even a rule-of-thumb calculation based on the SM version, SM clock and memory clock would do.
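As a minimal illustration of what such a rule-of-thumb calculation could look like: given a latency figure in GPU clock cycles (the ~600-cycle value below is an assumption taken from published Fermi-era microbenchmarks, not anything the API reports) and the SM clock from cudaGetDeviceProperties(), the wall-clock latency follows directly:

```python
# Hedged sketch: convert a published global-memory latency expressed in
# GPU clock cycles into nanoseconds, using the SM clock reported by
# cudaGetDeviceProperties() (the clockRate field is in kHz).

def latency_ns(latency_cycles, clock_rate_khz):
    """Wall-clock latency for a given cycle count at a given SM clock."""
    clock_hz = clock_rate_khz * 1_000
    return latency_cycles / clock_hz * 1e9

# Assumed Fermi-era figure: ~600 GPU_CLKs global-memory latency at 1147 MHz
print(round(latency_ns(600, 1_147_000)), "ns")  # -> 523 ns
```

This only converts units, of course; it does not answer which cycle count applies to which of the six cases above.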

I was going to ask a second question as well, namely:
Q2: If not, is there a utility that does this for you?

(Though that may be off-topic for this site.)

Like the equivalent
cpuid
functionality on x86 CPUs, the purpose of
cudaDeviceProperties()
is to return relevant microarchitectural parameters. As on CPUs, performance characteristics on GPUs can differ even when the microarchitectural parameters are identical, for example due to different clock frequencies, or different specifications of the attached DRAM, and the way these interact with the various buffering and caching mechanisms inside the processor. In general, there is no single "memory latency" that could be assigned, nor am I aware of a way to compute even a likely range from known microarchitectural parameters.

Therefore, on CPUs and GPUs alike, one has to employ sophisticated microbenchmarks to determine performance parameters such as DRAM latency. How to construct such a microbenchmark for each desired parameter is too broad a topic to cover here. Multiple papers have been published that discuss this in detail for NVIDIA GPUs. One of the earliest relevant publications is:

"Demystifying GPU Microarchitecture through Microbenchmarking," in Proceedings of the 2010 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 235-246

A more recent work that covers the Kepler architecture is:

Xinxin Mei, Xiaowen Chu, "Dissecting GPU Memory Hierarchy through Microbenchmarking," ArXiv manuscript, September 2015, pp. 1-14

Other than constructing microbenchmarks of your own, you would have to rely on published results, such as the ones mentioned above, for the various implementation-specific performance parameters of particular GPUs.
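For reference, the core technique behind such latency microbenchmarks is a dependent pointer chase: each load's address is the result of the previous load, so successive accesses cannot overlap and the average time per step approximates the latency of the memory level being exercised. A minimal host-side Python sketch of the idea (the GPU versions in the papers above implement the same dependency structure inside CUDA kernels, varying the array size and stride to target each cache/TLB level):

```python
import random
import time

def make_chain(n):
    """Build a successor map for a random single-cycle permutation:
    chain[i] holds the next index to visit, so every access depends
    on the result of the previous one."""
    perm = list(range(n))
    random.shuffle(perm)
    chain = [0] * n
    for i in range(n):
        chain[perm[i]] = perm[(i + 1) % n]
    return chain

def chase(chain, steps):
    """Traverse the chain; returns the average time per dependent access."""
    idx = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        idx = chain[idx]   # next address depends on this load's result
    return (time.perf_counter() - t0) / steps

chain = make_chain(1 << 16)
print(f"~{chase(chain, 100_000) * 1e9:.0f} ns per dependent access")
```

On a GPU one would additionally have to control L1 behavior (e.g. compile-time cache modifiers) and array footprint to separate the TLB-hit/TLB-miss cases the question asks about.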


In many years of working on optimizations for GPU platforms, I have not found a need for knowledge of this kind of data; generally, the performance metrics of the CUDA profiler should be sufficient to track down specific bottlenecks.

A1: A good read on GPU microarchitecture, including the data you ask about, is in

Fig. 4.1
[symbolic multi-SM architecture] and
Table 4-1. PTX instruction categories, launch and execution cycles.

These ought to be borne in mind whenever designing the cost/benefit equation, before even starting to build a GPU kernel: whether a CPU-GPU-CPU pipeline would manage to get the job done faster.


A2:
A GPU instruction-code simulator is one way to obtain the expected number of
GPU-CLK
s a particular microarchitecture has to spend (at a minimum, since additional GPU kernels may be using hardware resources concurrently, thereby extending the end-to-end latency observed in vivo).

The numbers matter:

   Category                     GPU
   |                            Hardware
   |                            Unit
   |                            |            Throughput
   |                            |            |               Execution
   |                            |            |               Latency
   |                            |            |               |                  PTX instructions                                                      Note 
   |____________________________|____________|_______________|__________________|_____________________________________________________________________
   Load_shared                  LSU          2               +  30              ld, ldu                                                               Note, .ss = .shared ; .vec and .type determine the size of load. Note also that we omit .cop since no cacheable in Ocelot
   Load_global                  LSU          2               + 600              ld, ldu, prefetch, prefetchu                                          Note, .ss = .global; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
   Load_local                   LSU          2               + 600              ld, ldu, prefetch, prefetchu                                          Note, .ss = .local; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
   Load_const                   LSU          2               + 600              ld, ldu                                                               Note, .ss = .const; .vec and .type determine the size of load
   Load_param                   LSU          2               +  30              ld, ldu                                                               Note, .ss = .param; .vec and .type determine the size of load
   |                            |                              
   Store_shared                 LSU          2               +  30              st                                                                    Note, .ss = .shared; .vec and .type determine the size of store
   Store_global                 LSU          2               + 600              st                                                                    Note, .ss = .global; .vec and .type determine the size of store
   Store_local                  LSU          2               + 600              st                                                                    Note, .ss = .local; .vec and .type determine the size of store
   Read_modify_write_shared     LSU          2               + 600              atom, red                                                             Note, .space = shared; .type determine the size
   Read_modify_write_global     LSU          2               + 600              atom, red                                                             Note, .space = global; .type determine the size
   |                            |                              
   Texture                      LSU          2               + 600              tex, txq, suld, sust, sured, suq
   |                            |                              
   Integer                      ALU          2               +  24              add, sub, add.cc, addc, sub.cc, subc, mul, mad, mul24, mad24, sad, div, rem, abs, neg, min, max, popc, clz, bfind, brev, bfe, bfi, prmt, mov
   |                            |                                                                                                                     Note, these integer inst. with type = { .u16, .u32, .u64, .s16, .s32, .s64 };
   |                            |                              
   Float_single                 ALU          2               +  24              testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max     Note, these Float-single inst. with type = { .f32 };
   Float_double                 ALU          1               +  48              testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max     Note, these Float-double inst. with type = { .f64 };
   Special_single               SFU          8               +  48              rcp, sqrt, rsqrt, sin, cos, lg2, ex2                                  Note, these special-single with type = { .f32 };
   Special_double               SFU          8               +  72              rcp, sqrt, rsqrt, sin, cos, lg2, ex2                                  Note, these special-double with type = { .f64 };
   |                                                           
   Logical                      ALU          2               +  24              and, or, xor, not, cnot, shl, shr
   Control                      ALU          2               +  24              bra, call, ret, exit
   |                                                           
   Synchronization              ALU          2               +  24              bar, membar, vote
   Compare & Select             ALU          2               +  24              set, setp, selp, slct
   |                                                           
   Conversion                   ALU          2               +  24              isspacep, cvta, cvt
   Miscellanies                 ALU          2               +  24              brkpt, pmevent, trap
   Video                        ALU          2               +  24              vadd, vsub, vabsdiff, vmin, vmax, vshl, vshr, vmad, vset
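Using the execution-latency column of the table above, the cycle figures translate into wall-clock time at a given SM clock; the 1147 MHz value below is an assumption matching the Fermi device listed further down, and real pipelines overlap these latencies across warps, so this is a sketch of per-instruction cost, not observed throughput:

```python
# Hedged sketch: per-class execution latencies (GPU_CLKs, taken from the
# Ocelot-style table above) converted to ns at an assumed 1147 MHz SM clock.
CLOCK_HZ = 1_147e6

LATENCY_CLKS = {           # "Execution Latency" column above
    "Load_shared":    30,
    "Load_global":   600,
    "Integer":        24,
    "Float_double":   48,
    "Special_double": 72,
}

for name, clks in LATENCY_CLKS.items():
    print(f"{name:15s} {clks:4d} GPU_CLKs ~ {clks / CLOCK_HZ * 1e9:6.1f} ns")
```

The 20x gap between a shared-memory load and a global-memory load is the figure that usually dominates the cost/benefit equation mentioned above.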

|
+====================| + 11-12 [usec] XFER-LATENCY-up   HostToDevice    ~~~ same as Intel X48 / nForce 790i
|   |||||||||||||||||| + 10-11 [usec] XFER-LATENCY-down DeviceToHost
|   |||||||||||||||||| ~  5.5 GB/sec XFER-BW-up                         ~~~ same as DDR2/DDR3 throughput
|   |||||||||||||||||| ~  5.2 GB/sec XFER-BW-down @8192 KB TEST-LOAD      ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3]
|   ||||||||||||||||||
|   | PCIe-2.0 ( 4x) | ~ 4 GB/s over  4-Lanes ( PORT #2  )
|   | PCIe-2.0 ( 8x) | ~16 GB/s over  8-Lanes
|   | PCIe-2.0 (16x) | ~32 GB/s over 16-Lanes ( mode 16x )
|   ||||||||||||||||||
+====================|
|                                                                                                                                        PAR -- ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!]
|                                                       smREGs________________________________________ penalty +400 ~ +800 [GPU_CLKs] latency ( maskable by 400~800 WARPs ) on <Compile-time>-designed spillover(s) to locMEM__
|                                                                                                              +350 ~ +700 [ns] @1147 MHz FERMI ^^^^^^^^
|                                                                                                                          |                    ^^^^^^^^
|                                                                                                                       +5 [ns] @ 200 MHz FPGA. . . . . . Xilinx/Zync Z7020/FPGA massive-parallel streamline-computing mode ev. PicoBlazer softCPU
|                                                                                                                          |                    ^^^^^^^^
|                                                                                                                   ~  +20 [ns] @1147 MHz FERMI ^^^^^^^^
|                                                             SM-REGISTERs/thread: max  63 for CC-2.x -with only about +22 [GPU_CLKs] latency ( maskable by 22-WARPs ) to hide on [REGISTER DEPENDENCY] when arithmetic result is to be served from previous [INSTR] [G]:10.4, Page-46
|                                                                                  max  63 for CC-3.0 -          about +11 [GPU_CLKs] latency ( maskable by 44-WARPs ) [B]:5.2.3, Page-73
|                                                                                  max 128 for CC-1.x                                    PAR -- ||||||||~~~|
|                                                                                  max 255 for CC-3.5                                    PAR -- ||||||||||||||||||~~~~~~|
|
|                                                       smREGs___BW                                 ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE <<  -Xptxas -v          || nvcc -maxrregcount ( w|w/o spillover(s) )
|                                                                with about 8.0  TB/s BW            [C:Pg.46]
|                                                                           1.3  TB/s BW shaMEM___  4B * 32banks * 15 SMs * half 1.4GHz = 1.3 TB/s only on FERMI
|                                                                           0.1  TB/s BW gloMEM___
|         ________________________________________________________________________________________________________________________________________________________________________________________________________________________
+========|   DEVICE:3 PERSISTENT                          gloMEM___
|       _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+======|   DEVICE:2 PERSISTENT                          gloMEM___
|     _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+====|   DEVICE:1 PERSISTENT                          gloMEM___
|   _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+==|   DEVICE:0 PERSISTENT                          gloMEM_____________________________________________________________________+440 [GPU_CLKs]_________________________________________________________________________|_GB|
!  |                                                         |\                                                                +                                                                                           |
o  |                                                texMEM___|_\___________________________________texMEM______________________+_______________________________________________________________________________________|_MB|
   |                                                         |\ \                                 |\                           +                                               |\                                          |
   |                                              texL2cache_| \ \                               .| \_ _ _ _ _ _ _ _texL2cache +370 [GPU_CLKs] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | \                                   256_KB|
   |                                                         |  \ \                               |  \                         +                                 |\            ^  \                                        |
   |                                                         |   \ \                              |   \                        +                                 | \           ^   \                                       |
   |                                                         |    \ \                             |    \                       +                                 |  \          ^    \                                      |
   |                                              texL1cache_|     \ \                           .|     \_ _ _ _ _ _texL1cache +260 [GPU_CLKs] _ _ _ _ _ _ _ _ _ |   \_ _ _ _ _^     \                                 5_KB|
   |                                                         |      \ \                           |      \                     +                         ^\      ^    \        ^\     \                                    |
   |                                     shaMEM + conL3cache_|       \ \                          |       \ _ _ _ _ conL3cache +220 [GPU_CLKs]           ^ \     ^     \       ^ \     \                              32_KB|
   |                                                         |        \ \                         |        \       ^\          +                         ^  \    ^      \      ^  \     \                                  |
   |                                                         |         \ \                        |         \      ^ \         +                         ^   \   ^       \     ^   \     \                                 |
   |                                   ______________________|__________\_\_______________________|__________\_____^__\________+__________________________________________\_________\_____\________________________________|
   |                  +220 [GPU-CLKs]_|           |_ _ _  ___|\          \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ _+220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache
   | L2-on-re-use-only +80 [GPU-CLKs]_| 64 KB  L2_|_ _ _   __|\\          \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache
   | L1-on-re-use-only +40 [GPU-CLKs]_|  8 KB  L1_|_ _ _    _|\\\          \_\__________________________________\________\_____+ 40 [GPU_CLKs]_____________________________________________________________________________|
   | L1-on-re-use-only + 8 [GPU-CLKs]_|  2 KB  L1_|__________|\\\\__________\_\__________________________________\________\____+  8 [GPU_CLKs]_________________________________________________________conL1cache      2_KB|
   |     on-chip|smREG +22 [GPU-CLKs]_|           |t[0_______^:~~~~~~~~~~~~~~~~\:________]
   |CC-  MAX    |_|_|_|_|_|_|_|_|_|_|_|           |t[1_______^                  :________]
   |2.x   63    |_|_|_|_|_|_|_|_|_|_|_|           |t[2_______^                  :________] 
   |1.x  128    |_|_|_|_|_|_|_|_|_|_|_|           |t[3_______^                  :________]
   |3.5  255 REGISTERs|_|_|_|_|_|_|_|_|           |t[4_______^                  :________]
   |         per|_|_|_|_|_|_|_|_|_|_|_|           |t[5_______^                  :________]
   |         Thread_|_|_|_|_|_|_|_|_|_|           |t[6_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[7_______^     1stHalf-WARP :________]______________
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ 8_______^:~~~~~~~~~~~~~~~~~:________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ 9_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ A_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ B_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ C_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ D_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ E_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|       W0..|t[ F_______^____________WARP__:________]_____________
   |            |_|_|_|_|_|_|_|_|_|_|_|         ..............             
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[0_______^:~~~~~~~~~~~~~~~\:________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[1_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[2_______^                 :________] 
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[3_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[4_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[5_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[6_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[7_______^    1stHalf-WARP :________]______________
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ 9_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ A_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ B_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ C_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ D_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ E_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|       W1..............|t[ F_______^___________WARP__:________]_____________
   |            |_|_|_|_|_|_|_|_|_|_|_|tBlock Wn....................................................|t[ F_______^___________WARP__:________]_____________
   |
   |                   ________________          °°°°°°°°°°°°°°°°°°°°°°°°°°~~~~~~~~~~°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
   |                  /                \   CC-2.0|||||||||||||||||||||||||| ~masked  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
   |                 /                  \  1.hW  ^|^|^|^|^|^|^|^|^|^|^|^|^| <wait>-s ^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|
   |                /                    \ 2.hW  |^|^|^|^|^|^|^|^|^|^|^|^|^          |^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^
   |_______________/                      \______I|I|I|I|I|I|I|I|I|I|I|I|I|~~~~~~~~~~I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|
   |~~~~~~~~~~~~~~/ SM:0.warpScheduler    /~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~~~~~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I
   |              \          |           //
   |               \         RR-mode    //
   |                \    GREEDY-mode   //
   |                 \________________//
   |                   \______________/
* FERMI
* GF100 Server/HPC-GPU PCIe-2.0-16x
*                      GPU_CLK 1.15 GHz [Graphics 575 MHz]
*                      3072 MB GDDR5 773MHz + ECC-correction
* 
*                       448-CUDA-COREs [SMX]-s --> 14 SM * warpSize == 448
*                        48-ROPs
*                        56-TEX-Units, 400 MHz RAMDAC
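The "( maskable by N-WARPs )" annotations in the diagram above follow from a simple Little's-law style argument: the number of resident warps needed to hide a latency is roughly the latency in cycles divided by the issue interval per warp. A sketch of that arithmetic, using figures from the diagram (the 1-cycle issue interval is an assumption for illustration):

```python
def warps_to_hide(latency_clks, issue_clks_per_instr=1):
    """Resident warps needed so the scheduler always has ready work:
    latency / issue interval (Little's law applied to warp scheduling)."""
    return latency_clks // issue_clks_per_instr

print(warps_to_hide(22))    # register dependency on CC-2.x -> 22 warps
print(warps_to_hide(400))   # smREG spillover to locMEM, low end -> 400 warps
```

The second figure exceeds what a single SM can keep resident, which is why the diagram flags register spillover as a penalty rather than a maskable cost in practice.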
CUDA API reports .self to operate an API-Driver-version [5000]
                                  on RT-version         [5000]

CUDA API reports .self to operate a limited FIFO [Host] <-|buffer| <-[Device] of a size of 1048576 [B]

CUDA API reports .self to operate a limited HEAP                  for[Device]-side Dynamic __global__ Memory Allocations in a size of 8388608 [B] ( 8 MB if not specified in malloc() call )

CUDA API reports .self to operate cudaCreateStreamWithPriority() QUEUEs
                                                  with <_stream_PRIO_LOW____> == 725871085
                                                  with <_stream_PRIO_HIGH___> == 0


  CUDA Device:0_ has <_compute capability_> == 2.0.
  CUDA Device:0_ has [    Tesla M2050] .name
  CUDA Device:0_ has [             14] .multiProcessorCount         [ Number of multiprocessors on device ]
  CUDA Device:0_ has [     2817982464] .totalGlobalMem              [ __global__   memory available on device in Bytes [B] ]
  CUDA Device:0_ has [          65536] .totalConstMem               [ __constant__ memory available on device in Bytes [B] ]
  CUDA Device:0_ has [        1147000] .clockRate                   [ GPU_CLK frequency in kilohertz [kHz] ]
  CUDA Device:0_ has [             32] .warpSize                    [ GPU WARP size in threads ]
  CUDA Device:0_ has [        1546000] .memoryClockRate             [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
  CUDA Device:0_ has [            384] .memoryBusWidth              [ GPU_DDR Global memory bus width in bits [b] ]
  CUDA Device:0_ has [           1024] .maxThreadsPerBlock          [ MAX Threads per Block ]
  CUDA Device:0_ has [          32768] .regsPerBlock                [ MAX number of 32-bit Registers available per Block ]
  CUDA Device:0_ has [           1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
  CUDA Device:0_ has [         786432] .l2CacheSize
  CUDA Device:0_ has [          49152] .sharedMemPerBlock           [ __shared__   memory available per Block in Bytes [B] ]
  CUDA Device:0_ has [              2] .asyncEngineCount            [ a number of asynchronous engines ]
  CUDA Device:0_ has [              1] .deviceOverlap               [ if Device can concurrently copy memory and execute a kernel ]
  CUDA Device:0_ has [              0] .kernelExecTimeoutEnabled    [ if there is a run time limit on kernel exec-s ]
  CUDA Device:0_ has [              1] .concurrentKernels           [ if Device can possibly execute multiple kernels concurrently ]
  CUDA Device:0_ has [              1] .canMapHostMemory            [ if can map host memory with cudaHostAlloc / cudaHostGetDevicePointer ]
  CUDA Device:0_ has [              3] .computeMode                 [ enum { 0: Default | 1: Exclusive<thread> | 2: Prohibited | 3: Exclusive<Process> } ]
  CUDA Device:0_ has [              1] .ECCEnabled                  [ if has ECC support enabled ]
  CUDA Device:0_ has [     2147483647] .memPitch                    [ MAX pitch in bytes allowed by memory copies [B] ]
  CUDA Device:0_ has [          65536] .maxSurface1D                [ MAX 1D surface size ]
  CUDA Device:0_ has [          32768] .maxSurfaceCubemap           [ MAX Cubemap surface dimensions ]
  CUDA Device:0_ has [          65536] .maxTexture1D                [ MAX 1D Texture size ]
  CUDA Device:0_ has [              0] .pciBusID                    [ PCI bus ID of the device ]
  CUDA Device:0_ has [              0] .integrated                  [ if GPU-hardware is integrated with Host-side ( ref. Page-Locked Memory XFERs ) ]
  CUDA Device:0_ has [              1] .unifiedAddressing           [ if can use 64-bit process Unified Virtual Address Space in CC-2.0+ ]

  CUDA Device:1_ has <_compute capability_> == 2.0.
  CUDA Device:1_ has [    Tesla M2050] .name
  CUDA Device:1_ has [             14] .multiProcessorCount         [ Number of multiprocessors on device ]
  CUDA Device:1_ has [     2817982464] .totalGlobalMem              [ __global__   memory available on device in Bytes [B] ]
  CUDA Device:1_ has [          65536] .totalConstMem               [ __constant__ memory available on device in Bytes [B] ]
  CUDA Device:1_ has [        1147000] .clockRate                   [ GPU_CLK frequency in kilohertz [kHz] ]
  CUDA Device:1_ has [             32] .warpSize                    [ GPU WARP size in threads ]
  CUDA Device:1_ has [        1546000] .memoryClockRate             [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
  CUDA Device:1_ has [            384] .memoryBusWidth              [ GPU_DDR Global memory bus width in bits [b] ]
  CUDA Device:1_ has [           1024] .maxThreadsPerBlock          [ MAX Threads per Block ]
  CUDA Device:1_ has [          32768] .regsPerBlock                [ MAX number of 32-bit Registers available per Block ]
  CUDA Device:1_ has [           1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
  CUDA Device:1_ has [         786432] .l2CacheSize
  CUDA Device:1_ has [          49152] .sharedMemPerBlock           [ __shared__   memory available per Block in Bytes [B] ]
  CUDA Device:1_ has [              2] .asyncEngineCount            [ a number of asynchronous engines ]
  CUDA Device:1_ has [              1] .deviceOverlap               [ if Device can concurrently copy memory and execute a kernel ]
  CUDA Device:1_ has [              0] .kernelExecTimeoutEnabled    [ if there is a run time limit on kernel exec-s ]
  CUDA Device:1_ has [              1] .concurrentKernels           [ if Device can possibly execute multiple kernels concurrently ]
  CUDA Device:1_ has [              1] .canMapHostMemory            [ if can map host memory with cudaHostAlloc / cudaHostGetDevicePointer ]
  CUDA Device:1_ has [              3] .computeMode                 [ enum { 0: Default | 1: Exclusive<thread> | 2: Prohibited | 3: Exclusive<Process> } ]
  CUDA Device:1_ has [              1] .ECCEnabled                  [ if has ECC support enabled ]
  CUDA Device:1_ has [     2147483647] .memPitch                    [ MAX pitch in bytes allowed by memory copies [B] ]
  CUDA Device:1_ has [          65536] .maxSurface1D                [ MAX 1D surface size ]
  CUDA Device:1_ has [          32768] .maxSurfaceCubemap           [ MAX Cubemap surface dimensions ]
  CUDA Device:1_ has [          65536] .maxTexture1D                [ MAX 1D Texture size ]
  CUDA Device:1_ has [              0] .pciBusID                    [ PCI bus ID of the device ]
  CUDA Device:1_ has [              0] .integrated                  [ if GPU-hardware is integrated with Host-side ( ref. Page-Locked Memory XFERs ) ]
  CUDA Device:1_ has [              1] .unifiedAddressing           [ if can use 64-bit process Unified Virtual Address Space in CC-2.0+ ]
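One figure the listing above does let you compute directly is theoretical peak DRAM bandwidth: memoryClockRate x memoryBusWidth x 2 (for the double data rate of GDDR). A sketch using the Tesla M2050 values reported above:

```python
def peak_bandwidth_gbs(memory_clock_khz, bus_width_bits, ddr_factor=2):
    """Theoretical peak global-memory bandwidth in GB/s, computed from the
    cudaGetDeviceProperties() fields shown in the listing above."""
    bytes_per_s = memory_clock_khz * 1e3 * ddr_factor * bus_width_bits / 8
    return bytes_per_s / 1e9

# Tesla M2050: .memoryClockRate == 1546000 [kHz], .memoryBusWidth == 384 [b]
print(f"{peak_bandwidth_gbs(1_546_000, 384):.1f} GB/s")  # -> 148.4 GB/s
```

Note that, unlike bandwidth, no comparable closed-form formula for latency falls out of these properties, which is the crux of the question.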