How can I obtain/calculate my GPU's memory latency without measuring it?
The cudaGetDeviceProperties() API call does not seem to tell us much about global memory latency (not even a typical value, or a min/max pair).

Edit: When I say latency, I actually mean the different latencies of reads that have to be served from main device memory under different circumstances. So there are actually six figures: {TLB L1 hit, TLB L2 hit, TLB miss} x L1 data cache {on, off}.

Q1: Is there any way to obtain these numbers other than measuring them myself? Even a rule-of-thumb calculation based on the SM version, the SM clock and the memory clock would do.

And a second question:

Q2: If not, is there a utility that does this for you?
(Though that may be off-topic for this site.)

Like the equivalent cpuid functionality on x86 CPUs, the purpose of cudaGetDeviceProperties() is to return relevant microarchitectural parameters. As on CPUs, performance characteristics on GPUs can differ even when the microarchitectural parameters are identical, for example because of different clock frequencies, or because of differing specifications of the attached DRAM and the way these interact with the various buffering and caching mechanisms inside the processor. In general there is no single "memory latency" one could assign, and I know of no way to compute even the possible range from the known microarchitectural parameters.
Therefore, on CPUs and GPUs alike, one has to resort to sophisticated microbenchmarks to determine performance parameters such as DRAM latency. How to construct such a microbenchmark for each parameter of interest is too broad to cover here, but multiple papers have been published that discuss this in detail for NVIDIA GPUs. One of the earliest relevant publications is:

Wong et al., "Demystifying GPU Microarchitecture through Microbenchmarking," in Proceedings of the 2010 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 235-246

A more recent work that covers the Kepler architecture is:

Xinxin Mei and Xiaowen Chu, "Dissecting GPU Memory Hierarchy through Microbenchmarking," arXiv manuscript, September 2015, pp. 1-14

Short of constructing one's own microbenchmarks, one has to rely on published results, such as those above, for the various implementation-specific performance parameters of a particular GPU.
In many years of working on optimizations for GPU platforms, I have not found a need for this kind of data; in general, the performance metrics of the CUDA profiler should be sufficient to track down a specific bottleneck.

A1: A good read on GPU microarchitecture, including the data you ask about, is Fig. 4.1 [a symbolic multi-SM architecture] and Table 4-1, PTX instruction categories with their issue and execution cycles.

Keep these in mind whenever you work out the cost/benefit equation of whether a CPU-GPU-CPU pipeline can actually finish the task faster, before you start building a GPU kernel.

A2: A GPU instruction-code simulator is the way to obtain the expected number of GPU_CLKs a particular microarchitecture has to spend (as a lower bound at least, since additional GPU kernels may be competing for hardware resources concurrently, prolonging the end-to-end latencies observed in vivo).

Numbers matter:
Category                     Hardware    Throughput + Execution
                             Unit        Latency [GPU_CLKs]       PTX instructions               Note
_____________________________|___________|________________________|______________________________|_____________________________________
Load_shared LSU 2 + 30 ld, ldu Note, .ss = .shared ; .vec and .type determine the size of load. Note also that we omit .cop since no cacheable in Ocelot
Load_global LSU 2 + 600 ld, ldu, prefetch, prefetchu Note, .ss = .global; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
Load_local LSU 2 + 600 ld, ldu, prefetch, prefetchu Note, .ss = .local; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
Load_const LSU 2 + 600 ld, ldu Note, .ss = .const; .vec and .type determine the size of load
Load_param LSU 2 + 30 ld, ldu Note, .ss = .param; .vec and .type determine the size of load
| |
Store_shared LSU 2 + 30 st Note, .ss = .shared; .vec and .type determine the size of store
Store_global LSU 2 + 600 st Note, .ss = .global; .vec and .type determine the size of store
Store_local LSU 2 + 600 st Note, .ss = .local; .vec and .type determine the size of store
Read_modify_write_shared LSU 2 + 600 atom, red Note, .space = shared; .type determine the size
Read_modify_write_global LSU 2 + 600 atom, red Note, .space = global; .type determine the size
| |
Texture LSU 2 + 600 tex, txq, suld, sust, sured, suq
| |
Integer ALU 2 + 24 add, sub, add.cc, addc, sub.cc, subc, mul, mad, mul24, mad24, sad, div, rem, abs, neg, min, max, popc, clz, bfind, brev, bfe, bfi, prmt, mov
| | Note, these integer inst. with type = { .u16, .u32, .u64, .s16, .s32, .s64 };
| |
Float_single ALU 2 + 24 testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max Note, these Float-single inst. with type = { .f32 };
Float_double ALU 1 + 48 testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max Note, these Float-double inst. with type = { .f64 };
Special_single SFU 8 + 48 rcp, sqrt, rsqrt, sin, cos, lg2, ex2 Note, these special-single with type = { .f32 };
Special_double SFU 8 + 72 rcp, sqrt, rsqrt, sin, cos, lg2, ex2 Note, these special-double with type = { .f64 };
|
Logical ALU 2 + 24 and, or, xor, not, cnot, shl, shr
Control ALU 2 + 24 bra, call, ret, exit
|
Synchronization          ALU      2 + 24     bar, membar, vote
Compare & Select ALU 2 + 24 set, setp, selp, slct
|
Conversion               ALU      2 + 24     isspacep, cvta, cvt
Miscellanies ALU 2 + 24 brkpt, pmevent, trap
Video ALU 2 + 24 vadd, vsub, vabsdiff, vmin, vmax, vshl, vshr, vmad, vset
|
+====================| + 11-12 [usec] XFER-LATENCY-up HostToDevice ~~~ same as Intel X48 / nForce 790i
| |||||||||||||||||| + 10-11 [usec] XFER-LATENCY-down DeviceToHost
| |||||||||||||||||| ~ 5.5 GB/sec XFER-BW-up ~~~ same as DDR2/DDR3 throughput
| |||||||||||||||||| ~ 5.2 GB/sec XFER-BW-down @8192 KB TEST-LOAD ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3]
| ||||||||||||||||||
| | PCIe-2.0 ( 4x) | ~ 4 GB/s over 4-Lanes ( PORT #2 )
| | PCIe-2.0 ( 8x) | ~16 GB/s over 8-Lanes
| | PCIe-2.0 (16x) | ~32 GB/s over 16-Lanes ( mode 16x )
| ||||||||||||||||||
+====================|
| PAR -- ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!]
| smREGs________________________________________ penalty +400 ~ +800 [GPU_CLKs] latency ( maskable by 400~800 WARPs ) on <Compile-time>-designed spillover(s) to locMEM__
| +350 ~ +700 [ns] @1147 MHz FERMI ^^^^^^^^
| | ^^^^^^^^
| +5 [ns] @ 200 MHz FPGA. . . . . . Xilinx/Zync Z7020/FPGA massive-parallel streamline-computing mode ev. PicoBlazer softCPU
| | ^^^^^^^^
| ~ +20 [ns] @1147 MHz FERMI ^^^^^^^^
| SM-REGISTERs/thread: max 63 for CC-2.x -with only about +22 [GPU_CLKs] latency ( maskable by 22-WARPs ) to hide on [REGISTER DEPENDENCY] when arithmetic result is to be served from previous [INSTR] [G]:10.4, Page-46
| max 63 for CC-3.0 - about +11 [GPU_CLKs] latency ( maskable by 44-WARPs ) [B]:5.2.3, Page-73
| max 128 for CC-1.x PAR -- ||||||||~~~|
| max 255 for CC-3.5 PAR -- ||||||||||||||||||~~~~~~|
|
| smREGs___BW ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE << -Xptxas -v || nvcc -maxrregcount ( w|w/o spillover(s) )
| with about 8.0 TB/s BW [C:Pg.46]
| 1.3 TB/s BW shaMEM___ 4B * 32banks * 15 SMs * half 1.4GHz = 1.3 TB/s only on FERMI
| 0.1 TB/s BW gloMEM___
| ________________________________________________________________________________________________________________________________________________________________________________________________________________________
+========| DEVICE:3 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+======| DEVICE:2 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+====| DEVICE:1 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+==| DEVICE:0 PERSISTENT gloMEM_____________________________________________________________________+440 [GPU_CLKs]_________________________________________________________________________|_GB|
! | |\ + |
o | texMEM___|_\___________________________________texMEM______________________+_______________________________________________________________________________________|_MB|
| |\ \ |\ + |\ |
| texL2cache_| \ \ .| \_ _ _ _ _ _ _ _texL2cache +370 [GPU_CLKs] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | \ 256_KB|
| | \ \ | \ + |\ ^ \ |
| | \ \ | \ + | \ ^ \ |
| | \ \ | \ + | \ ^ \ |
| texL1cache_| \ \ .| \_ _ _ _ _ _texL1cache +260 [GPU_CLKs] _ _ _ _ _ _ _ _ _ | \_ _ _ _ _^ \ 5_KB|
| | \ \ | \ + ^\ ^ \ ^\ \ |
| shaMEM + conL3cache_| \ \ | \ _ _ _ _ conL3cache +220 [GPU_CLKs] ^ \ ^ \ ^ \ \ 32_KB|
| | \ \ | \ ^\ + ^ \ ^ \ ^ \ \ |
| | \ \ | \ ^ \ + ^ \ ^ \ ^ \ \ |
| ______________________|__________\_\_______________________|__________\_____^__\________+__________________________________________\_________\_____\________________________________|
| +220 [GPU-CLKs]_| |_ _ _ ___|\ \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ _+220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache
| L2-on-re-use-only +80 [GPU-CLKs]_| 64 KB L2_|_ _ _ __|\\ \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache
| L1-on-re-use-only +40 [GPU-CLKs]_| 8 KB L1_|_ _ _ _|\\\ \_\__________________________________\________\_____+ 40 [GPU_CLKs]_____________________________________________________________________________|
| L1-on-re-use-only + 8 [GPU-CLKs]_| 2 KB L1_|__________|\\\\__________\_\__________________________________\________\____+ 8 [GPU_CLKs]_________________________________________________________conL1cache 2_KB|
| on-chip|smREG +22 [GPU-CLKs]_| |t[0_______^:~~~~~~~~~~~~~~~~\:________]
|CC- MAX |_|_|_|_|_|_|_|_|_|_|_| |t[1_______^ :________]
|2.x 63 |_|_|_|_|_|_|_|_|_|_|_| |t[2_______^ :________]
|1.x 128 |_|_|_|_|_|_|_|_|_|_|_| |t[3_______^ :________]
|3.5 255 REGISTERs|_|_|_|_|_|_|_|_| |t[4_______^ :________]
| per|_|_|_|_|_|_|_|_|_|_|_| |t[5_______^ :________]
| Thread_|_|_|_|_|_|_|_|_|_| |t[6_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[7_______^ 1stHalf-WARP :________]______________
| |_|_|_|_|_|_|_|_|_|_|_| |t[ 8_______^:~~~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ 9_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ A_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ B_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ C_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ D_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ E_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| W0..|t[ F_______^____________WARP__:________]_____________
| |_|_|_|_|_|_|_|_|_|_|_| ..............
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[0_______^:~~~~~~~~~~~~~~~\:________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[1_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[2_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[3_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[4_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[5_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[6_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[7_______^ 1stHalf-WARP :________]______________
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ 9_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ A_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ B_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ C_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ D_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ E_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| W1..............|t[ F_______^___________WARP__:________]_____________
| |_|_|_|_|_|_|_|_|_|_|_|tBlock Wn....................................................|t[ F_______^___________WARP__:________]_____________
|
| ________________ °°°°°°°°°°°°°°°°°°°°°°°°°°~~~~~~~~~~°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
| / \ CC-2.0|||||||||||||||||||||||||| ~masked ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| / \ 1.hW ^|^|^|^|^|^|^|^|^|^|^|^|^| <wait>-s ^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|
| / \ 2.hW |^|^|^|^|^|^|^|^|^|^|^|^|^ |^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^
|_______________/ \______I|I|I|I|I|I|I|I|I|I|I|I|I|~~~~~~~~~~I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|
|~~~~~~~~~~~~~~/ SM:0.warpScheduler /~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~~~~~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I
| \ | //
| \ RR-mode //
| \ GREEDY-mode //
| \________________//
| \______________/
* FERMI
* GF100 Server/HPC-GPU PCIe-2.0-16x
* GPU_CLK 1.15 GHz [Graphics 575 MHz]
* 3072 MB GDDR5 773MHz + ECC-correction
*
* 448-CUDA-COREs [SMX]-s --> 14 SM * warpSize == 448
* 48-ROPs
* 56-TEX-Units, 400 MHz RAMDAC
CUDA API reports .self to operate an API-Driver-version [5000]
on RT-version [5000]
CUDA API reports .self to operate a limited FIFO [Host] <-|buffer| <-[Device] of a size of 1048576 [B]
CUDA API reports .self to operate a limited HEAP for[Device]-side Dynamic __global__ Memory Allocations in a size of 8388608 [B] ( 8 MB if not specified in malloc() call )
CUDA API reports .self to operate cudaCreateStreamWithPriority() QUEUEs
with <_stream_PRIO_LOW____> == 725871085
with <_stream_PRIO_HIGH___> == 0
CUDA Device:0_ has <_compute capability_> == 2.0.
CUDA Device:0_ has [ Tesla M2050] .name
CUDA Device:0_ has [ 14] .multiProcessorCount [ Number of multiprocessors on device ]
CUDA Device:0_ has [ 2817982464] .totalGlobalMem [ __global__ memory available on device in Bytes [B] ]
CUDA Device:0_ has [ 65536] .totalConstMem [ __constant__ memory available on device in Bytes [B] ]
CUDA Device:0_ has [ 1147000] .clockRate [ GPU_CLK frequency in kilohertz [kHz] ]
CUDA Device:0_ has [ 32] .warpSize [ GPU WARP size in threads ]
CUDA Device:0_ has [ 1546000] .memoryClockRate [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
CUDA Device:0_ has [ 384] .memoryBusWidth [ GPU_DDR Global memory bus width in bits [b] ]
CUDA Device:0_ has [ 1024] .maxThreadsPerBlock [ MAX Threads per Block ]
CUDA Device:0_ has [ 32768] .regsPerBlock [ MAX number of 32-bit Registers available per Block ]
CUDA Device:0_ has [ 1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
CUDA Device:0_ has [ 786432] .l2CacheSize
CUDA Device:0_ has [ 49152] .sharedMemPerBlock [ __shared__ memory available per Block in Bytes [B] ]
CUDA Device:0_ has [ 2] .asyncEngineCount [ a number of asynchronous engines ]
CUDA Device:0_ has [ 1] .deviceOverlap [ if Device can concurrently copy memory and execute a kernel ]
CUDA Device:0_ has [ 0] .kernelExecTimeoutEnabled [ if there is a run time limit on kernel exec-s ]
CUDA Device:0_ has [ 1] .concurrentKernels [ if Device can possibly execute multiple kernels concurrently ]
CUDA Device:0_ has [ 1] .canMapHostMemory [ if can map host memory with cudaHostAlloc / cudaHostGetDevicePointer ]
CUDA Device:0_ has [ 3] .computeMode [ enum { 0: Default | 1: Exclusive<thread> | 2: Prohibited | 3: Exclusive<Process> } ]
CUDA Device:0_ has [ 1] .ECCEnabled [ if has ECC support enabled ]
CUDA Device:0_ has [ 2147483647] .memPitch [ MAX pitch in bytes allowed by memory copies [B] ]
CUDA Device:0_ has [ 65536] .maxSurface1D [ MAX 1D surface size ]
CUDA Device:0_ has [ 32768] .maxSurfaceCubemap [ MAX Cubemap surface dimensions ]
CUDA Device:0_ has [ 65536] .maxTexture1D [ MAX 1D Texture size ]
CUDA Device:0_ has [ 0] .pciBusID [ PCI bus ID of the device ]
CUDA Device:0_ has [ 0] .integrated [ if GPU-hardware is integrated with Host-side ( ref. Page-Locked Memory XFERs ) ]
CUDA Device:0_ has [ 1] .unifiedAddressing [ if can use 64-bit process Unified Virtual Address Space in CC-2.0+ ]
CUDA Device:1_ has <_compute capability_> == 2.0.
CUDA Device:1_ has [ Tesla M2050] .name
CUDA Device:1_ has [ 14] .multiProcessorCount [ Number of multiprocessors on device ]
CUDA Device:1_ has [ 2817982464] .totalGlobalMem [ __global__ memory available on device in Bytes [B] ]
CUDA Device:1_ has [ 65536] .totalConstMem [ __constant__ memory available on device in Bytes [B] ]
CUDA Device:1_ has [ 1147000] .clockRate [ GPU_CLK frequency in kilohertz [kHz] ]
CUDA Device:1_ has [ 32] .warpSize [ GPU WARP size in threads ]
CUDA Device:1_ has [ 1546000] .memoryClockRate [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
CUDA Device:1_ has [ 384] .memoryBusWidth [ GPU_DDR Global memory bus width in bits [b] ]
CUDA Device:1_ has [ 1024] .maxThreadsPerBlock [ MAX Threads per Block ]
CUDA Device:1_ has [ 32768] .regsPerBlock [ MAX number of 32-bit Registers available per Block ]
CUDA Device:1_ has [ 1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
CUDA Device:1_ has [ 786432] .l2CacheSize
CUDA Device:1_ has [ 49152] .sharedMemPerBlock [ __shared__ memory available per Block in Bytes [B] ]
CUDA Device:1_ has [ 2] .asyncEngineCount [ a number of asynchronous engines ]
CUDA Device:1_ has [ 1] .deviceOverlap [ if Device can concurrently copy memory and execute a kernel ]
CUDA Device:1_ has [ 0] .kernelExecTimeoutEnabled [ if there is a run time limit on kernel exec-s ]
CUDA Device:1_ has [ 1] .concurrentKernels [ if Device can possibly execute multiple kernels concurrently ]
CUDA Device:1_ has [ 1] .canMapHostMemory [ if can map host memory with cudaHostAlloc / cudaHostGetDevicePointer ]
CUDA Device:1_ has [ 3] .computeMode [ enum { 0: Default | 1: Exclusive<thread> | 2: Prohibited | 3: Exclusive<Process> } ]
CUDA Device:1_ has [ 1] .ECCEnabled [ if has ECC support enabled ]
CUDA Device:1_ has [ 2147483647] .memPitch [ MAX pitch in bytes allowed by memory copies [B] ]
CUDA Device:1_ has [ 65536] .maxSurface1D [ MAX 1D surface size ]
CUDA Device:1_ has [ 32768] .maxSurfaceCubemap [ MAX Cubemap surface dimensions ]
CUDA Device:1_ has [ 65536] .maxTexture1D [ MAX 1D Texture size ]
CUDA Device:1_ has [ 0] .pciBusID [ PCI bus ID of the device ]
CUDA Device:1_ has [ 0] .integrated [ if GPU-hardware is integrated with Host-side ( ref. Page-Locked Memory XFERs ) ]
CUDA Device:1_ has [ 1] .unifiedAddressing [ if can use 64-bit process Unified Virtual Address Space in CC-2.0+ ]