Python Tensorflow: GPU speedup only happens after the first run


I have installed CUDA and cuDNN on my machine (Ubuntu 16.04), alongside tensorflow-gpu.

Versions used: CUDA 10.0, cuDNN 7.6, Python 3.6, Tensorflow 1.14
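
As a sanity check (not in the original post, just a minimal sketch), the installed build can be verified from Python:

import tensorflow as tf

print(tf.__version__)                  # expect 1.14.x
print(tf.test.is_built_with_cuda())    # True for the GPU-enabled build
print(tf.test.is_gpu_available())      # True if a CUDA device is actually usable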


This is the output of nvidia-smi, showing the video card configuration:

| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960M    On   | 00000000:02:00.0 Off |                  N/A |
| N/A   44C    P8    N/A /  N/A |    675MiB /  4046MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1502      G   /usr/lib/xorg/Xorg                           363MiB |
|    0      3281      G   compiz                                        96MiB |
|    0      4375      G   ...uest-channel-token=14359313252217012722    69MiB |
|    0      5157      C   ...felipe/proj/venv/bin/python3.6            141MiB |
+-----------------------------------------------------------------------------+
This is the output of device_lib.list_local_devices() (a TensorFlow helper method that shows which devices it can see), showing that my GPU is visible to TensorFlow:

[name: "/device:CPU:0"
  device_type: "CPU"
  memory_limit: 268435456
  locality {
  }
  incarnation: 5096693727819965430, 
name: "/device:XLA_GPU:0"
  device_type: "XLA_GPU"
  memory_limit: 17179869184
  locality {
  }
  incarnation: 13415556283266501672
  physical_device_desc: "device: XLA_GPU device", 
name: "/device:XLA_CPU:0"
  device_type: "XLA_CPU"
  memory_limit: 17179869184
  locality {
  }
  incarnation: 14339781620792127180
  physical_device_desc: "device: XLA_CPU device", 
name: "/device:GPU:0"
  device_type: "GPU"
  memory_limit: 3464953856
  locality {
    bus_id: 1
    links {
    }
  }
  incarnation: 13743207545082600644
  physical_device_desc: "device: 0, name: GeForce GTX 960M, pci bus id: 0000:02:00.0, compute capability: 5.0"
]
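
For reference, the listing above comes from a call like this (in TF 1.x, device_lib lives under tensorflow.python.client):

from tensorflow.python.client import device_lib

# enumerate every device TensorFlow can see (CPU, GPU, and their XLA variants)
print(device_lib.list_local_devices())
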
Now on to actually using the GPU for computation. I used this small snippet to run some dummy matrix multiplications on the CPU and on the GPU, to compare performance:

import tensorflow as tf
from datetime import datetime

shapes = [(50, 50), (100, 100), (500, 500), (1000, 1000), (10000, 10000), (15000, 15000)]

devices = ['/device:CPU:0', '/device:XLA_GPU:0']

for device in devices:
    for shape in shapes:
        # Build the dummy ops, pinned to the requested device
        with tf.device(device):
            random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
            dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
            sum_operation = tf.reduce_sum(dot_operation)

        # Time the actual runtime of the operations
        start_time = datetime.now()
        with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
            result = session.run(sum_operation)
        elapsed_time = datetime.now() - start_time

        # print elapsed time, shape and device used
        print("Input shape: {} using Device: {} took: {:.2f}".format(
            shape, device, elapsed_time.total_seconds()))
Here comes the surprise. The first time I ran the cell containing this code (I'm in a Jupyter notebook), the GPU computation took much longer than the CPU:

# output of first run: CPU is faster
----------------------------------------
Input shape: (50, 50) using Device: /device:CPU:0 took: 0.01
Input shape: (100, 100) using Device: /device:CPU:0 took: 0.01
Input shape: (500, 500) using Device: /device:CPU:0 took: 0.01
Input shape: (1000, 1000) using Device: /device:CPU:0 took: 0.02
Input shape: (10000, 10000) using Device: /device:CPU:0 took: 6.22
Input shape: (15000, 15000) using Device: /device:CPU:0 took: 21.23
----------------------------------------
Input shape: (50, 50) using Device: /device:XLA_GPU:0 took: 2.82
Input shape: (100, 100) using Device: /device:XLA_GPU:0 took: 0.17
Input shape: (500, 500) using Device: /device:XLA_GPU:0 took: 0.18
Input shape: (1000, 1000) using Device: /device:XLA_GPU:0 took: 0.20
Input shape: (10000, 10000) using Device: /device:XLA_GPU:0 took: 28.36
Input shape: (15000, 15000) using Device: /device:XLA_GPU:0 took: 93.73
----------------------------------------
Surprise #2: when I re-ran the cell containing the dummy matrix-multiplication code, the GPU version was much faster (as expected):

# output of reruns: GPU is faster
----------------------------------------
Input shape: (50, 50) using Device: /device:CPU:0 took: 0.02
Input shape: (100, 100) using Device: /device:CPU:0 took: 0.02
Input shape: (500, 500) using Device: /device:CPU:0 took: 0.02
Input shape: (1000, 1000) using Device: /device:CPU:0 took: 0.04
Input shape: (10000, 10000) using Device: /device:CPU:0 took: 6.78
Input shape: (15000, 15000) using Device: /device:CPU:0 took: 24.65
----------------------------------------
Input shape: (50, 50) using Device: /device:XLA_GPU:0 took: 0.14
Input shape: (100, 100) using Device: /device:XLA_GPU:0 took: 0.12
Input shape: (500, 500) using Device: /device:XLA_GPU:0 took: 0.13
Input shape: (1000, 1000) using Device: /device:XLA_GPU:0 took: 0.14
Input shape: (10000, 10000) using Device: /device:XLA_GPU:0 took: 1.64
Input shape: (15000, 15000) using Device: /device:XLA_GPU:0 took: 5.29
----------------------------------------

So my question is: why does the GPU speedup only kick in after I've run the code once?

I can see the GPU is set up correctly (otherwise no speedup would happen at all). Is this due to some sort of initial overhead? Do GPUs need to warm up before we can actually use them?

P.S.: On both runs (i.e., the one where the GPU was slower and the next one, where the GPU was faster) I could see GPU usage was at 100%, so it was definitely being used.

P.S.: It's only on the very first run that the GPU doesn't seem to get picked up. If I then run it two, three, or more times, every run after the first one succeeds (i.e., the GPU computation is faster).

The pointer to go look at the XLA stuff is what helped me find the solution.

There are two ways a GPU can be mapped as a TensorFlow device: as an XLA device and as a normal GPU. That's why there are two devices, one named "/device:XLA_GPU:0" and the other named "/device:GPU:0".

All I needed to do was target "/device:GPU:0" instead. Now TensorFlow picks up the GPU immediately.
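
In terms of the benchmark above, that fix is a one-line change to the device list:

# target the plain GPU device instead of the XLA one
devices = ['/device:CPU:0', '/device:GPU:0']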

There are indeed two GPU devices, and comparing CPU against XLA_GPU may not be the best comparison. Still, I would also expect the answer to your question (why the first iteration takes so long) to be the JIT mechanism inherent to XLA. The JIT compilation runs on first use and involves many additional processing steps. The code is still executed on the GPU, but the JIT process takes extra time. After that, subsequent calls to the function don't incur the JIT overhead and run faster. I'm reasonably sure this is "expected" behavior.

@RobertCrovella you could write that up as an answer and I'll mark it as the correct one.
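
A practical takeaway from that comment for benchmarking (a sketch, not part of the original code): do one untimed warm-up session.run() so the one-time JIT/initialization cost is excluded from the measurement:

# Sketch: exclude one-time startup/JIT cost from the timing
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
    session.run(sum_operation)             # warm-up run absorbs XLA JIT compilation and CUDA startup
    start_time = datetime.now()
    result = session.run(sum_operation)    # timed run reflects steady-state performance
    elapsed_time = datetime.now() - start_time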