Python: No performance improvement seen when running TensorFlow on the GPU


I installed CUDA and cuDNN following the instructions on the TF help page, and everything appears to be working. If I print the available GPUs, I get:

>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Out: Num GPUs Available:  1
Also, when I start training the Sequential model, the output shows that all the necessary libraries are loaded correctly and the GPU device is created successfully:

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4733 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)
But I don't see any significant improvement in training performance. It takes about as long as it did before on the CPU, and I would have thought my RTX 3060 should provide some speedup.

Should I expect to see an improvement when training a relatively simple Sequential model?

Edit: if I disable GPU training and train on the CPU only, using:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
the model trains in 21.14 seconds on the CPU versus 57.59 seconds on the GPU (!!!).

I also don't see the GPU load increase the way I would expect during training.
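As a side note (my own addition, not part of the original question, and assuming a visible GPU), one way to confirm which device the ops actually run on is TensorFlow's device-placement logging; a minimal sketch:

import tensorflow as tf

# Must be called before any ops are created
tf.debugging.set_log_device_placement(True)

a = tf.random.normal((1000, 1000))
b = tf.random.normal((1000, 1000))
c = tf.matmul(a, b)
print(c.device)  # e.g. /job:localhost/replica:0/task:0/device:GPU:0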

And here is the code for the model I am training:

import datetime as dt
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import tensorflow as tf
from tensorflow import keras
import numpy as np

EPOCHS = 50
BATCH_SIZE = 128
VERBOSE = 1
NB_CLASSES = 10  # Number of outputs
N_HIDDEN = 128
VALIDATION_SPLIT = 0.2
DROPOUT = 0.3

mnist = keras.datasets.mnist
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
# X_train is 60,000 rows of 28x28 values
# Reshape it to 60,000x784
RESHAPED = 784

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize inputs between 0 and 1
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# One-hot encoding of labels
Y_train = tf.keras.utils.to_categorical(Y_train, NB_CLASSES)
Y_test = tf.keras.utils.to_categorical(Y_test, NB_CLASSES)

# Build the model
model = tf.keras.models.Sequential()
model.add(keras.layers.Dense(N_HIDDEN, input_shape=(RESHAPED,),
          name='dense_layer', activation='relu'))
model.add(keras.layers.Dropout(DROPOUT))
model.add(keras.layers.Dense(N_HIDDEN, input_shape=(RESHAPED,),
          name='dense_layer2', activation='relu'))
model.add(keras.layers.Dropout(DROPOUT))
model.add(keras.layers.Dense(NB_CLASSES, input_shape=(RESHAPED,),
          name='dense_layer3', activation='softmax'))

# Print summary of the model
model.summary()

# Compiling the model
model.compile(optimizer='Adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

t = dt.datetime.now()
# Training the model
model.fit(X_train, Y_train, batch_size=BATCH_SIZE,
          epochs=EPOCHS, verbose=VERBOSE,
          validation_split=VALIDATION_SPLIT)

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, Y_test)
print('Test accuracy: ', test_acc)
print(f'Training elapsed: {dt.datetime.now()-t}')

I will post an answer here in case it is useful to anyone in the future. Based on the information provided in the comments and an answer on GPU slowness, this seems to be the result of two factors combined.

For starters, matrix multiplication on small matrices is significantly faster on the CPU because of its higher clock speed. Second, there is considerable overhead in transferring data between the CPU and the GPU, and on smaller inputs any performance gain from GPU processing is eaten up by that overhead.
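To illustrate the point (this benchmark is my own sketch, not from the original answer, and it assumes a visible GPU at '/GPU:0'), you can time a matrix multiplication pinned to each device and see how the balance shifts between small and large inputs:

import time
import tensorflow as tf

def time_matmul(device, n, repeats=100):
    # Time repeated n x n matmuls pinned to the given device
    with tf.device(device):
        a = tf.random.normal((n, n))
        b = tf.random.normal((n, n))
        tf.matmul(a, b)  # warm-up run, excludes one-time setup cost
        start = time.perf_counter()
        for _ in range(repeats):
            c = tf.matmul(a, b)
        _ = c.numpy()  # block until the result is ready before stopping the clock
        return time.perf_counter() - start

for n in (128, 4096):
    print(f"n={n}: CPU {time_matmul('/CPU:0', n):.3f}s, "
          f"GPU {time_matmul('/GPU:0', n):.3f}s")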

So, on the MNIST dataset, where the inputs have shape (784,), the training times look like this:

CPU: 21 s
GPU: 57 s

Meanwhile, on the IMDB dataset, where the inputs have shape (10000,), the gain from GPU processing is now very significant:

CPU: 4 min 40 s
GPU: 1 min 23 s
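For reference, the IMDB comparison presumably used reviews multi-hot encoded over a 10,000-word vocabulary, which is what produces the (10000,) input shape; a minimal sketch of that kind of preprocessing (my assumption, not code from the original post):

import numpy as np
from tensorflow import keras

NUM_WORDS = 10000
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=NUM_WORDS)

def multi_hot(sequences, dim=NUM_WORDS):
    # One row per review, 1.0 at the index of every word that appears
    out = np.zeros((len(sequences), dim), dtype="float32")
    for i, seq in enumerate(sequences):
        out[i, seq] = 1.0
    return out

x_train = multi_hot(x_train)  # shape (25000, 10000)
x_test = multi_hot(x_test)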

So, for smaller inputs it is better to disable GPU processing and get a faster fit, using:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
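An alternative that does not rely on an environment variable is to hide the GPUs through tf.config (this must run before TensorFlow touches the GPU); a short sketch:

import tensorflow as tf

tf.config.set_visible_devices([], 'GPU')      # hide all GPUs from TensorFlow
print(tf.config.get_visible_devices('GPU'))   # -> []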

Can you share the code (model, training loop) and the timing results on both the CPU and the GPU?
@LouisLac Fixed the question. On the GPU it is actually much slower, no idea why.
Have you checked this? Regarding the GPU load, it is important to understand that a GPU only really pays off when the model is deep and the data is complex (MNIST is just 28x28 images). So my guess is that because the MNIST example is so simple, the GPU resources are barely used at all. Also, to check whether the GPU is actually being used, I would suggest: tf.test.is_gpu_available()
@aSaffary Thanks, no, I had not seen that one.
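As a side note, tf.test.is_gpu_available() is deprecated in recent TensorFlow 2.x releases; a short sketch of the currently recommended check:

import tensorflow as tf

print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices('GPU'))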