Should Keras with the Theano backend be more than 18X slower than Keras with the TensorFlow backend?

Tags: tensorflow, theano, keras

I just installed keras, tensorflow, and theano on my machine and ran a quick comparison of Keras using TensorFlow as the backend versus Keras using Theano as the backend. The results were more extreme than I expected.

I am using the following versions of the three packages:

>>> theano.__version__
'0.8.2'
>>> tensorflow.__version__
'0.12.1'
>>> keras.__version__
'1.2.1'
To compare the two backends, I used the cifar10_cnn.py example.
When I use TensorFlow as the backend, I get the following results:

deep@deep-Precision-7710:~/Downloads/keras/examples$ python cifar10_cnn.py 
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
X_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
Using real-time data augmentation.
Epoch 1/200
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least   one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: Quadro M5000M
major: 5 minor: 2 memoryClockRate (GHz) 1.0505
pciBusID 0000:01:00.0
Total memory: 7.93GiB
Free memory: 7.59GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro M5000M, pci bus id: 0000:01:00.0)
50000/50000 [==============================] - 16s - loss: 1.7916 - acc:    0.3350 - val_loss: 1.4998 - val_acc: 0.4490
Epoch 2/200
50000/50000 [==============================] - 15s - loss: 1.4020 - acc: 0.4907 - val_loss: 1.2039 - val_acc: 0.5779
Epoch 3/200
50000/50000 [==============================] - 15s - loss: 1.2460 - acc: 0.5531 - val_loss: 1.0272 - val_acc: 0.6311
At that rate the whole run would take only about 51 minutes to complete (roughly 15 s per epoch × 200 epochs).

When I use Theano as the backend, I get the following results:

deep@deep-Precision-7710:~/Downloads/keras/examples$ python cifar10_cnn.py 
Using Theano backend.
X_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
Using real-time data augmentation.
Epoch 1/200
50000/50000 [==============================] - 292s - loss: 1.8008 - acc: 0.3286 - val_loss: 1.4991 - val_acc: 0.4613
Epoch 2/200
50000/50000 [==============================] - 285s - loss: 1.4302 - acc: 0.4774 - val_loss: 1.1840 - val_acc: 0.5737
Epoch 3/200
50000/50000 [==============================] - 288s - loss: 1.2690 - acc: 0.5452 - val_loss: 1.0930 - val_acc: 0.6030
That run would take about 15.75 hours (roughly 285 s per epoch × 200 epochs).

Should I be surprised that the Keras Theano backend is about 18X-19X slower than the Keras TensorFlow backend? To switch from one backend to the other, I only changed the backend specification in keras.json (see the sketch below). I did not adjust the image dim ordering, since that seems to be determined by the dataset.
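To make the change concrete, here is a minimal keras.json sketch for Keras 1.x; the keys other than "backend" are shown with typical defaults and may differ on your install:

{
    "image_dim_ordering": "tf",
    "epsilon": 1e-07,
    "floatX": "float32",
    "backend": "theano"
}

After editing it, something like python -c "from keras import backend; print(backend.backend())" should confirm which backend Keras picks up.

Here is the version of cifar10_cnn.py I used: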

'''Train a simple deep CNN on the CIFAR10 small images dataset.

GPU run command with Theano backend (with TensorFlow, the GPU is automatically used):
    THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python cifar10_cnn.py

It gets down to 0.65 test logloss in 25 epochs, and down to 0.55 after 50 epochs.
(it's still underfitting at that point, though).
'''

from __future__ import print_function
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import SGD
from keras.utils import np_utils

import datetime

batch_size = 32
nb_classes = 10
nb_epoch = 200
data_augmentation = True

# input image dimensions
img_rows, img_cols = 32, 32
# The CIFAR10 images are RGB.
img_channels = 3

# The data, shuffled and split between train and test sets:
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices.
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

model = Sequential()

model.add(Convolution2D(32, 3, 3, border_mode='same',
                        input_shape=X_train.shape[1:]))
model.add(Activation('relu'))
model.add(Convolution2D(32, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Convolution2D(64, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(Convolution2D(64, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

# Let's train the model using SGD + momentum (how original).
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

if not data_augmentation:
    print('Not using data augmentation.')
    model.fit(X_train, Y_train,
              batch_size=batch_size,
              nb_epoch=nb_epoch,
              validation_data=(X_test, Y_test),
              shuffle=True)
else:
    print('Using real-time data augmentation.')
    # This will do preprocessing and realtime data augmentation:
    datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=0,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images
        vertical_flip=False)  # randomly flip images

    # Compute quantities required for featurewise normalization
    # (std, mean, and principal components if ZCA whitening is applied).
    datagen.fit(X_train)

    start = datetime.datetime.now()
    # Fit the model on the batches generated by datagen.flow().
    model.fit_generator(datagen.flow(X_train, Y_train,
                                     batch_size=batch_size),
                        samples_per_epoch=X_train.shape[0],
                        nb_epoch=nb_epoch,
                        validation_data=(X_test, Y_test))
    stop = datetime.datetime.now()
    print("\nTime to run:",stop-start)

It looks like Theano did not find your GPU. You should be getting a message of the form: Using gpu device 0: GeForce GTX TITAN X
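One way to make sure of this (a sketch for Theano 0.8.x, where the device is still named gpu; newer releases use cuda) is the THEANO_FLAGS form already shown in the script's docstring, or equivalently a ~/.theanorc:

[global]
device = gpu
floatX = float32

With that in place, importing theano should print the "Using gpu device 0: ..." banner before Keras starts training.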
@y300 Thanks, that was the problem. The Theano backend now runs at 38 s/epoch, only a bit more than 2X slower. I am still curious about that remaining difference, but at least the world makes sense again.

I think that can happen, depending on your GPU. Also, make sure cuDNN is installed and detected.

@y300 cuDNN is installed and detected (although it is cuDNN 5110, slightly newer than the supported version). My GPU is a Quadro M5000M. Is there any reason to expect TensorFlow to do better than Theano with this GPU?
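For the cuDNN part, a quick way to check what Theano itself detects (a sketch against Theano 0.8.x's old CUDA backend; the module path and the .msg attribute are specific to that era):

# Run with the GPU enabled, e.g.
#   THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check_cudnn.py
from __future__ import print_function
from theano.sandbox.cuda.dnn import dnn_available

if dnn_available():
    print('cuDNN detected')
else:
    # dnn_available() records why detection failed
    print('cuDNN not available:', dnn_available.msg)

If this prints "cuDNN detected", the cuDNN installation itself is at least not the problem.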