TensorFlow on Google Colab: Why is the CPU faster than the TPU?


tensorflow keras deep-learning google-colaboratory


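I'm using a Google Colab TPU to train a simple Keras model. Removing the distribution strategy and running the same program on the CPU is much faster than running it on the TPU. How is that possible?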

import timeit
import os
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Load Iris dataset
x = load_iris().data
y = load_iris().target

# Split data to train and validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.30, shuffle=False)

# Convert train data type to use TPU 
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')

# Specify a distributed strategy to use TPU
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)

# Use the strategy to create and compile a Keras model
with strategy.scope():
  model = Sequential()
  model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
  model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
  model.compile(optimizer=Adam(learning_rate=0.1), loss='logcosh')

start = timeit.default_timer()

# Fit the Keras model on the dataset
model.fit(x_train, y_train, batch_size=20, epochs=20, validation_data=[x_val, y_val], verbose=0, steps_per_epoch=2)

print('\nTime: ', timeit.default_timer() - start)

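This is probably due to the batch size you are using. Compared with CPUs and GPUs, the training speed of a TPU depends heavily on the batch size. For details, see the following: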

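Cloud TPU hardware is different from CPUs and GPUs. At a high level, CPUs can be characterized as having a low number of high-performance threads. GPUs can be characterized as having a very high number of low-performance threads. A Cloud TPU, with its 128 x 128 matrix unit, can be thought of as either a single, very powerful thread, which can do 16K operations per cycle, or 128 x 128 tiny, simple threads that are connected in pipeline fashion. Correspondingly, when addressing memory, multiples of 8 (floats) are desirable, as well as multiples of 128 for operations targeting the matrix unit.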

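This means the batch size should be a multiple of 128, scaled by the number of TPU cores. Google Colab gives you 8 TPU cores, so in the best case you should choose a batch size of 128 * 8 = 1024.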
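The arithmetic above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `global_batch_size` and `round_up_to_multiple` are made up for this example, and the values 128 and 8 are taken from the answer rather than queried from the hardware.

```python
import math

# Assumptions for illustration: 128 examples per core and 8 cores,
# matching the numbers quoted in the answer above.
PER_CORE_BATCH = 128
NUM_CORES = 8


def global_batch_size(per_core_batch=PER_CORE_BATCH, num_cores=NUM_CORES):
    """Global batch size that gives every core `per_core_batch` examples."""
    return per_core_batch * num_cores


def round_up_to_multiple(batch_size, multiple=PER_CORE_BATCH):
    """Smallest multiple of `multiple` that is >= batch_size."""
    return math.ceil(batch_size / multiple) * multiple


print(global_batch_size())       # 1024
print(round_up_to_multiple(20))  # 128
```

With the batch size of 20 used in the question, each core would receive only a handful of examples per step, far below the 128 that the matrix unit is said to favor.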

Thanks for your question.

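I think what is happening here is a matter of overhead. Because the TPU runs on a separate VM (accessible at grpc://$COLAB_TPU_ADDR), every call that runs a model on the TPU incurs some overhead: the client (the Colab notebook in this case) sends a graph to the TPU, which then compiles and runs it. This overhead is small compared with the time it takes to run, for example, ResNet50 for one epoch, but large compared with running a simple model like the one in your example.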
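The overhead argument can be made concrete with a toy cost model. This is a sketch under stated assumptions, not a measurement: the function names are hypothetical, and the timing constants are invented purely to show how a fixed per-call cost dominates a tiny workload but vanishes against a large one.

```python
def total_time(num_calls, overhead_per_call, compute_per_call):
    """Wall time when every call pays a fixed overhead plus its compute time."""
    return num_calls * (overhead_per_call + compute_per_call)


def overhead_fraction(overhead_per_call, compute_per_call):
    """Share of each call that is spent on overhead rather than compute."""
    return overhead_per_call / (overhead_per_call + compute_per_call)


# Hypothetical numbers, chosen only to illustrate the argument.
OVERHEAD = 1.0       # seconds of fixed cost per call (made up)
TINY_MODEL = 0.01    # seconds of real compute per call (made up)
LARGE_MODEL = 100.0  # seconds of real compute per call (made up)

print(overhead_fraction(OVERHEAD, TINY_MODEL))   # close to 1: overhead dominates
print(overhead_fraction(OVERHEAD, LARGE_MODEL))  # close to 0: overhead is negligible
```

Under this model, reducing the number of calls (for example, fewer and longer epochs) is the only way to speed up a tiny model, since the compute term is already negligible.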

To get the best results on a TPU, we recommend using tf.data.Dataset. I updated your example for TensorFlow 2.2:

%tensorflow_version 2.x
import timeit
import os
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Load Iris dataset
x = load_iris().data
y = load_iris().target

# Split data to train and validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.30, shuffle=False)

# Convert train data type to use TPU
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(20)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(20)

# Use the strategy to create and compile a Keras model
with strategy.scope():
  model = Sequential()
  model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
  model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
  model.compile(optimizer=Adam(learning_rate=0.1), loss='logcosh')

start = timeit.default_timer()

# Fit the Keras model on the dataset
model.fit(train_dataset, epochs=20, validation_data=val_dataset)

print('\nTime: ', timeit.default_timer() - start)

This takes about 30 seconds to run, compared with about 1.3 seconds on the CPU. We can cut the overhead substantially by repeating the dataset and running one long epoch instead of several short ones. I replaced the dataset setup with the following:

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).repeat(20).batch(20)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(20)

And replaced the fit call with the following:

model.fit(train_dataset, validation_data=val_dataset)

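This brought my runtime down to about 6 seconds. That is still slower than the CPU, but that is not surprising for such a small model, which can easily be run locally. In general, you will see more benefit from TPUs with larger models. I recommend taking a closer look at an example that trains a larger image classification model on the MNIST dataset.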
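What the change from 20 short epochs to one long epoch does to the step count can be sketched with plain arithmetic. This is an illustration only: `steps_and_epochs` is a hypothetical helper, it assumes 105 training rows (70% of the 150 Iris rows), and it assumes the final partial batch is kept, which is the default behavior of `batch()`.

```python
import math


def steps_and_epochs(num_examples, batch_size, epochs, repeat=1):
    """Return (steps_per_epoch, total_steps) for a dataset built as
    repeat(repeat).batch(batch_size), keeping the final partial batch."""
    steps_per_epoch = math.ceil(num_examples * repeat / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


NUM_TRAIN = 105  # assumption: 70% of the 150 Iris rows

# Original setup: batch(20), epochs=20
print(steps_and_epochs(NUM_TRAIN, 20, epochs=20))            # (6, 120)

# Modified setup: repeat(20).batch(20), a single epoch
print(steps_and_epochs(NUM_TRAIN, 20, epochs=1, repeat=20))  # (105, 105)
```

The single long epoch crosses one epoch boundary instead of twenty, so validation also runs once rather than twenty times. It performs slightly fewer training steps (105 instead of 120), because batches are no longer cut short at the end of each pass over the data.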

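Thanks for the reply. I tried batch sizes of 128, 512, and 1024, but the TPU is still slower than the CPU. Could this be because I have to convert the data to tensors?

Thanks, using a single epoch did reduce the time, but the evaluation results are worse!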