Global batch training on TPU (TensorFlow)


I recently started a neural network project on Google Colab and found out that I can use a TPU. I have been researching how to use it and discovered TensorFlow's `TPUStrategy` (I am using TensorFlow 2.2.0), and I was able to successfully define a model and run a training step on the TPU.

However, I am not exactly sure what this means. Maybe I haven't read Google's TPU guide thoroughly enough, but what I mean is that I don't know what exactly happens during a training step.

The guide asks you to define a `GLOBAL_BATCH_SIZE`, and the batch size taken by each TPU core is given by `per_replica_batch_size = GLOBAL_BATCH_SIZE / strategy.num_replicas_in_sync`, which means that the per-TPU batch size is smaller than the one you start with. On Colab, `strategy.num_replicas_in_sync = 8`, so if I start with a `GLOBAL_BATCH_SIZE` of 64, the `per_replica_batch_size` is 8.


Now, what I don't understand is: when I compute a training step, does the optimizer compute 8 different steps on batches of size `per_replica_batch_size`, updating the weights of the model 8 different times, or does it just parallelize the computation of the training step this way, ultimately computing a single optimizer step on one batch of size `GLOBAL_BATCH_SIZE`? Thank you.
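The batch-splitting arithmetic the guide describes can be sketched in plain Python (no TPU needed; the shard layout below is only illustrative, since `tf.distribute` handles the actual sharding internally):

```python
GLOBAL_BATCH_SIZE = 64
num_replicas_in_sync = 8  # a Colab TPU exposes 8 cores

# Each core (replica) receives this many examples per step:
per_replica_batch_size = GLOBAL_BATCH_SIZE // num_replicas_in_sync
print(per_replica_batch_size)  # 8

# One global batch is split into one shard per replica:
global_batch = list(range(GLOBAL_BATCH_SIZE))
shards = [global_batch[i::num_replicas_in_sync]
          for i in range(num_replicas_in_sync)]
assert all(len(s) == per_replica_batch_size for s in shards)
```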

This is a good question and is more related to the distribution strategy.

After going through this and this explanation, here is what I can say. Regarding
> the optimizer computes 8 different steps on batches of size
> per_replica_batch_size, updating the weights of the model 8 different
> times
the explanation below should clarify it:

> So, how should the loss be calculated when using a
> tf.distribute.Strategy?
> 
> For an example, let's say you have 4 GPU's and a batch size of 64. One
> batch of input is distributed across the replicas (4 GPUs), each
> replica getting an input of size 16.
> 
> The model on each replica does a forward pass with its respective
> input and calculates the loss. Now, instead of dividing the loss by
> the number of examples in its respective input (BATCH_SIZE_PER_REPLICA
> = 16), the loss should be divided by the GLOBAL_BATCH_SIZE (64).
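The loss-scaling arithmetic in that quote can be checked with a small pure-Python sketch (the per-example losses are made up; the point is that dividing each replica's summed loss by `GLOBAL_BATCH_SIZE`, not by `BATCH_SIZE_PER_REPLICA`, makes the replica contributions add up to the true global mean):

```python
GLOBAL_BATCH_SIZE = 64
NUM_REPLICAS = 4
BATCH_SIZE_PER_REPLICA = GLOBAL_BATCH_SIZE // NUM_REPLICAS  # 16

# Hypothetical per-example losses, 16 per replica:
per_replica_losses = [[0.5 * r] * BATCH_SIZE_PER_REPLICA
                      for r in range(NUM_REPLICAS)]

# Each replica divides its summed loss by the GLOBAL batch size...
scaled = [sum(losses) / GLOBAL_BATCH_SIZE for losses in per_replica_losses]

# ...so summing the replica contributions gives the mean loss over
# the whole global batch of 64 examples:
global_mean = sum(scaled)
all_examples = [l for losses in per_replica_losses for l in losses]
assert abs(global_mean - sum(all_examples) / GLOBAL_BATCH_SIZE) < 1e-12
print(global_mean)  # 0.75
```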
The explanations from the other links are provided below (in case they stop working in the future):

It states the explanation below for this question:

> **Synchronous vs asynchronous training**: These are two common ways of
> distributing training with data parallelism. In sync training, all
> workers train over different slices of input data in sync, and
> **aggregating gradients** at each step. In async training, all workers are
> independently training over the input data and updating variables
> asynchronously. Typically sync training is supported via all-reduce
> and async through parameter server architecture.
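A toy simulation of the synchronous case (pure Python, no TensorFlow; the data and model are invented) may make the aggregation point concrete: each replica computes a gradient on its own shard, the gradients are all-reduced (averaged), and the weights receive one update per training step with that single aggregated gradient:

```python
# Toy model: minimize the mean of (w - x)^2 over the data;
# the per-example gradient is 2 * (w - x).
def grad_on_shard(w, shard):
    # Average gradient over this replica's per-replica batch.
    return sum(2.0 * (w - x) for x in shard) / len(shard)

shards = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0], [4.0, 5.0]]  # 4 replicas
w = 0.0
lr = 0.1

for step in range(100):
    # Each replica computes a gradient on its own shard...
    grads = [grad_on_shard(w, s) for s in shards]
    # ...all-reduce: average the gradients across replicas...
    g = sum(grads) / len(grads)
    # ...and apply ONE weight update per step, identical on every
    # replica (not one update per replica).
    w -= lr * g

print(round(w, 3))  # converges to 3.0, the mean of all the data
```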
You can also learn about the all-reduce concept in detail through this course.

The screenshot below shows how all-reduce works:



Thank you for your answer. I am doing synchronous training, so all the gradients are aggregated, right? But doesn't that mean the weights are updated just once with the summed gradients, rather than 8 times? Or are the weights updated 8 times with the same gradients?