Global batch training on TPU (TensorFlow)


I recently started a neural network project on Google Colab and found out that I can use a TPU. I have been researching how to use it and discovered TensorFlow's `TPUStrategy` (I am using TensorFlow 2.2.0), and I was able to successfully define a model and run a training step on the TPU.

However, I am not exactly sure what this means. Maybe I haven't read Google's TPU guide thoroughly enough, but what I mean is that I don't know what exactly happens during a training step.

The guide asks you to define a `GLOBAL_BATCH_SIZE`, and the batch size taken by each TPU core is given by `per_replica_batch_size = GLOBAL_BATCH_SIZE / strategy.num_replicas_in_sync`, which means that the per-TPU batch size is smaller than the one you start with. On Colab, `strategy.num_replicas_in_sync = 8`, so if I start with a `GLOBAL_BATCH_SIZE` of 64, the `per_replica_batch_size` is 8.


Now, what I don't understand is: when I compute a training step, does the optimizer compute 8 different steps on batches of size `per_replica_batch_size`, updating the weights of the model 8 different times, or does it just parallelize the computation of the training step this way, ultimately computing a single optimizer step on one batch of size `GLOBAL_BATCH_SIZE`? Thank you.
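The batch-splitting arithmetic the guide describes can be sketched in plain Python (no TPU needed; the shard layout below is only illustrative, since `tf.distribute` handles the actual sharding internally):

```python
GLOBAL_BATCH_SIZE = 64
num_replicas_in_sync = 8  # a Colab TPU exposes 8 cores

# Each core (replica) receives this many examples per step:
per_replica_batch_size = GLOBAL_BATCH_SIZE // num_replicas_in_sync
print(per_replica_batch_size)  # 8

# One global batch is split into one shard per replica:
global_batch = list(range(GLOBAL_BATCH_SIZE))
shards = [global_batch[i::num_replicas_in_sync]
          for i in range(num_replicas_in_sync)]
assert all(len(s) == per_replica_batch_size for s in shards)
```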

This is a good question and is more related to the distribution strategy.

After going through this and this explanation, here is what I can say. Regarding
> the optimizer computes 8 different steps on batches of size
> per_replica_batch_size, updating the weights of the model 8 different
> times
the explanation below should clarify it:

> So, how should the loss be calculated when using a
> tf.distribute.Strategy?
> 
> For an example, let's say you have 4 GPU's and a batch size of 64. One
> batch of input is distributed across the replicas (4 GPUs), each
> replica getting an input of size 16.
> 
> The model on each replica does a forward pass with its respective
> input and calculates the loss. Now, instead of dividing the loss by
> the number of examples in its respective input (BATCH_SIZE_PER_REPLICA
> = 16), the loss should be divided by the GLOBAL_BATCH_SIZE (64).
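The loss-scaling arithmetic in that quote can be checked with a small pure-Python sketch (the per-example losses are made up; the point is that dividing each replica's summed loss by `GLOBAL_BATCH_SIZE`, not by `BATCH_SIZE_PER_REPLICA`, makes the replica contributions add up to the true global mean):

```python
GLOBAL_BATCH_SIZE = 64
NUM_REPLICAS = 4
BATCH_SIZE_PER_REPLICA = GLOBAL_BATCH_SIZE // NUM_REPLICAS  # 16

# Hypothetical per-example losses, 16 per replica:
per_replica_losses = [[0.5 * r] * BATCH_SIZE_PER_REPLICA
                      for r in range(NUM_REPLICAS)]

# Each replica divides its summed loss by the GLOBAL batch size...
scaled = [sum(losses) / GLOBAL_BATCH_SIZE for losses in per_replica_losses]

# ...so summing the replica contributions gives the mean loss over
# the whole global batch of 64 examples:
global_mean = sum(scaled)
all_examples = [l for losses in per_replica_losses for l in losses]
assert abs(global_mean - sum(all_examples) / GLOBAL_BATCH_SIZE) < 1e-12
print(global_mean)  # 0.75
```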
The explanations from the other links are provided below (in case they stop working in the future):

It states the explanation below for this question:

> **Synchronous vs asynchronous training**: These are two common ways of
> distributing training with data parallelism. In sync training, all
> workers train over different slices of input data in sync, and
> **aggregating gradients** at each step. In async training, all workers are
> independently training over the input data and updating variables
> asynchronously. Typically sync training is supported via all-reduce
> and async through parameter server architecture.
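A toy simulation of the synchronous case (pure Python, no TensorFlow; the data and model are invented) may make the aggregation point concrete: each replica computes a gradient on its own shard, the gradients are all-reduced (averaged), and the weights receive one update per training step with that single aggregated gradient:

```python
# Toy model: minimize the mean of (w - x)^2 over the data;
# the per-example gradient is 2 * (w - x).
def grad_on_shard(w, shard):
    # Average gradient over this replica's per-replica batch.
    return sum(2.0 * (w - x) for x in shard) / len(shard)

shards = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0], [4.0, 5.0]]  # 4 replicas
w = 0.0
lr = 0.1

for step in range(100):
    # Each replica computes a gradient on its own shard...
    grads = [grad_on_shard(w, s) for s in shards]
    # ...all-reduce: average the gradients across replicas...
    g = sum(grads) / len(grads)
    # ...and apply ONE weight update per step, identical on every
    # replica (not one update per replica).
    w -= lr * g

print(round(w, 3))  # converges to 3.0, the mean of all the data
```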
You can also learn about the all-reduce concept in detail through this course.

The screenshot below shows how all-reduce works:



Thank you for your answer. I am doing synchronous training, so all the gradients are aggregated, right? But doesn't that mean the weights are updated just once with the summed gradients, rather than 8 times? Or are the weights updated 8 times with the same gradients?