Google Compute Engine TPU suddenly stops training
I am trying to use a TPU in Google Cloud to train a transformer model, following the instructions. Loading the data works fine, and after running
t2t-trainer \
--model=transformer \
--hparams_set=transformer_tpu \
--problem=translate_ende_wmt32k_packed \
--train_steps=500000 \
--eval_steps=3000 \
--data_dir=$DATA_DIR \
--output_dir=$OUT_DIR \
--use_tpu=True \
--cloud_tpu_name=$TPU_NAME
training does start as expected, and the output looks something like this:
I1118 14:48:18.978163 140580835792320 tpu_estimator.py:2307] global_step/sec: 15.2942
INFO:tensorflow:examples/sec: 978.827
I1118 14:48:18.978595 140580835792320 tpu_estimator.py:2308] examples/sec: 978.827
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
I1118 14:48:18.979720 140580835792320 tpu_estimator.py:600] Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
I1118 14:48:18.979935 140580835792320 tpu_estimator.py:604] Dequeue next (100) batch(es) of data from outfeed.
I1118 14:48:24.292932 140577566803712 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-8 in state READY, and health HEALTHY.
W1118 14:48:24.353135 140577566803712 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-8 in state READY, and health HEALTHY.
INFO:tensorflow:loss = 1.8486812, step = 113800 (6.536 sec)
I1118 14:48:25.512768 140580835792320 basic_session_run_hooks.py:260] loss = 1.8486812, step = 113800 (6.536 sec)
INFO:tensorflow:global_step/sec: 15.2986
I1118 14:48:25.514695 140580835792320 tpu_estimator.py:2307] global_step/sec: 15.2986
INFO:tensorflow:examples/sec: 979.11
I1118 14:48:25.515115 140580835792320 tpu_estimator.py:2308] examples/sec: 979.11
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
I1118 14:48:25.516618 140580835792320 tpu_estimator.py:600] Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
I1118 14:48:25.516829 140580835792320 tpu_estimator.py:604] Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Outfeed finished for iteration (388, 47)
I1118 14:48:28.761935 140577575196416 tpu_estimator.py:279] Outfeed finished for iteration (388, 47)
INFO:tensorflow:loss = 1.5237397, step = 113900 (6.573 sec)
I1118 14:48:32.086134 140580835792320 basic_session_run_hooks.py:260] loss = 1.5237397, step = 113900 (6.573 sec)
However, sometimes, after an unpredictable number of iterations (sometimes fewer than 25k, sometimes more than 400k, sometimes never), training suddenly stops. There is no error message, but no further progress is made either. In this case, I get the following output:
I1120 13:40:33.828651 140684764419520 tpu_estimator.py:2307] global_step/sec: 16.3988
INFO:tensorflow:examples/sec: 1049.52
I1120 13:40:33.829339 140684764419520 tpu_estimator.py:2308] examples/sec: 1049.52
INFO:tensorflow:Enqueue next (1000) batch(es) of data to infeed.
I1120 13:40:33.830607 140684764419520 tpu_estimator.py:600] Enqueue next (1000) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1000) batch(es) of data from outfeed.
I1120 13:40:33.830862 140684764419520 tpu_estimator.py:604] Dequeue next (1000) batch(es) of data from outfeed.
INFO:tensorflow:Outfeed finished for iteration (7, 0)
I1120 13:40:34.267921 140681504278272 tpu_estimator.py:279] Outfeed finished for iteration (7, 0)
I1120 13:40:39.989195 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:40:40.056418 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:41:10.124164 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:41:10.177670 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:41:40.259634 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:41:40.309398 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:42:10.377460 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health UNKNOWN.
W1120 13:42:10.431982 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health UNKNOWN.
I1120 13:42:40.508342 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:42:40.567739 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:43:10.638391 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:43:10.694900 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:43:40.763782 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:43:40.810777 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:44:10.889873 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:44:10.942733 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:44:41.011034 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:44:41.066553 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
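One way to notice such a stall automatically is to scan the trainer's log for the `loss = …, step = …` lines and check how long ago the last one appeared. A minimal sketch (the log year and the 10-minute threshold are assumptions, since glog timestamps carry no year and t2t itself provides no such check):

```python
import re
from datetime import datetime, timedelta

# Matches glog-style progress lines such as:
#   I1118 14:48:25.512768 140580835792320 basic_session_run_hooks.py:260] loss = 1.8486812, step = 113800 (6.536 sec)
STEP_RE = re.compile(
    r"^I(\d{4}) (\d{2}:\d{2}:\d{2})\.\d+ .*loss = [\d.]+, step = (\d+)"
)

def last_progress(lines, year=2019):
    """Return (timestamp, step) of the most recent step line, or None."""
    latest = None
    for line in lines:
        m = STEP_RE.match(line)
        if m:
            mmdd, hms, step = m.groups()
            ts = datetime.strptime(f"{year}{mmdd} {hms}", "%Y%m%d %H:%M:%S")
            latest = (ts, int(step))
    return latest

def is_stalled(lines, now, max_gap=timedelta(minutes=10), year=2019):
    """Consider training stalled if no step line landed within max_gap of now."""
    latest = last_progress(lines, year)
    return latest is None or now - latest[0] > max_gap
```

Feeding it the log above would report a stall: the last step line is from 13:40, while the polling-thread lines keep arriving for minutes afterwards.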
Note the reported health UNKNOWN, which may or may not be related to the problem.
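The same state/health fields that TPUPollingThread reports can also be polled by hand; something like this should work (the TPU name and zone are taken from the logs above and may differ in your setup):

```shell
gcloud compute tpus describe tpuv3-5 \
  --zone=us-central1-a \
  --format="value(state,health)"
```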
To resume training, I have to stop the process and run the training command again. It then loads the latest checkpoint and continues training, until it eventually stops again.
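Since re-running the same command resumes from the latest checkpoint, this stop-and-restart workaround can be automated with a small supervisor that kills and relaunches the trainer whenever its output goes quiet for too long. A sketch, not a tested production tool; the stall threshold is an assumption:

```python
import select
import subprocess
import time

def run_with_watchdog(cmd, stall_seconds=600.0):
    """Run cmd, restarting it whenever it prints nothing for stall_seconds.

    Relies on the trainer resuming from the latest checkpoint in its
    output directory when re-run, so a restart just continues training.
    Returns the exit code once the command finishes on its own.
    """
    while True:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
        )
        deadline = time.monotonic() + stall_seconds
        while True:
            # Wait for output, but never past the stall deadline.
            timeout = max(0.0, deadline - time.monotonic())
            ready, _, _ = select.select([proc.stdout], [], [], timeout)
            if ready:
                line = proc.stdout.readline()
                if not line:  # EOF: the process exited by itself
                    return proc.wait()
                print(line.decode(errors="replace"), end="")
                deadline = time.monotonic() + stall_seconds
            elif proc.poll() is not None:  # exited without further output
                return proc.returncode
            else:  # silent for stall_seconds: kill and restart
                proc.kill()
                proc.wait()
                break
```

It could be invoked as `run_with_watchdog(["t2t-trainer", "--model=transformer", ...], stall_seconds=900)`; the select-based polling assumes a POSIX system.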
I run the training command inside a tmux session, so this should not be caused by connection problems between my machine and Google Cloud. In fact, I can close all my windows entirely and attach to the running training session from another computer.
I have already seen this question, but I am using a predefined model, and my bucket is defined in the same region (the TPU is in us-central1-a, the bucket in us-central1).
Edit: In case this is relevant: I am currently on a free one-month trial, which I got by applying for the program. Maybe those cluster nodes are less stable than paid ones.
Edit 2: Could this be related to the GitHub issues ( and )? Note that the issue has been resolved, but the given answer seems unrelated to the problem. I have also checked the file
/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/tpu/preempted_hook.py
in my cloud VM, and both linked changes are already merged.

I ran into the same problem when training on a TFRC TPU. As the warning suggests, there seems to be a problem with the connection between the TPU and Google Cloud, even when the instructions are followed.
I tried several workarounds:
- Delete the gcloud configuration folder: rm -rf ~/.config/gcloud
- Update the gcloud SDK: gcloud components update
- Allow the TPU to access the Cloud Storage bucket via IAM.
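For the last item, assuming the TPU runs under the Cloud TPU service agent, the grant might look like this (the project number, bucket name, and role are placeholders, not values from the question):

```shell
# Grant the Cloud TPU service account read/write access to the bucket.
gsutil iam ch \
  "serviceAccount:service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com:roles/storage.admin" \
  gs://YOUR_BUCKET
```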
The TPU hang still occurs, but less frequently. I hope this helps in your case, or that you find a general solution. This was reported as a bug on GitHub ( , ) and has since been fixed.
If the error still occurs, you should reply on the second GitHub issue. Note that you may need to recreate the TPU; just restarting it may not be enough.

Thanks, I have already tried that. But I am not sure it really helps: one experiment stopped within 30 minutes of starting. Hopefully there will be a general solution soon.
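Recreating the TPU rather than restarting it could look like this; the zone, accelerator type, and TensorFlow version are assumptions and should match your original setup:

```shell
gcloud compute tpus delete $TPU_NAME --zone=us-central1-a --quiet
gcloud compute tpus create $TPU_NAME \
  --zone=us-central1-a \
  --accelerator-type=v3-8 \
  --version=1.15
```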