Panic error on Google Cloud TPU

google-cloud-platform google-compute-engine

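I can open a ctpu session and fetch the code I need from my git repository, but when I run my TensorFlow code from Cloud Shell, I get a message saying that no TPU was found, and my program crashes. Here is the error message I receive: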

adrien_doerig@adrien-doerig:~/capser$ python TPU_playground.py
(unset)
INFO:tensorflow:Querying Tensorflow master () for TPU system metadata.
2018-07-16 09:45:49.951310: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Failed to find TPU: _TPUSystemMetadata(num_cores=0, num_hosts=0, num_of_cores_per_host=0, topology=None, devices=[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456)])
Traceback (most recent call last):
  File "TPU_playground.py", line 79, in <module>
    capser.train(input_fn=train_input_fn_tpu, steps=n_steps)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 363, in train
    hooks.extend(self._convert_train_steps_to_hooks(steps, max_steps))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2068, in _convert_train_steps_to_hooks
    if ctx.is_running_on_cpu():
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_context.py", line 339, in is_running_on_cpu
    self._validate_tpu_configuration()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_context.py", line 525, in _validate_tpu_configuration
    'are {}.'.format(tpu_system_metadata.devices))
RuntimeError: Cannot find any TPU cores in the system. Please double check Tensorflow master address and TPU worker(s). Available devices are [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU

I tried the troubleshooting steps suggested here, but they did not help, because everything looks fine when I run

gcloud compute tpus list

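I also tried creating a brand-new project, and even using a different Google account, but that did not solve the problem. I have not found any similar error reported for Cloud TPUs. Am I missing something obvious?

Thank you for your help!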

OK, I figured it out:

I needed to add a master=... argument to my RunConfig, as follows (line 2 of the code below):

my_tpu_run_config = tpu_config.RunConfig(
    master=TPUClusterResolver(tpu=[os.environ['TPU_NAME']]).get_master(),
    model_dir=FLAGS.model_dir,
    save_checkpoints_secs=FLAGS.save_checkpoints_secs,
    save_summary_steps=FLAGS.save_summary_steps,
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True),
    tpu_config=tpu_config.TPUConfig(iterations_per_loop=FLAGS.iterations, num_shards=FLAGS.num_shards))
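
Since the fix depends on the TPU_NAME environment variable, it may help to check it before building the RunConfig. The following is only a minimal sketch, not part of the original answer: the helper name require_tpu_name is hypothetical, and it assumes the TPU name is exported as TPU_NAME, as in the code above. It uses only the standard library.

```python
import os


def require_tpu_name(environ=os.environ, var="TPU_NAME"):
    """Return the TPU name from the environment, or raise a clear error.

    Hypothetical helper: it only validates the environment variable and
    does not contact any TPU or TensorFlow service.
    """
    name = environ.get(var, "").strip()
    if not name:
        raise RuntimeError(
            "%s is not set or empty; cannot resolve a TPU master address." % var)
    return name
```

The returned name could then be passed to TPUClusterResolver(tpu=[name]) in place of the direct os.environ lookup, so that a missing variable fails with an explicit message.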

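Now, when I type "ctpu status" (which I do in a different shell, where the VM is not running), I still get the panic error, but I can run anything on the Cloud TPUs; that is, the first error message I originally posted no longer appears. So using the master=... argument lets me run my program, but I am still not sure what the panic error means. It is probably not important.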

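The panic in ctpu can be ignored for now. It is caused by a failure to check whether the SchedulingConfig field in the TPU Node object returned by the Cloud TPU REST API is populated (and thus not nil). This is resolved by this PR:

Once this PR is integrated into Google Cloud Shell, the noise will go away.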
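
To illustrate the kind of missing check described above, here is a rough Python analogy. It is not the ctpu code (ctpu is a separate tool and its source is not shown here). The function name is_preemptible is hypothetical, and the JSON field names schedulingConfig and preemptible are assumptions about the shape of the Node object.

```python
def is_preemptible(node):
    """Read a flag from an optional nested field without crashing.

    `node` is assumed to be a dict decoded from a JSON response; the
    "schedulingConfig" key may be absent or null.
    """
    scheduling = node.get("schedulingConfig")
    if scheduling is None:
        # Field not populated: fall back to a default instead of
        # dereferencing a missing object.
        return False
    return bool(scheduling.get("preemptible", False))
```

Skipping the None check and reading the nested field directly would raise an exception whenever the field is absent, which is analogous to the panic reported by ctpu.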
