Python TPU InceptionV3 tensor has NaN values


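I am trying to generate image embeddings using InceptionV3 on a TPU. As base code I am using the experimental InceptionV3 model, but I changed the loss and the way batches are generated. I get this error: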

Traceback (most recent call last):
  File "Model/inception_v3.py", line 879, in <module>
    app.run(main) # starts Abseil app
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "Model/inception_v3.py", line 858, in main
    input_fn=imagenet_train.input_fn, steps=FLAGS.train_steps_per_eval)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2457, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Gradient for InceptionV3/Mixed_5b/Branch_3/Conv2d_0b_1x1/BatchNorm/beta:0 is NaN : Tensor had NaN values
         [[node CheckNumerics_23 (defined at /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]]
我使用的损耗是tensorflow的三重态损耗


Edit: When I run with the flag --use_tpu=False it works, but I can't see any load on the host, although I do see some usage on the TPU.

Did you solve this problem? If so, what exactly was the solution?

@Kevin If I remember correctly, it was related to the image size. I had to upscale the images to match the image size used by the pre-trained model.
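Below is a minimal sketch of the fix described in the last comment. It assumes the pre-trained InceptionV3 checkpoint expects 299x299 inputs, which is the common InceptionV3 resolution; check the image-size setting your training script actually uses. `resize_nearest` and `TARGET_SIZE` are hypothetical names. This NumPy nearest-neighbor version only illustrates the idea of upscaling every image to the expected size before it reaches the model.

```python
import numpy as np

# Assumed target size: InceptionV3 is commonly used with 299x299 inputs.
TARGET_SIZE = (299, 299)


def resize_nearest(image, size=TARGET_SIZE):
    """Nearest-neighbor resize of an H x W x C array to size=(out_h, out_w)."""
    in_h, in_w = image.shape[:2]
    out_h, out_w = size
    # Map each output row/column back to the source row/column it samples.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    # Broadcast the index vectors to pick an (out_h, out_w) grid of pixels.
    return image[rows[:, None], cols[None, :]]
```

Inside a TensorFlow 1.x `input_fn`, `tf.image.resize_images(image, [299, 299])` performs the same upscaling on a tensor (bilinear by default).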