TensorFlow with TPUs on GKE: "Error recorded from infeed: Socket closed"

Occasionally, our GKE-based TPUEstimator training jobs that run on TPUs fail with:

Error recorded from infeed: Socket closed
An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
I have two questions about this:

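  • What is actually happening here? I checked the pods' memory usage and saw no spike. The TPU assigned to the pod also still exists.
  • The job does not always surface the error to the pod. Unless someone manually checks the status and restarts it, the pod keeps showing as Running. Is there a way to make it restart automatically every time?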
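
To make the second question concrete, here is a minimal sketch of the kind of workaround I have in mind, not a confirmed fix. It assumes the training process touches a local heartbeat file whenever it makes progress. The names `run_with_watchdog`, `heartbeat_path`, and `stall_timeout` are hypothetical and not part of TensorFlow or GKE. The wrapper kills a stalled process and returns a non-zero exit code, so that a Kubernetes Job with `restartPolicy: OnFailure` would restart the container.

```python
import os
import subprocess
import time


def run_with_watchdog(cmd, heartbeat_path, stall_timeout, poll_interval=1.0):
    """Run cmd and watch a heartbeat file (hypothetical convention).

    Returns the child's exit code if it exits on its own. If heartbeat_path
    is not modified for more than stall_timeout seconds, the child is killed
    and 1 is returned so the container exits with a failure status.
    """
    proc = subprocess.Popen(cmd)
    last_progress = time.time()  # treat process start as initial progress
    while True:
        code = proc.poll()
        if code is not None:
            return code  # child finished by itself; pass its status through
        try:
            # A newer modification time counts as training progress.
            last_progress = max(last_progress, os.path.getmtime(heartbeat_path))
        except OSError:
            pass  # heartbeat file has not been written yet
        if time.time() - last_progress > stall_timeout:
            proc.kill()  # training appears stuck: stop it
            proc.wait()
            return 1
        time.sleep(poll_interval)
```

The container entrypoint would then call something like `sys.exit(run_with_watchdog(["python", "train.py"], "/tmp/heartbeat", stall_timeout=1800))`, where `train.py` and the path are placeholders. Note that `os.path.getmtime` only works on a local file, so this would not directly observe checkpoints written to a GCS bucket.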