Tensorflow: Object detection - tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error

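I am trying to use the object detection training script (object_detection/train.py) in distributed mode. The script supports distributed mode on Google Cloud. To use it on my own cluster, I set the TF_CONFIG environment variable in the script as follows: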

import json
import os

chief_host = ['host2:2229']
worker_hosts = ['host1:2230']
ps_hosts = ['host1:22231']
cluster = {'master': chief_host,
           'worker': worker_hosts,
           'ps': ps_hosts}
# The job name and task index come from the --job_name and --task_index flags.
os.environ['TF_CONFIG'] = json.dumps({'cluster': cluster,
                                      'task': {'type': FLAGS.job_name, 'index': FLAGS.task_index}})

# Read the configuration back to confirm it parses.
env = json.loads(os.environ.get('TF_CONFIG', '{}'))
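The TF_CONFIG value built above can be sanity-checked before training starts. The following is a minimal sketch and not part of train.py: the helper names `build_tf_config` and `validate_tf_config` are hypothetical, and the addresses are the ones from the snippet above. It only checks that the task type is listed in the cluster and that the task index is in range. It does not test whether the hosts are reachable.

```python
import json


def build_tf_config(cluster, job_name, task_index):
    """Serialize a cluster spec and task description as a TF_CONFIG JSON string."""
    return json.dumps({
        'cluster': cluster,
        'task': {'type': job_name, 'index': task_index},
    })


def validate_tf_config(raw):
    """Parse a TF_CONFIG string and check that the task refers to a listed host.

    Returns the 'host:port' address that this task maps to.
    Raises ValueError if the job name or task index does not match the cluster.
    """
    env = json.loads(raw or '{}')
    cluster = env.get('cluster', {})
    task = env.get('task', {})
    job_name = task.get('type')
    task_index = task.get('index')
    if job_name not in cluster:
        raise ValueError('job %r is not in cluster jobs %s'
                         % (job_name, sorted(cluster)))
    hosts = cluster[job_name]
    if not isinstance(task_index, int) or not 0 <= task_index < len(hosts):
        raise ValueError('task index %r is out of range for job %r'
                         % (task_index, job_name))
    return hosts[task_index]


# Same addresses as in the question.
cluster = {'master': ['host2:2229'],
           'worker': ['host1:2230'],
           'ps': ['host1:22231']}

raw = build_tf_config(cluster, 'worker', 0)
print(validate_tf_config(raw))  # prints host1:2230
```

Running the check once per process, with the same job name and task index that are passed on the command line, would catch a misspelled job name or an index that points past the end of a host list.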
I am running the training script with the following commands:

On host1:
python3 object_detection/train.py --logtostderr --pipeline_config_path=/home/.../faster_rcnn_resnet101_coco.config --train_dir=/home/.../modelOutput5th --job_name="ps" --task_index=0 --clone_on_cpu=true

On host2:
python object_detection/train.py   --logtostderr --pipeline_config_path=/home/.../faster_rcnn_resnet101_coco.config --train_dir=/home/.../modelOutput5th --job_name="master" --task_index=0 --clone_on_cpu=true

On host1:
python3 object_detection/train.py --logtostderr --pipeline_config_path=/home/.../faster_rcnn_resnet101_coco.config --train_dir=/home/.../modelOutput5th --job_name="worker" --task_index=0 --clone_on_cpu=true
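This starts the ps, the master worker, and the other worker fine. Sometimes training goes well and continues in parallel, but sometimes it just fails with the following error on the worker host: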

tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
Traceback (most recent call last):
  File "object_detection/train.py", line 185, in <module>
    tf.app.run()
  File "/home/sarmin/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "object_detection/train.py", line 181, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/..../object_detection/trainer.py", line 377, in train
    saver=saver)
  File "/home/...lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 746, in train
    master, start_standard_services=False, config=session_config) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/.../python2.7/site-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/.../python2.7/site-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
  File "/home/...../training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/./python2.7/site-..packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
    start_standard_services=start_standard_services)
  File "/home/.../python2.7/site-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/home/.../python2.7/site-packages/tensorflow/python/training/session_manager.py", line 281, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/home/.../python2.7/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/.../python2.7/site-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/.../python2.7/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/home/..../python2.7/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
I don't know what the cause is. Please help.
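Since UnavailableError signals that the runtime could not be reached at that moment, one possible diagnostic step is to check from each machine whether every address in the cluster spec accepts TCP connections. This is a diagnostic idea, not a confirmed cause of the failure. The sketch below uses only the Python standard library. The helper names `parse_address`, `is_reachable`, and `check_cluster` are hypothetical, and the 3-second timeout is an arbitrary choice.

```python
import socket


def parse_address(address):
    """Split a 'host:port' string into a (host, port) tuple."""
    host, _, port = address.rpartition(':')
    return host, int(port)


def is_reachable(address, timeout=3.0):
    """Return True if a TCP connection to 'host:port' succeeds within timeout."""
    host, port = parse_address(address)
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
    sock.close()
    return True


def check_cluster(cluster, timeout=3.0):
    """Map every address in a cluster spec dict to its reachability."""
    return {address: is_reachable(address, timeout)
            for hosts in cluster.values()
            for address in hosts}
```

Calling `check_cluster(cluster)` with the `cluster` dict from the script, on host1 and on host2 after the ps, master, and worker processes have started, returns a dict such as `{'host2:2229': True, ...}`. An address that reports `False` from one machine but not the other would point toward a firewall, hostname resolution, or port conflict rather than the training code itself.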