How can I make distributed TensorFlow support failover?


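I created a 4-node TensorFlow cluster with 2 workers and 2 parameter servers (ps). When a worker or a ps fails, I restart it on the same machine with the same configuration. However, it cannot resume from the checkpoint. Does distributed TensorFlow still not support failover?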

The file "model.ckpt-24" exists only on one machine (the one running worker task 0). The traceback is shown below:

Traceback (most recent call last):
File "/dump/9/nm-local-dir/usercache/danrtsey.wy/appcache/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003/app/install/trainer.py", line 107, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/dump/9/nm-local-dir/usercache/danrtsey.wy/appcache/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003/app/install/trainer.py", line 87, in main
with sv.managed_session(server.target) as sess:
File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 969, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 797, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 958, in managed_session
start_standard_services=start_standard_services)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 715, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 227, in prepare_session
config=config)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 173, in _restore_checkpoint
saver.restore(sess, ckpt.model_checkpoint_path)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1345, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 717, in run
run_metadata_ptr)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 915, in _run
feed_dict_string, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _do_run
target_list, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 985, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /dump/6/nm-logs/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003/model.ckpt-24: Not found: /dump/6/nm-logs/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003
[[Node: save/restore_slice_1 = RestoreSlice[dt=DT_FLOAT, preferred_shard=-1, _device="/job:ps/replica:0/task:1/cpu:0"](_recv_save/Const_0_S3, save/restore_slice_1/tensor_name, save/restore_slice_1/shape_and_slice)]]

Caused by op u'save/restore_slice_1', defined at:
File "/dump/9/nm-local-dir/usercache/danrtsey.wy/appcache/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003/app/install/trainer.py", line 107, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/dump/9/nm-local-dir/usercache/danrtsey.wy/appcache/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003/app/install/trainer.py", line 71, in main
saver = tf.train.Saver()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 986, in __init__
self.build()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1015, in build
restore_sequentially=self._restore_sequentially)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 620, in build
restore_sequentially, reshape)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 357, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 270, in restore_op
preferred_shard=preferred_shard))
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/io_ops.py", line 204, in _restore_slice
preferred_shard, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 359, in _restore_slice
preferred_shard=preferred_shard, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on /dump/6/nm-logs/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003/model.ckpt-24: Not found: /dump/6/nm-logs/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003
[[Node: save/restore_slice_1 = RestoreSlice[dt=DT_FLOAT, preferred_shard=-1, _device="/job:ps/replica:0/task:1/cpu:0"](_recv_save/Const_0_S3, save/restore_slice_1/tensor_name, save/restore_slice_1/shape_and_slice)]]
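
To make the traceback easier to read, here is a minimal sketch that pulls two facts out of the last two lines of the error: which task the failing restore op was placed on, and which checkpoint file it tried to open. The helper name `summarize_restore_failure` and the `CONTAINER` constant are hypothetical and only for illustration; the strings themselves are copied from the traceback. It uses only the Python standard library and does not call TensorFlow.

```python
import re

# Directory taken verbatim from the error message above; stored once only to
# keep the example lines short (CONTAINER is a hypothetical name).
CONTAINER = ("/dump/6/nm-logs/application_1477899492621_0004/"
             "container_e07_1477899492621_0004_01_000003")

# The two final lines of the traceback, copied from the question.
ERROR_LINE = (
    "Unsuccessful TensorSliceReader constructor: Failed to get matching "
    "files on " + CONTAINER + "/model.ckpt-24: Not found: " + CONTAINER
)
NODE_LINE = (
    '[[Node: save/restore_slice_1 = RestoreSlice[dt=DT_FLOAT, '
    'preferred_shard=-1, _device="/job:ps/replica:0/task:1/cpu:0"]'
    '(_recv_save/Const_0_S3, save/restore_slice_1/tensor_name, '
    'save/restore_slice_1/shape_and_slice)]]'
)


def summarize_restore_failure(error_line, node_line):
    """Hypothetical helper: report where the restore op ran and what it wanted."""
    # The _device attribute names the job and task the op was placed on.
    device = re.search(r'_device="/job:(\w+)/replica:(\d+)/task:(\d+)/', node_line)
    # The path between "files on" and ": Not found" is the requested checkpoint.
    path = re.search(r"Failed to get matching files on (\S+): Not found", error_line)
    if device is None or path is None:
        return None
    return {
        "job": device.group(1),
        "task": int(device.group(3)),
        "checkpoint": path.group(1),
    }


summary = summarize_restore_failure(ERROR_LINE, NODE_LINE)
print(summary)
```

Read this way, the error says the restore op ran on ps task 1 and looked for `model.ckpt-24` in a container-local log directory, while the question states the file exists only on the machine running worker task 0. Whether placing the checkpoint directory on storage visible to every task resolves this is an assumption to verify, not something the traceback confirms.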