Python 分布式tensorflow保护程序错误
我正在使用一台cpu机器和gpu机器。代码在cpu机器上。在非分布式tensorflow中,tf.train.Saver在cpu机器上运行良好。但是,当我使用gpu机器在分布式tensorflow下运行代码时,它无法保存 保存时,会显示找不到tempstatexxxxx文件。在我向保护程序添加sharded=True之后,它只创建一个检查点文件和一个元文件。检查点文件中列出的文件不存在。那么恢复就不能工作了 我能做什么Python 分布式tensorflow保护程序错误,python,tensorflow,Python,Tensorflow,我正在使用一台cpu机器和gpu机器。代码在cpu机器上。在非分布式tensorflow中,tf.train.Saver在cpu机器上运行良好。但是,当我使用gpu机器在分布式tensorflow下运行代码时,它无法保存 保存时,会显示找不到tempstatexxxxx文件。在我向保护程序添加sharded=True之后,它只创建一个检查点文件和一个元文件。检查点文件中列出的文件不存在。那么恢复就不能工作了 我能做什么 import tensorflow as tf from common im
import tensorflow as tf
from common import tfec2
save_dir = "/tmp"
W = tf.Variable(tf.zeros([784, 10]), name="weights")
b = tf.Variable(tf.zeros([10]), name="bias")
step = tf.Variable(0)
step_length = tf.Variable(1)
saver = tf.train.Saver(tf.all_variables(), sharded=True)
ckpt = tf.train.get_checkpoint_state(save_dir)
init = tf.initialize_all_variables()
with tfec2.TFEc2() as sess: # distributed
# with tf.Session() as sess:
print ckpt
if ckpt:
print("Reading model parameters from %s" % ckpt.model_checkpoint_path)
res_obj = saver.restore(sess, ckpt.model_checkpoint_path)
print res_obj
else:
print("Created model with fresh parameters.")
sess.run(init)
print step.eval(session=sess)
# print sess.run(step)
op_a = tf.add(step, step_length)
op = tf.assign(step, op_a)
sess.run(op)
print step.eval(session=sess)
print("Start Save")
saver.save(sess, save_dir+"/example.ckpt", step)
print("End Save")
一些错误日志
Traceback (most recent call last):
File "example.py", line 78, in <module>
saver.save(sess, save_dir+"/example.ckpt", step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1037, in save
{self.saver_def.filename_tensor_name: checkpoint_file})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 340, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 564, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 637, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 659, in _do_call
e.code)
tensorflow.python.framework.errors.NotFoundError: /tmp/model/example.ckpt-1-00000-of-00001.tempstate3052208956008812953
[[Node: save/save = SaveSlices[T=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT], _device="/job:worker/replica:0/task:0/cpu:0"](save/ShardedFilename, save/save/tensor_names, save/save/shapes_and_slices, Variable, Variable_1, bias_G209, weights_G211)]]
Caused by op u'save/save', defined at:
File "example.py", line 34, in <module>
saver = tf.train.Saver(tf.all_variables(), sharded=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 832, in __init__
restore_sequentially=restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 496, in build
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 218, in _AddShardedSaveOps
sharded_saves.append(self._AddSaveOps(sharded_filename, vars_to_save))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 197, in _AddSaveOps
save = self.save_op(filename_tensor, vars_to_save)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 149, in save_op
tensor_slices=[vs.slice_spec for vs in vars_to_save])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 172, in _save
tensors, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 341, in _save_slices
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 661, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2154, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1154, in __init__
self._traceback = _extract_stack()
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job saver -> {localhost:2223}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {172.31.26.237:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2223
Session Start!!!!!
model_checkpoint_path: "/tmp/model.ckpt-?????-of-00001"
all_model_checkpoint_paths: "/tmp/model.ckpt-?????-of-00001"
Reading model parameters from /tmp/model.ckpt-?????-of-00001
Traceback (most recent call last):
File "example.py", line 65, in <module>
res_obj = saver.restore(sess, ckpt.model_checkpoint_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1088, in restore
raise ValueError("Restore called with invalid save path %s" % save_path)
ValueError: Restore called with invalid save path /tmp/model.ckpt-?????-of-00001
但若cpu机器中并没有val,或者gpu机器中并没有saver obj,它就不能工作
with tf.device("/job:worker"):
W = tf.Variable(tf.zeros([784, 10]), name="weights")
b = tf.Variable(tf.zeros([10]), name="bias")
step = tf.Variable(0)
step_length = tf.Variable(1)
op_a = tf.add(step, step_length)
op = tf.assign(step, op_a)
# saver = tf.train.Saver(tf.all_variables(), sharded=True)
with tf.device("/job:saver"):
# W = tf.Variable(tf.zeros([784, 10]), name="weights")
# b = tf.Variable(tf.zeros([10]), name="bias")
# step = tf.Variable(0)
# step_length = tf.Variable(1)
# op_a = tf.add(step, step_length)
# op = tf.assign(step, op_a)
saver = tf.train.Saver(tf.all_variables(), sharded=True)
或
错误是
Session Start!!!!!
model_checkpoint_path: "/tmp/model/example.ckpt-1-?????-of-00001"
all_model_checkpoint_paths: "/tmp/model/example.ckpt-1-?????-of-00001"
Reading model parameters from /tmp/model/example.ckpt-1-?????-of-00001
Traceback (most recent call last):
File "example.py", line 64, in <module>
res_obj = saver.restore(sess, ckpt.model_checkpoint_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1088, in restore
raise ValueError("Restore called with invalid save path %s" % save_path)
ValueError: Restore called with invalid save path /tmp/model/example.ckpt-1-?????-of-00001
会话开始!!!!!
模型检查点路径:“/tmp/model/example.ckpt-1-??-of-00001”
所有检查点路径:“/tmp/model/example.ckpt-1-??-of-00001”
从/tmp/model/example.ckpt-1-?????-of-00001读取模型参数
回溯(最近一次呼叫最后一次):
文件“example.py”,第64行,在
res\u obj=saver.restore(sess,ckpt.model\u checkpoint\u路径)
文件“/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py”,第1088行,在restore中
raise VALUERROR(“使用无效的存储路径%s”%save\u路径调用还原”)
ValueError:使用无效的保存路径/tmp/model/example调用Restore.ckpt-1-??-of-00001
此示例.ckpt-1-??-of-00001在gpu机器上,在cpu mahcine中只有检查点和.mate文件您可以包括显示的完整错误消息(包括堆栈跟踪)吗?您可以包括显示的完整错误消息(包括堆栈跟踪)吗?
with tf.device("/job:worker"):
# W = tf.Variable(tf.zeros([784, 10]), name="weights")
b = tf.Variable(tf.zeros([10]), name="bias")
step = tf.Variable(0)
step_length = tf.Variable(1)
op_a = tf.add(step, step_length)
op = tf.assign(step, op_a)
saver = tf.train.Saver(tf.all_variables(), sharded=True)
with tf.device("/job:saver"):
W = tf.Variable(tf.zeros([784, 10]), name="weights")
# b = tf.Variable(tf.zeros([10]), name="bias")
# step = tf.Variable(0)
# step_length = tf.Variable(1)
# op_a = tf.add(step, step_length)
# op = tf.assign(step, op_a)
# saver = tf.train.Saver(tf.all_variables(), sharded=True)
Session Start!!!!!
model_checkpoint_path: "/tmp/model/example.ckpt-1-?????-of-00001"
all_model_checkpoint_paths: "/tmp/model/example.ckpt-1-?????-of-00001"
Reading model parameters from /tmp/model/example.ckpt-1-?????-of-00001
Traceback (most recent call last):
File "example.py", line 64, in <module>
res_obj = saver.restore(sess, ckpt.model_checkpoint_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1088, in restore
raise ValueError("Restore called with invalid save path %s" % save_path)
ValueError: Restore called with invalid save path /tmp/model/example.ckpt-1-?????-of-00001