Job fails on Cloud ML after TensorFlow successfully completes 1000 steps
Tags: tensorflow, google-app-engine, google-cloud-platform, tensorflow-serving, google-cloud-ml

I went through the Cloud ML tutorial on the census data: cloud.google.com/ml-engine/docs/how-tos/getting-start-training-prediction, and in that tutorial the job succeeded. However, when I went through the tutorial on the flowers image data, my training job appeared to succeed, judging by the 1000 completed steps in the logs, but after that point the log snapshot reports that the job failed.

I have tried replacing the command-line arguments with the same structure used in the census data walkthrough, deleting and recreating the JOB_ID and the --output_path user argument, and using the STANDARD_1 scale tier, all to no avail. I would greatly appreciate any help from the community. Thanks.

Below is the error, which appears at the end of the log snapshot:
{
textPayload: "The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
run(model, argv)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
dispatch(args, model, cluster, task)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
Trainer(args, model, cluster, task).run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
self.eval(session)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
self.model.format_metric_values(self.evaluator.evaluate()))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 95, in evaluate
return metric_values
File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 234, in _run
sess.run(enqueue_op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
NotFoundError: Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
[[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]
Caused by op u'ReaderReadUpToV2', defined at:
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
run(model, argv)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
dispatch(args, model, cluster, task)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
Trainer(args, model, cluster, task).run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
self.eval(session)
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
self.model.format_metric_values(self.evaluator.evaluate()))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in evaluate
self.eval_batch_size)
File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 310, in build_eval_graph
return self.build_graph(data_paths, batch_size, GraphMod.EVALUATE)
File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
num_epochs=None if is_training else 2)
File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 52, in read_examples
filename_queue, batch_size)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 226, in read_up_to
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 380, in _reader_read_up_to_v2
num_records=num_records, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
[[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]
To find out more about why your job exited please check the logs: console.cloud.google.com/logs/viewer?project=123456234&resource=ml_job%2Fjob_id%2Fflowers_User_20170524_145125&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22flowers_User_20170524_145125%22"
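Two details in the traceback above may be worth checking before rerunning the job. The sketch below is only a reading aid, not a fix: it pulls the failing `gs://` path and the job ID out of the log text and compares their timestamps. The helper names (`failing_gcs_path`, `job_id_from_log`, `timestamps`) and the `LOG_EXCERPT` constant are hypothetical and exist only for this illustration; the excerpt contains just the three log fragments that carry the identifiers.

```python
import re

# Excerpt of the textPayload above: only the fragments that carry the identifiers.
LOG_EXCERPT = """
NotFoundError: Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
resource.labels.job_id%3D%22flowers_User_20170524_145125%22
"""


def failing_gcs_path(log_text):
    """Return the first gs:// path that follows 'when reading', or None."""
    match = re.search(r"when reading\s+(gs://\S+)", log_text)
    return match.group(1) if match else None


def job_id_from_log(log_text):
    """Return the job ID from the URL-encoded log-viewer filter, or None."""
    match = re.search(r"job_id%3D%22([^%]+)%22", log_text)
    return match.group(1) if match else None


def timestamps(text):
    """Return every YYYYMMDD_HHMMSS stamp found in the given text."""
    return re.findall(r"\d{8}_\d{6}", text)


path = failing_gcs_path(LOG_EXCERPT)
job_id = job_id_from_log(LOG_EXCERPT)

print("failing path:", path)
print("job id:      ", job_id)
print("path stamp:  ", timestamps(path))
print("job stamp:   ", timestamps(job_id))
print("path ends in a bare directory-like name:", path.endswith("/preproc/eval"))
```

The path that returned the 404 carries the stamp 20170522_121407, while the failing job is flowers_User_20170524_145125, and the path ends in `preproc/eval` with no file name or wildcard. The log alone does not say which of these matters. Listing the bucket, for example with `gsutil ls` on the `preproc/` prefix, would show whether any object with exactly that name exists and what the preprocessed evaluation files are actually called.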