
TensorFlow: job fails on Cloud ML after successfully completing 1000 iterations


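I went through the Cloud ML tutorial on the census data: cloud.google.com/ml-engine/docs/how-tos/getting-start-training-prediction, and the job succeeded there. However, when I went through the tutorial on the flowers image data, my training job appeared to succeed, judging by the completion of 1000 steps in the logs, yet after that point the job is reported as failed, as the log snapshot shows. I tried replacing the command-line arguments with the same structure used in the census walkthrough, deleted and recreated the JOB_ID and the --output_path user argument, and used the STANDARD_1 scale tier, all to no avail. I would appreciate any help from the community. Thanks.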

Below is the error that appears at the end of the log snapshot:

{
 textPayload: "The replica master 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
    self.eval(session)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
    self.model.format_metric_values(self.evaluator.evaluate()))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 95, in evaluate
    return metric_values
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 234, in _run
    sess.run(enqueue_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
NotFoundError: Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
     when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
     [[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]
Caused by op u'ReaderReadUpToV2', defined at:
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
    self.eval(session)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
    self.model.format_metric_values(self.evaluator.evaluate()))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in evaluate
    self.eval_batch_size)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 310, in build_eval_graph
    return self.build_graph(data_paths, batch_size, GraphMod.EVALUATE)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
    num_epochs=None if is_training else 2)
  File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 52, in read_examples
    filename_queue, batch_size)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 226, in read_up_to
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 380, in _reader_read_up_to_v2
    num_records=num_records, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()
NotFoundError (see above for traceback): Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
     when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
     [[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]
To find out more about why your job exited please check the logs: console.cloud.google.com/logs/viewer?project=123456234&resource=ml_job%2Fjob_id%2Fflowers_User_20170524_145125&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22flowers_User_20170524_145125%22"
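The traceback names the exact Cloud Storage path the reader could not open, and the job id in the log URL carries a different timestamp from the one embedded in that path. The following is a minimal sketch, not part of the tutorial code, that pulls both values out of the log text so they can be compared. The helper names are hypothetical, and a mismatch is only a hint that the evaluation path may point at output from a different run, not a confirmed cause.

```python
import re

# Hypothetical helpers for illustration; they only parse text copied from the log.
GCS_PATH_RE = re.compile(r"when reading (gs://\S+)")
STAMP_RE = re.compile(r"\d{8}_\d{6}")


def gcs_path_from_error(log_text):
    """Return the first gs:// path named in a 'when reading ...' line, or None."""
    match = GCS_PATH_RE.search(log_text)
    return match.group(1) if match else None


def run_stamp(text):
    """Return the first YYYYMMDD_HHMMSS stamp found in text, or None."""
    match = STAMP_RE.search(text)
    return match.group(0) if match else None


# Both strings are copied from the log above.
error_text = (
    "NotFoundError: Error executing an HTTP request (HTTP response code 404, "
    "error code 0, error message '')\n"
    "     when reading gs://project-166422-ml/User/"
    "flowers_User_20170522_121407/preproc/eval\n"
)
job_id = "flowers_User_20170524_145125"

missing_path = gcs_path_from_error(error_text)
path_stamp = run_stamp(missing_path)
job_stamp = run_stamp(job_id)

print("path the reader could not open:", missing_path)
print("stamp in path:", path_stamp, "| stamp in job id:", job_stamp)
print("stamps match:", path_stamp == job_stamp)
```

Once the path is extracted, listing it with `gsutil ls` shows whether any object actually exists at or under that name.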
gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval