Python TensorFlow Cloud ML object detection - distributed training error

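I am trying to train my own model by following the TensorFlow object detection tutorial, and the code I am using is exactly the same as before.

I made a few changes to the tutorial's setup, most notably using runtime 1.5 instead of the 1.2 the tutorial specifies. When I run the job on Google Cloud ML there is no obvious error (none that I can see), but the job exits quickly without doing any training.

Here is the command I use to launch the training job: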

gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://test-bucket/training/ \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ./config.yaml \
    -- \
    --train_dir=gs://test-bucket/data/ \
    --pipeline_config_path=gs://test-bucket/configs/ssd_inception_v2_coco.config

Here is my config.yaml:

trainingInput:
  runtimeVersion: "1.5"
  scaleTier: CUSTOM
  masterType: complex_model_l
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: large_model

Finally, my job log ends with:

I  worker-replica-6 Clean up finished.  worker-replica-6
I  worker-replica-7 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.  worker-replica-7
I  worker-replica-7 Module completed; cleaning up.  worker-replica-7
I  worker-replica-7 Clean up finished.  worker-replica-7
I  worker-replica-8 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.  worker-replica-8
I  worker-replica-8 Module completed; cleaning up.  worker-replica-8
I  worker-replica-8 Clean up finished.  worker-replica-8
I  worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-1
I  worker-replica-1 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.  worker-replica-1
I  worker-replica-1 Module completed; cleaning up.  worker-replica-1
I  worker-replica-1 Clean up finished.  worker-replica-1
I  worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-7
I  worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-8
I  worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-6
I  worker-replica-3 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-3
I  worker-replica-0 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-0
I  worker-replica-2 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-2
I  worker-replica-5 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-5
I  worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-1
I  worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-7
I  worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-8
I  worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-6
I  worker-replica-3 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-3
I  worker-replica-0 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-0
I  worker-replica-2 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-2
I  worker-replica-5 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-5
I  worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-1
I  worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-7
I  worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-8
I  worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-6
I  Finished tearing down TensorFlow. 
I  Job failed.

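As I mentioned, I can't get anything useful out of the logs. A bit further back I found the error "Master init: Unavailable: Stream removed", but I'm not sure how to deal with it. Thanks for any push in the right direction.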

I reproduced your problem and fixed it as follows:

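roysheffi commented on this issue 3 months ago: Hi @pkulzc, I think I may have a lead:

On line 357, object_detection/trainer.py calls tf.contrib.slim.learning.train(), which uses the deprecated tf.train.Supervisor and should be migrated to tf.train.MonitoredTrainingSession, as noted in the tf.train.Supervisor documentation.

This has already been requested in tensorflow/tensorflow#15793, and it is reported as a solution for tensorflow/tensorflow#17852 and in a comment on yahoo/TensorFlowOnSpark#245.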

So, in the end, I made the following change inside trainer.py:

  • tf.train.MonitoredTrainingSession(
    instead of
    slim.learning.train(
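
The answer above does not show the full replacement call, so what follows is only a minimal sketch of the general pattern, not the actual change made to trainer.py. It assumes a toy one-variable model in place of the detection graph; the function name run_training and its parameters are hypothetical. In the real trainer, master and is_chief would come from the arguments trainer.train() already receives.

```python
import tensorflow as tf

# On TF 1.x runtimes older than 1.13 there is no tf.compat.v1, so fall
# back to the top-level module, which exposes the same 1.x API.
tf1 = getattr(tf.compat, "v1", tf)


def run_training(master="", is_chief=True, checkpoint_dir=None, num_steps=5):
    """Runs a toy training loop and returns how many steps were executed.

    Hypothetical helper: the tiny quadratic loss stands in for the
    object detection model, which is not reproduced here.
    """
    with tf.Graph().as_default():
        global_step = tf1.train.get_or_create_global_step()
        w = tf1.get_variable("w", shape=[], initializer=tf1.zeros_initializer())
        loss = tf.square(w - 3.0)
        train_op = tf1.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)

        # The hook requests a stop once global_step reaches num_steps.
        hooks = [tf1.train.StopAtStepHook(last_step=num_steps)]

        steps_run = 0
        # MonitoredTrainingSession handles variable initialization,
        # checkpoint restore and chief/worker coordination, which is the
        # job tf.train.Supervisor did inside slim.learning.train().
        with tf1.train.MonitoredTrainingSession(
                master=master,
                is_chief=is_chief,
                checkpoint_dir=checkpoint_dir,
                hooks=hooks) as sess:
            while not sess.should_stop():
                sess.run(train_op)
                steps_run += 1
        return steps_run


if __name__ == "__main__":
    print("steps executed:", run_training())
```

With the defaults (master="" and is_chief=True) this runs in a single local process, which is a convenient way to check the loop before submitting a distributed job.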