On distributed TensorFlow (1.0.1), the chief worker hangs at the end of training when using sync replicas and MonitoredTrainingSession


Need help understanding what I am missing. Also, let me know if you need more information.

Thanks in advance.

ClusterConfig:

Number of PS: 2
Number of workers: 2

Output:

INFO:train_opt:Sync Replica Optimizer Enabled...  
INFO:train_opt:[1] Training begins @ 1493747578.942078  
INFO:train_opt:[1] worker/0 1493747581.577683: training step 0 done with Loss 3476.279060  
INFO:train_opt:[1] worker/0 1493747584.819320: training step 200 done with Loss 220.282581  
INFO:train_opt:[1] worker/0 1493747587.935895: training step 400 done with Loss 38.253779  
INFO:train_opt:[1] worker/0 1493747590.975302: training step 600 done with Loss 20.162405  <=== Hangs by end of training  
INFO:train_opt:Using Train Optimizer: Adam  
INFO:train_opt:Sync Replica Optimizer Enabled...  
INFO:train_opt:[1] Training begins @ 1493747578.956051  
INFO:train_opt:[1] worker/1 1493747581.531765: training step 0 done with Loss 3476.279060  
INFO:train_opt:[1] worker/1 1493747585.027504: training step 200 done with Loss 196.834690  
INFO:train_opt:[1] worker/1 1493747588.469242: training step 400 done with Loss 31.045701  
INFO:train_opt:[1] worker/1 1493747591.898919: training step 600 done with Loss 16.355974  
INFO:train_opt:[1] Training ends @ 1493747612.044738  
INFO:train_opt:[1] Training elapsed time: 33.088687 s  
INFO:train_opt:FINAL Training Loss:11.364212  <==== Training completed on this worker!!  
Code:

# gpu for each worker in the corresponding machine
    gpu = (FLAGS.task_index % FLAGS.num_gpus)
    worker_device = "/job:worker/task:%d/gpu:%d" % (FLAGS.task_index, gpu)
elif FLAGS.num_gpus == 0:
    # Just allocate the CPU to the worker server
    cpu = 0
    worker_device = "/job:worker/task:%d/cpu:%d" % (FLAGS.task_index, cpu)

# The device setter will automatically place Variable ops on separate
# parameter servers (ps). The non-Variable ops will be placed on the workers.
# The ps use CPU and the workers use the corresponding GPU.
with tf.device(tf.train.replica_device_setter(
        worker_device=worker_device,
        ps_device="/job:ps/cpu:0",
        cluster=cluster)):
    # ... build the regression model
    loss = ...
    opt = tf.train.AdamOptimizer(learning_rate=0.01)

    # Between-graph replication. Train synchronously if enabled.
    if FLAGS.sync_replicas == True:
        worker_spec = FLAGS.worker_hosts.split(",")
        # Get the number of workers.
        num_workers = len(worker_spec)
        if FLAGS.replicas_to_aggregate is None:
            replicas_to_aggregate = num_workers
        else:
            replicas_to_aggregate = FLAGS.replicas_to_aggregate
        opt = tf.train.SyncReplicasOptimizer(
            opt,
            replicas_to_aggregate=replicas_to_aggregate,
            total_num_replicas=num_workers,
            name="nn_sync_replicas")

    train_step = opt.minimize(loss, global_step=global_step)

    if FLAGS.sync_replicas == True:
        # You can create the hook which handles initialization and queues.
        sync_replicas_hook = opt.make_session_run_hook(is_chief=is_chief, num_tokens=num_workers)
        hooks = [sync_replicas_hook, tf.train.StopAtStepHook(last_step=1000)]
    else:
        hooks = [tf.train.StopAtStepHook(last_step=1000)]

    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing
    # when done or when an error occurs.
    with tf.train.MonitoredTrainingSession(master=server.target, is_chief=is_chief, hooks=hooks, config=sess_config) as sess:
        while not sess.should_stop():
            # Run the distributed TensorFlow session to compute the loss
            _, loss_val = sess.run(
                [train_step, loss],
                feed_dict={self.input_features: X_train.transpose(),
                           self.target_output: Y_train})

Only the chief worker updates the variables, through the chief queue runner, but it should use the gradients from all available workers. The chief waits until enough gradients have been collected, so not necessarily from all of the workers.

With replicas_to_aggregate = num_workers, the chief will wait for gradients from all of the workers.
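This waiting behaviour can be sketched without TensorFlow. Below is a minimal plain-Python model, where the hypothetical `chief_aggregate` helper stands in for SyncReplicasOptimizer's internal gradient accumulator; the real chief blocks indefinitely, but here a timeout makes the "hang" observable as a `None` return:

```python
import queue

def chief_aggregate(grad_queue, replicas_to_aggregate, timeout=0.5):
    """Collect gradients until `replicas_to_aggregate` have arrived.

    The real chief blocks forever; we time out so the hang is
    observable as a None return instead.
    """
    grads = []
    try:
        while len(grads) < replicas_to_aggregate:
            grads.append(grad_queue.get(timeout=timeout))
    except queue.Empty:
        return None  # still waiting for more gradients -> the hang
    return sum(grads) / len(grads)

# worker_1 delivers its final gradient and exits; the chief (worker_0)
# still needs replicas_to_aggregate = 2 gradients to finish its step.
q = queue.Queue()
q.put(1.0)
assert chief_aggregate(q, replicas_to_aggregate=2) is None  # chief stalls

# With replicas_to_aggregate = 1 the same step completes immediately.
q = queue.Queue()
q.put(1.0)
assert chief_aggregate(q, replicas_to_aggregate=1) == 1.0
```

The point of the sketch is only the blocking condition: once any worker stops contributing gradients before the chief reaches its own last step, the chief's aggregation can never be satisfied.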

In your case, when worker_1's training completes, worker_0 (the chief) hangs waiting for gradients from worker_1.


You may be able to work around this by setting replicas_to_aggregate = 1. But I am not sure whether all gradients from all workers will still be aggregated while all of the processes are running.
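Concretely, the workaround changes a single argument in the questioner's SyncReplicasOptimizer construction (a fragment, reusing `opt` and `num_workers` from the code above, so it is not runnable on its own):

```python
opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=1,      # apply an update as soon as any one worker reports
    total_num_replicas=num_workers,
    name="nn_sync_replicas")
```

The trade-off, per the caveat above, is that consecutive updates may then be built from whichever worker happens to report first rather than from one gradient per worker.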

Will any good Samaritan please comment on what is wrong with the code? Why does the chief worker hang before the end of training? This happens only in the between-graph replication case; with in-graph replication it works fine.