
Python TensorFlow distributed: CreateSession still waiting for response from worker: /job:ps/replica:0/task:0


I am trying to run my first distributed training example with TF. I used the example from the TF documentation, with one ps and one worker, each on a different cluster. However, on the worker cluster I always get
CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
as shown below.

trainer.py

import argparse
import sys

import tensorflow as tf

FLAGS = None


def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # input images
      with tf.name_scope('input'):
        # None -> batch size can be any size, 784 -> flattened mnist image
        x = tf.placeholder(tf.float32, shape=[None, 784], name="x-input")
        # target 10 output classes
        y_ = tf.placeholder(tf.float32, shape=[None, 10], name="y-input")

      # model parameters will change during training so we use tf.Variable
      tf.set_random_seed(1)
      with tf.name_scope("weights"):
        W1 = tf.Variable(tf.random_normal([784, 100]))
        W2 = tf.Variable(tf.random_normal([100, 10]))

      # bias
      with tf.name_scope("biases"):
        b1 = tf.Variable(tf.zeros([100]))
        b2 = tf.Variable(tf.zeros([10]))

      # implement model
      with tf.name_scope("softmax"):
        # y is our prediction
        z2 = tf.add(tf.matmul(x,W1),b1)
        a2 = tf.nn.sigmoid(z2)
        z3 = tf.add(tf.matmul(a2,W2),b2)
        y  = tf.nn.softmax(z3)

      # specify cost function
      with tf.name_scope('cross_entropy'):
        # this is our cost
        loss = tf.reduce_mean(
                  -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

      global_step = tf.contrib.framework.get_or_create_global_step()

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

    # The StopAtStepHook handles stopping after running given steps.
    hooks=[tf.train.StopAtStepHook(last_step=1000000)]

    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing when done
    # or an error occurs.
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(FLAGS.task_index == 0),
                                           checkpoint_dir="/tmp/train_logs",
                                           hooks=hooks) as mon_sess:
      while not mon_sess.should_stop():
        # Run a training step asynchronously.
        # See tf.train.SyncReplicasOptimizer for additional details on how to
        # perform *synchronous* training.
        # mon_sess.run handles AbortedError in case of preempted PS.
        mon_sess.run(train_op)


if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.register("type", "bool", lambda v: v.lower() == "true")
  # Flags for defining the tf.train.ClusterSpec
  parser.add_argument(
      "--ps_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--worker_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--job_name",
      type=str,
      default="",
      help="One of 'ps', 'worker'"
  )
  # Flags for defining the tf.train.Server
  parser.add_argument(
      "--task_index",
      type=int,
      default=0,
      help="Index of task within the job"
  )
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
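
For what it's worth, one sanity check I can think of (my own sketch, not part of the docs example) is to print the cluster spec each process resolves from its flags, since both sides must agree on the host:port lists:

# Hypothetical debugging helper: print the resolved cluster spec so the
# ps and worker invocations can be compared by eye.
def dump_cluster_spec(ps_hosts, worker_hosts):
  cluster = tf.train.ClusterSpec({"ps": ps_hosts.split(","),
                                  "worker": worker_hosts.split(",")})
  # as_dict() returns {"ps": [...], "worker": [...]}; both processes
  # must see identical lists for the gRPC channels to line up.
  print(cluster.as_dict())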
ps cluster

$ python trainer.py --ps_hosts=<ps-ipaddress>:2222 --worker_hosts=<worker-ipaddress>:2222 --job_name=ps --task_index=0
2018-07-06 21:52:34.495508: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-06 21:52:34.802537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:05:00.0
totalMemory: 11.17GiB freeMemory: 6.98GiB
2018-07-06 21:52:35.129511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:06:00.0
totalMemory: 11.17GiB freeMemory: 6.98GiB
2018-07-06 21:52:35.130066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1
2018-07-06 21:52:36.251900: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-06 21:52:36.252045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 
2018-07-06 21:52:36.252058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y 
2018-07-06 21:52:36.252067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N 
2018-07-06 21:52:36.252770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:0 with 6754 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)
2018-07-06 21:52:36.357351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:1 with 6754 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:06:00.0, compute capability: 3.7)
2018-07-06 21:52:36.468733: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-06 21:52:36.468788: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job target -> {0 -> <ps-ipaddress>:2222}
2018-07-06 21:52:36.468801: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> <worker-ipaddress>:2222}
2018-07-06 21:52:36.506840: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332] Started server with target: grpc://localhost:2222
worker cluster

$ python trainer.py --ps_hosts=<ps-ipaddress>:2222 --worker_hosts=<worker-ipaddress>:2222 --job_name=worker --task_index=0
2018-07-06 21:55:13.276064: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-06 21:55:17.948796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:05:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-07-06 21:55:18.082286: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:06:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-07-06 21:55:18.082538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1
2018-07-06 21:55:18.591166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-06 21:55:18.591218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 
2018-07-06 21:55:18.591227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y 
2018-07-06 21:55:18.591232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N 
2018-07-06 21:55:18.591751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)
2018-07-06 21:55:18.696213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:1 with 10764 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:06:00.0, compute capability: 3.7)
2018-07-06 21:55:18.801080: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> <ps-ipaddress>:2222}
2018-07-06 21:55:18.801134: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222}
2018-07-06 21:55:18.809115: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332] Started server with target: grpc://localhost:2222
WARNING:tensorflow:From mnist_distributed.py:62: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2018-07-06 21:55:31.532416: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-07-06 21:55:41.532559: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
Both <ps-ipaddress> and <worker-ipaddress> are replaced with the actual addresses. I'm not sure whether this matters, but these addresses are on remote clusters.
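
To check basic reachability from the worker host to the ps task, a minimal sketch using plain Python sockets (my own addition; <ps-ipaddress> is a placeholder for the real address):

import socket

def can_connect(host, port, timeout=5.0):
  # Returns True if a plain TCP connection to host:port succeeds,
  # which the gRPC channel between worker and ps also requires.
  try:
    sock = socket.create_connection((host, port), timeout=timeout)
    sock.close()
    return True
  except (socket.error, socket.timeout):
    return False

print(can_connect("<ps-ipaddress>", 2222))  # placeholder address

If this prints False, the two machines cannot reach each other on that port at all, independent of TensorFlow.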