How to run the distributed TensorFlow MNIST example

I'm new to distributed TensorFlow. I found this distributed MNIST test here:

But I don't know how to get it to run. I used the following script:

  python distributed_mnist.py  --num_workers=3 --num_parameter_servers=1 --worker_index=0 --worker_grpc_url="grpc://tf-worker0:2222"\
  & python distributed_mnist.py  --num_workers=3 --num_parameter_servers=1 --worker_index=1 --worker_grpc_url="grpc://tf-worker1:2222"\
  & python distributed_mnist.py  --num_workers=3 --num_parameter_servers=1 --worker_index=2 --worker_grpc_url="grpc://tf-worker2:2222"
I just noticed that these flags were missing, so I passed them to the program. Here is what happened:

I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
Worker GRPC URL: grpc://tf-worker0:2222
Worker index = 0
Number of workers = 3
Worker GRPC URL: grpc://tf-worker2:2222
Worker index = 2
Number of workers = 3
Worker GRPC URL: grpc://tf-worker1:2222
Worker index = 1
Number of workers = 3
Worker 0: Initializing session...
Worker 2: Waiting for session to be initialized...
Worker 1: Waiting for session to be initialized...
E0608 20:37:13.514249023    7501 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:13.514287961    7501 dns_resolver.c:189]         dns resolution failed: retrying in 15 seconds
E0608 20:37:13.548052986    7502 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:13.548091527    7502 dns_resolver.c:189]         dns resolution failed: retrying in 15 seconds
E0608 20:37:13.555449386    7503 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:13.555473898    7503 dns_resolver.c:189]         dns resolution failed: retrying in 15 seconds
^CE0608 20:37:28.517451603    7504 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:28.517491102    7504 dns_resolver.c:189]         dns resolution failed: retrying in 15 seconds
E0608 20:37:28.551002331    7505 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:28.551029795    7505 dns_resolver.c:189]         dns resolution failed: retrying in 15 seconds
E0608 20:37:28.556681378    7506 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:28.556709728    7506 dns_resolver.c:189]         dns resolution failed: retrying in 15 seconds
Does anyone know how to run it correctly? Thanks a lot!

The value of the --worker_grpc_url flag on your command line refers to an address that does not exist.
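
That is exactly what the repeated "getaddrinfo: Name or service not known" errors in the log are telling you: outside the intended environment those hostnames simply don't resolve. A quick way to confirm this (assuming a Linux shell; the command is my illustration, not part of the original answer):

  getent hosts tf-worker0   # prints nothing and exits non-zero: the name does not resolve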

This was designed to run in a particular Kubernetes environment, rather than standalone. In particular, tf-worker0:2222, tf-worker1:2222, and tf-worker2:2222 refer to the names of Kubernetes containers that are created by an automated version of this test. Running it as a standalone test would require considerable changes.

The documentation for distributed TensorFlow includes a template for a trainer program. The easiest way to try out MNIST on distributed TensorFlow is to paste the model into that template. For example, something like the following should work:

import math
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")

# Flags for defining the tf.train.Server
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
tf.app.flags.DEFINE_integer("hidden_units", 100,
                            "Number of units in the hidden layer of the NN")
tf.app.flags.DEFINE_string("data_dir", "/tmp/mnist-data",
                           "Directory for storing mnist data")
tf.app.flags.DEFINE_integer("batch_size", 100, "Training batch size")

FLAGS = tf.app.flags.FLAGS

IMAGE_PIXELS = 28

def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Variables of the hidden layer
      hid_w = tf.Variable(
          tf.truncated_normal([IMAGE_PIXELS * IMAGE_PIXELS, FLAGS.hidden_units],
                              stddev=1.0 / IMAGE_PIXELS), name="hid_w")
      hid_b = tf.Variable(tf.zeros([FLAGS.hidden_units]), name="hid_b")

      # Variables of the softmax layer
      sm_w = tf.Variable(
          tf.truncated_normal([FLAGS.hidden_units, 10],
                              stddev=1.0 / math.sqrt(FLAGS.hidden_units)),
          name="sm_w")
      sm_b = tf.Variable(tf.zeros([10]), name="sm_b")

      x = tf.placeholder(tf.float32, [None, IMAGE_PIXELS * IMAGE_PIXELS])
      y_ = tf.placeholder(tf.float32, [None, 10])

      hid_lin = tf.nn.xw_plus_b(x, hid_w, hid_b)
      hid = tf.nn.relu(hid_lin)

      y = tf.nn.softmax(tf.nn.xw_plus_b(hid, sm_w, sm_b))
      loss = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))

      global_step = tf.Variable(0)

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

      saver = tf.train.Saver()
      summary_op = tf.summary.merge_all()
      init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir="/tmp/train_logs",
                             init_op=init_op,
                             summary_op=summary_op,
                             saver=saver,
                             global_step=global_step,
                             save_model_secs=600)

    mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)

    # The supervisor takes care of session initialization, restoring from a
    # checkpoint, and closing when done or an error occurs.
    with sv.managed_session(server.target) as sess:
      # Loop until the supervisor shuts down or 1000000 steps have completed.
      step = 0
      while not sv.should_stop() and step < 1000000:
        # Run a training step asynchronously.
        # See `tf.train.SyncReplicasOptimizer` for additional details on how to
        # perform *synchronous* training.
        batch_xs, batch_ys = mnist.train.next_batch(FLAGS.batch_size)
        train_feed = {x: batch_xs, y_: batch_ys}

        _, step = sess.run([train_op, global_step], feed_dict=train_feed)
        if step % 100 == 0:
          print("Done step %d" % step)

    # Ask for all the services to stop.
    sv.stop()

if __name__ == "__main__":
  tf.app.run()
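
To make the usage concrete (my own sketch; the hostnames, ports, and the trainer.py filename are hypothetical): with the script saved as trainer.py, you could start one parameter server and two workers on a single machine like this:

  python trainer.py --ps_hosts=localhost:2222 --worker_hosts=localhost:2223,localhost:2224 \
      --job_name=ps --task_index=0 &
  python trainer.py --ps_hosts=localhost:2222 --worker_hosts=localhost:2223,localhost:2224 \
      --job_name=worker --task_index=0 &
  python trainer.py --ps_hosts=localhost:2222 --worker_hosts=localhost:2223,localhost:2224 \
      --job_name=worker --task_index=1

Every process receives the same ps_hosts/worker_hosts lists, so they all build the same tf.train.ClusterSpec; only --job_name and --task_index differ per process.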

Thank you so much for the explanation! I'll give it a try. Thanks a lot!

Just adding my 2 cents to your great post: tf.merge_all_summaries() seems to be deprecated, and using tf.merge_all or tf.contrib.deprecated.merge_all_summaries in the latest version raises an error.

@sunilmanikani Thanks for pointing that out... I've updated the code to use tf.summary.merge_all().

FYI, tf.train.Supervisor is now deprecated; tf.train.MonitoredTrainingSession should be used instead (see the sketch after these comments).

My understanding of the code is that every worker reads the full input data; the input data is not partitioned across workers. Is my understanding correct?
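
Following up on the deprecation comment above, here is a minimal sketch (my own, not from the original answer) of the worker training loop rewritten with tf.train.MonitoredTrainingSession in TF 1.x. It assumes x, y_, loss, mnist, server, and FLAGS are defined as in the code above:

# MonitoredTrainingSession's built-in hooks expect the "official" global step,
# so create it with the helper instead of a plain tf.Variable(0).
global_step = tf.train.get_or_create_global_step()
train_op = tf.train.AdagradOptimizer(0.01).minimize(
    loss, global_step=global_step)

# No explicit init_op/saver/summary_op is needed: MonitoredTrainingSession
# initializes variables, restores from and saves checkpoints, and writes
# summaries itself.
with tf.train.MonitoredTrainingSession(
    master=server.target,
    is_chief=(FLAGS.task_index == 0),
    checkpoint_dir="/tmp/train_logs",
    save_checkpoint_secs=600) as sess:
  step = 0
  while not sess.should_stop() and step < 1000000:
    batch_xs, batch_ys = mnist.train.next_batch(FLAGS.batch_size)
    _, step = sess.run([train_op, global_step],
                       feed_dict={x: batch_xs, y_: batch_ys})
    if step % 100 == 0:
      print("Done step %d" % step)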