Python 2.7: creating a session hangs on distributed TensorFlow
I am working through the distributed TensorFlow tutorial. Instead of launching the scripts in separate command shells, I tried to use Python multiprocessing processes. Unfortunately, the code hangs at the stage of opening the session. Any comments are welcome. I prepared a very simple code example that basically just starts several parallel processes:
import tensorflow as tf
import time
from multiprocessing import Process

N_WORKERS = 3
SPEC = {'ps': ['127.0.0.1:12222'],
        'worker': ['127.0.0.1:12223', '127.0.0.1:12224', '127.0.0.1:12225']}

def run_ps_server():
    spec = tf.train.ClusterSpec(SPEC)
    ps_server = tf.train.Server(spec, job_name='ps', task_index=0)
    ps_server.join()

def run_worker(task):
    spec = tf.train.ClusterSpec(SPEC)
    server = tf.train.Server(spec, job_name='worker', task_index=task)
    with tf.device(tf.train.replica_device_setter(1, worker_device="/job:worker/task:%d" % task)):
        global_step = tf.get_variable('global_step', [],
                                      initializer=tf.constant_initializer(0),
                                      trainable=False)
        inc_global_step = tf.assign_add(global_step, 1)
        init_op = tf.global_variables_initializer()
    sv = tf.train.Supervisor(is_chief=(task == 0),
                             global_step=global_step,
                             init_op=init_op)
    config = tf.ConfigProto(device_filters=["/job:ps", "/job:worker/task:{}/cpu:0".format(task)])
    with sv.managed_session(server.target, config=config) as sess, sess.as_default():
        print 'task {}, global_step {}'.format(task, sess.run(global_step))
        if task == 0:
            sess.run(inc_global_step)
        elif task == 1:
            sess.run(inc_global_step)
            sess.run(inc_global_step)
        print 'task {}, global_step {}'.format(task, sess.run(global_step))
        if task == 2:
            sv.stop()

def main(_):
    ps_worker = Process(target=run_ps_server, args=())
    ps_worker.daemon = True
    ps_worker.start()
    worker_processes = []
    for i in xrange(N_WORKERS):
        time.sleep(0.01)
        w = Process(target=run_worker, args=(i,))
        w.daemon = True
        w.start()
        worker_processes.append(w)
    for w in worker_processes:
        w.join()
    ps_worker.terminate()

if __name__ == '__main__':
    tf.app.run()
Python 2.7, TensorFlow 0.12.1 (CPU), Mint 17 (Ubuntu x64)
Edit:
The problem does not reproduce on the CUDA build of TensorFlow.

Comments:
Please show your code here.
Code added.
So it hangs only on the CUDA version, and not on the CPU one? What if you run the CUDA version but set `export CUDA_VISIBLE_DEVICES=`? Also, attaching gdb and printing a stack trace (`bt`) could help debug it.
It turned out that CUDA had nothing to do with it. Once I changed the ports on my local machine, the script ran fine. The problem turned out to be deeper than I thought. See the discussion here:
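Since the fix was changing the ports, a quick way to avoid hand-picking port numbers that may already be in use is to let the OS assign free ephemeral ports and build the cluster spec from those. This is a minimal sketch; the helper name `pick_free_port` is my own, and note there is still a small race window between closing the probe socket and the TensorFlow servers binding the port:

```python
import socket

def pick_free_port():
    # Bind to port 0 so the OS assigns an unused ephemeral port,
    # then read back which port it chose.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(('127.0.0.1', 0))
    port = s.getsockname()[1]
    s.close()
    return port

# One port for the ps task and one per worker task.
ports = [pick_free_port() for _ in range(4)]
SPEC = {'ps': ['127.0.0.1:%d' % ports[0]],
        'worker': ['127.0.0.1:%d' % p for p in ports[1:]]}
```

Because `SPEC` is a module-level constant, each forked process sees the same port assignment, so the cluster spec stays consistent across the ps and worker processes.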