Python: Slow distributed system with TensorFlow [python, tensorflow, distributed, tensorflow-gpu]

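I am working on a TensorFlow project that uses a large model (one that does not fit on a 4 GB graphics card). TL;DR: running one part of the model on the CPU and the other part on the GPU takes about 4 seconds per batch. We are building a distributed setup that should run across 2-3 machines, and we want to distribute the work in a way that ultimately speeds up training.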

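Details: because TensorFlow lacks proper documentation (or other tutorials/guides) on this, we have run into a lot of problems. The best distributed setup we managed to build runs 1 ps and 1 worker (part of the worker on the CPU, the other part on the GPU) and takes about 6 seconds per batch. We then tried another setup with 1 ps and 2 workers (each worker using a 4 GB graphics card), where the best time we reached was 7 seconds per batch. The last setup is 2 ps and 2 workers on the same machine, but with each worker running the whole model, so each worker trains on a different batch.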

import tensorflow as tf  # TensorFlow 1.x API (tf.train.Supervisor, tf.ConfigProto)

# Describe the cluster and start this task's in-process server.
cluster = tf.train.ClusterSpec({"ps": args.ps_hosts.split(","), "worker": args.worker_hosts.split(",")})
server = tf.train.Server(cluster, job_name=args.job_name, task_index=args.task_index)
if args.job_name == "ps":
    # Parameter servers only host variables, so they block here.
    server.join()
else:
    # self.task_index is assumed to hold the same value as args.task_index.
    with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:%d" % self.task_index, cluster=cluster)):
        # Rest of code
        ...

.....
# Part where I divide the half on the cpu and half on the gpu:
half = len(tvars) // 2  # integer division so the slice index stays an int
with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:%d/cpu:0" % self.task_index, cluster=cluster)):
    logger.write("First half gradient on  cpu")
    testGradient2 = tf.gradients(self.cost, tvars[half:])
with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:%d/gpu:0" % self.task_index, cluster=cluster)):
    logger.write("Second half gradient on gpu")
    testGradient1 = tf.gradients(self.cost, tvars[:half])
with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:%d" % self.task_index, cluster=cluster)):
    # List concatenation: the gradients end up in the same order as tvars.
    testGradient = testGradient1 + testGradient2
....
# Supervisor part and configuration and session setup
sv = tf.train.Supervisor(is_chief=(self.task_index == 0), init_op=tf.global_variables_initializer())
config = tf.ConfigProto(allow_soft_placement=True)
self.sess = sv.prepare_or_wait_for_session(server.target, config=config)
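
The half/half split of tvars can be sanity-checked without TensorFlow. The sketch below is only an illustration: `split_in_half` and the names in `tvars_stub` are hypothetical stand-ins, not part of the project code. It shows that the slice index has to be an integer (in Python 3, `len(x) / 2` is a float and cannot be used in a slice) and that concatenating the two halves restores the original order, which is what adding the two gradient lists relies on.

```python
def split_in_half(items):
    """Return (first_half, second_half) of a list, split at len // 2."""
    # Integer division: a float index would raise TypeError when slicing.
    half = len(items) // 2
    return items[:half], items[half:]


# Hypothetical variable names standing in for the entries of tvars.
tvars_stub = ["w1", "b1", "w2", "b2", "w3"]
first, second = split_in_half(tvars_stub)

# With an odd length, the extra element lands in the second half.
print(first, second)

# Concatenating the halves gives back the original order, so gradients
# computed per half line up with tvars again after list concatenation.
assert first + second == tvars_stub
```

Note that splitting by variable count does not necessarily split the compute or memory evenly, since variables can differ a lot in size.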

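As I said, this code takes about 6.5 seconds per batch when running with 2 ps (on the same machine) and 2 workers. Is there an optimization or a key point that I am missing?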
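
When comparing these numbers, it may help to look at the effective time per processed batch rather than the per-step time. The sketch below rests on an assumption the question does not confirm: that the quoted times are per worker step and that, in the 2-worker setup, both workers step concurrently on different batches. The function name and the setup labels are made up for illustration.

```python
def effective_seconds_per_batch(step_seconds, num_workers):
    """Wall-clock seconds per processed batch with parallel workers.

    Idealized model: every worker handles a different batch per step and
    all workers step at the same time (stragglers are ignored).
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return step_seconds / num_workers


# Timings quoted in the question; the labels are only descriptive.
setups = {
    "single machine, CPU+GPU split": (4.0, 1),
    "1 ps + 1 worker": (6.0, 1),
    "2 ps + 2 workers, full model on each": (6.5, 2),
}

for name, (step_seconds, workers) in setups.items():
    seconds = effective_seconds_per_batch(step_seconds, workers)
    print("%s: %.2f s per batch" % (name, seconds))
```

Under that assumption, the 2-worker data-parallel setup would process a batch roughly every 3.25 seconds, which is only a modest gain over the 4 seconds of the non-distributed run. If the 6.5 seconds already covers both workers' batches, there is no gain at all.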