Why does a distributed TensorFlow toy example take too long?

python tensorflow

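I tried to run a toy example that does some matrix multiplication and addition with distributed TensorFlow.

My goal is to compute (A^n + B^n), where A and B are L x L matrices.

I use two machines on a public cloud: A^n is computed on the first machine, B^n on the second machine, and the addition is then done back on the first machine.

When the machines have only CPUs, my script works fine.
When both have a GPU, it fails to run in a reasonable time! It has a huge delay.

My question: what am I doing wrong in my script?

Note that for machine2 (task:1) I used server.join(), and I used machine1 (task:0) as the client in this in-graph setup.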

from __future__ import print_function   # __future__ imports must be the first statement in the file
#------------------------------------------------------------------
from zmq import Stopwatch; aClk_E2E = Stopwatch(); aClk_E2E.start()
#------------------------------------------------------------------
import numpy as np
import tensorflow as tf
import datetime

IP_1 = '10.132.0.2';     port_1 = '2222'
IP_2 = '10.132.0.3';     port_2 = '2222'

cluster = tf.train.ClusterSpec( { "local": [ IP_1 + ":" + port_1,
                                             IP_2 + ":" + port_2
                                             ],
                                   }
                                )
server = tf.train.Server( cluster,
                          job_name   = "local",
                          task_index = 0
                          )
# server.join() # @machine2 ( task:1 )

n =    5
L = 1000  

def matpow( M, n ):
    # NOTE: with the base case at n < 1 this performs n multiplications and returns M^(n+1)
    if n < 1:                 # Abstract cases where n < 1
        return M
    else:
        return tf.matmul( M, matpow( M, n - 1 ) )

G = tf.Graph()

with G.as_default():
     with tf.device( "/job:local/task:1/cpu:0" ):
          c1 = []
          tB = tf.placeholder( tf.float32, [L, L] )     # tensor B placeholder
          with tf.device( "/job:local/task:1/gpu:0" ):
               c1.append( matpow( tB, n ) )

     with tf.device( "/job:local/task:0/cpu:0" ):
          c2 = []
          tA = tf.placeholder( tf.float32, [L, L] )     # tensor A placeholder
          with tf.device( "/job:local/task:0/gpu:0" ):
               c2.append( matpow( tA, n ) )
          sum2 = tf.add_n( c1 + c2 )
#---------------------------------------------------------<SECTION-UNDER-TEST>
t1_2 = datetime.datetime.now()
with tf.Session( "grpc://" + IP_1 + ":" + port_1, graph = G ) as sess:
     A = np.random.rand( L, L ).astype( 'float32' )
     B = np.random.rand( L, L ).astype( 'float32' )
     sess.run( sum2, { tA: A, tB: B, } )
t2_2 = datetime.datetime.now()
#---------------------------------------------------------<SECTION-UNDER-TEST>

#------------------------------------------------------------------
_ = aClk_E2E.stop()
#------------------------------------------------------------------
print( "Distributed Computation time: " + str(t2_2 - t1_2))
print( "Distributed Experiment  took: {0: > 16d} [us] End-2-End.".format( _ ) )

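Distributed computing is a new universe, or rather a set of parallel universes. The first steps into this territory are always challenging. The determinism that was taken for granted in earlier experience is lost. Many new challenges have no analogue in single-node process coordination, and many new surprises come from the new orders of magnitude of distributed-execution timing and from distributed blocking issues (coordination problems, if not outright deadlock and/or livelock).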

Thanks for adding some quantitative facts: ~15 seconds is "too long" for A[1000,1000]; B[1000,1000]; n=5. So far so good.
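For scale, here is a rough back-of-envelope sketch of what this problem size involves. It assumes dense float32 matrices (as in the script) and the usual estimate of about 2 * L^3 floating-point operations per L x L matrix product. The helper names are made up for illustration.

```python
# Back-of-envelope sizing for the toy problem: L x L float32 matrices, n products.
# Assumption: a dense L x L matmul costs about 2 * L**3 floating-point operations.

def matrix_bytes(L, bytes_per_element=4):
    """Size in bytes of one dense L x L matrix (float32 -> 4 bytes per element)."""
    return L * L * bytes_per_element

def matmul_flops(L, n_products):
    """Approximate FLOP count for n_products dense L x L matrix products."""
    return 2 * L**3 * n_products

L, n = 1000, 5
print("one matrix : %.1f MB" % (matrix_bytes(L) / 1e6))         # 4.0 MB
print("n products : %.1f GFLOP" % (matmul_flops(L, n) / 1e9))   # 10.0 GFLOP
```

Under these assumptions each matrix is about 4 MB and the five products per machine come to roughly 10 GFLOP, so a 15-second run probably reflects overhead outside the multiplications themselves (start-up, device placement, transfers).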

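Would you mind adding the code changes proposed above and re-running the experiment on the same real infrastructure?

This will help the rest of the work that has already started (W.I.P. here).

Thanks in advance for running it and posting the updated facts.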

It is hard to continue with quantitatively supported statements; however, my gut-feeling suspicion is this:

 def matpow( M, n ):
     return  M if ( n < 1 ) else tf.matmul( M, matpow( M, n - 1 ) )
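
As a side check on the recursion itself, here is a minimal pure-Python sketch with the same structure. It uses plain numbers instead of tensors, and a hypothetical mul argument stands in for tf.matmul. It suggests that with the n < 1 base case the function performs n multiplications and returns M^(n+1) rather than M^n. matpow_exact is one possible variant, not the original author's code.

```python
def matpow(M, n, mul):
    # Same shape as the snippet above; `mul` stands in for tf.matmul.
    return M if n < 1 else mul(M, matpow(M, n - 1, mul))

def matpow_exact(M, n, mul):
    # Variant with the base case at n <= 1, so that n >= 1 yields M**n.
    return M if n <= 1 else mul(M, matpow_exact(M, n - 1, mul))

calls = []

def counting_mul(a, b):
    calls.append(1)          # record one multiplication
    return a * b

print(matpow(2, 5, counting_mul), len(calls))        # 64 5  -> 2**6, five multiplications
del calls[:]
print(matpow_exact(2, 5, counting_mul), len(calls))  # 32 4  -> 2**5, four multiplications
```

This does not affect the timing question much, but it matters when results are compared against an independently computed A^n + B^n.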

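Would you mind updating your post with a quantitative statement: for an explicitly given matrix size of 1E+{3,6,9} x 1E+{3,6,9}, what do you consider {ultra-low, sufficient, too large} latency, and what is {fast, fast enough, too slow} processing? It is fair and scientifically rigorous to state this, isn't it?

I believe I solved this issue by moving from Windows machines to Ubuntu machines. Here are the numbers: on Windows machines, with L=1000, it takes about 15 seconds, whereas on Linux it takes less than 1 second. Any idea why it does not work properly on Windows?

@conflux: the first call may be slow because of one-time initialization in each process. Can you rule that out? You can also insert tf.Print nodes to get individual timestamps and see at which stage the slowness occurs.

The official release being 15x slower on Windows machines sounds like a bug. Also, the cause of this slowness may be an issue that should have been fixed recently. You can run the toy benchmark from the referenced thread to see whether the Windows build lacks the fix.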
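
To rule out the one-time initialization mentioned in the comments, one option is to time the first call separately from later calls. This is a minimal sketch using only the standard library and assuming Python 3. time_calls is a hypothetical helper, and fn would be something like lambda: sess.run( sum2, { tA: A, tB: B } ) executed inside the open session.

```python
import time

def time_calls(fn, repeats=3):
    """Return (first_call_seconds, [later_call_seconds, ...]) for fn()."""
    t0 = time.perf_counter()
    fn()                                   # first call: includes any one-time setup
    first = time.perf_counter() - t0
    later = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()                               # subsequent calls: steady-state cost
        later.append(time.perf_counter() - t0)
    return first, later

if __name__ == "__main__":
    # Stand-in workload; replace with the real sess.run(...) call.
    first, later = time_calls(lambda: sum(i * i for i in range(100000)))
    print("first call : %.6f s" % first)
    print("later calls: " + ", ".join("%.6f s" % t for t in later))
```

If the first call dominates and the later calls are fast, the delay is mostly initialization rather than the computation or the transfers.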

   Category                     GPU
   |                            Hardware
   |                            Unit
   |                            |            Throughput
   |                            |            |               Execution
   |                            |            |               Latency
   |                            |            |               |                  PTX instructions                                                      Note 
   |____________________________|____________|_______________|__________________|_____________________________________________________________________|________________________________________________________________________________________________________________________
   Load_shared                  LSU          2               +  30              ld, ldu                                                               Note, .ss = .shared ; .vec and .type determine the size of load. Note also that we omit .cop since no cacheable in Ocelot
   Load_global                  LSU          2               + 600              ld, ldu, prefetch, prefetchu                                          Note, .ss = .global; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
   Load_local                   LSU          2               + 600              ld, ldu, prefetch, prefetchu                                          Note, .ss = .local; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
   Load_const                   LSU          2               + 600              ld, ldu                                                               Note, .ss = .const; .vec and .type determine the size of load
   Load_param                   LSU          2               +  30              ld, ldu                                                               Note, .ss = .param; .vec and .type determine the size of load
   |                            |                              
   Store_shared                 LSU          2               +  30              st                                                                    Note, .ss = .shared; .vec and .type determine the size of store
   Store_global                 LSU          2               + 600              st                                                                    Note, .ss = .global; .vec and .type determine the size of store
   Store_local                  LSU          2               + 600              st                                                                    Note, .ss = .local; .vec and .type determine the size of store
   Read_modify_write_shared     LSU          2               + 600              atom, red                                                             Note, .space = shared; .type determine the size
   Read_modify_write_global     LSU          2               + 600              atom, red                                                             Note, .space = global; .type determine the size
   |                            |                              
   Texture                      LSU          2               + 600              tex, txq, suld, sust, sured, suq
   |                            |                              
   Integer                      ALU          2               +  24              add, sub, add.cc, addc, sub.cc, subc, mul, mad, mul24, mad24, sad, div, rem, abs, neg, min, max, popc, clz, bfind, brev, bfe, bfi, prmt, mov
   |                            |                                                                                                                     Note, these integer inst. with type = { .u16, .u32, .u64, .s16, .s32, .s64 };
   |                            |                              
   Float_single                 ALU          2               +  24              testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max     Note, these Float-single inst. with type = { .f32 };
   Float_double                 ALU          1               +  48              testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max     Note, these Float-double inst. with type = { .f64 };
   Special_single               SFU          8               +  48              rcp, sqrt, rsqrt, sin, cos, lg2, ex2                                  Note, these special-single with type = { .f32 };
   Special_double               SFU          8               +  72              rcp, sqrt, rsqrt, sin, cos, lg2, ex2                                  Note, these special-double with type = { .f64 };
   |                                                           
   Logical                      ALU          2               +  24              and, or, xor, not, cnot, shl, shr
   Control                      ALU          2               +  24              bra, call, ret, exit
   |                                                           
   Synchronization              ALU          2               +  24              bar, membar, vote
   Compare & Select             ALU          2               +  24              set, setp, selp, slct
   |                                                           
   Conversion                   ALU          2               +  24              isspacep, cvta, cvt
   Miscellaneous                ALU          2               +  24              brkpt, pmevent, trap
   Video                        ALU          2               +  24              vadd, vsub, vabsdiff, vmin, vmax, vshl, vshr, vmad, vset
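
To make the table easier to compare at a glance, here is a small sketch that copies a few of its Execution Latency figures into a dictionary and prints them relative to a shared-memory load. The numbers are taken from the table as-is (its units are not stated in this text). The dictionary and function names are hypothetical.

```python
# A few "Execution Latency" figures copied from the table above (units as in the table).
LATENCY = {
    "Load_shared": 30,
    "Load_global": 600,
    "Integer": 24,
    "Float_single": 24,
    "Float_double": 48,
    "Special_double": 72,
}

def relative_latency(category, baseline="Load_shared"):
    """Latency of `category` as a multiple of the baseline category."""
    return LATENCY[category] / float(LATENCY[baseline])

for name in sorted(LATENCY, key=LATENCY.get):
    print("%-16s %4d  (x%.1f)" % (name, LATENCY[name], relative_latency(name)))
```

Read this way, a global-memory load is listed at 20 times the latency of a shared-memory load, so where a kernel keeps its data matters far more than which arithmetic instruction it uses.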