Horovod: a simple distributed Python training program for a deep learning model on a GPU cluster


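I am trying to run some Python 3 code examples on a Databricks GPU cluster (1 driver and 2 workers).

Databricks environment:

 ML 6.6, Scala 2.11, Spark 2.4.5, GPU

The goal is distributed training of a deep learning model.

I started with a very simple example: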

 from sparkdl import HorovodRunner
 hr = HorovodRunner(np=2)

 def train():
   print('in train')
   import tensorflow as tf
   print('after import tf')
   hvd.init()
   print('done')

 hr.run(train)

However, the command keeps running and never makes any progress. The only output is:

HorovodRunner will stream all training logs to notebook cell output. If there are too many logs, you
can adjust the log level in your train method. Or you can set driver_log_verbosity to
'log_callback_only' and use a HorovodRunner log callback on the first worker to get concise
progress updates.
The global names read or written to by the pickled function are {'print', 'hvd'}.
The pickled object size is 1444 bytes.
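
The log above lists `hvd` as a global name read by the pickled function, yet the snippet never shows where `hvd` is defined. Whether that is what stalls the job is not confirmed, but it is worth ruling out. The following is a minimal sketch, assuming `horovod.tensorflow` is the intended module: it moves the imports inside `train` so that each worker resolves them itself. The helpers `global_names` and `launch` are hypothetical names introduced for illustration; `global_names` only approximates the check reported in the log by using the standard `dis` module.

```python
import dis


def global_names(func):
    """Return the names a function body loads, stores or deletes as globals."""
    ops = {"LOAD_GLOBAL", "STORE_GLOBAL", "DELETE_GLOBAL"}
    return {ins.argval for ins in dis.get_instructions(func) if ins.opname in ops}


def train():
    # Imports live inside the function, so `tf` and `hvd` are locals that
    # every worker process imports for itself (assumption: horovod.tensorflow
    # is the module the original snippet meant by `hvd`).
    import tensorflow as tf  # noqa: F401  (kept to mirror the original snippet)
    import horovod.tensorflow as hvd

    hvd.init()
    print("rank %d of %d initialised" % (hvd.rank(), hvd.size()))


def launch(num_procs=2):
    # Hypothetical wrapper; only works on a Databricks ML runtime where
    # sparkdl is installed, so it is not called here.
    from sparkdl import HorovodRunner

    hr = HorovodRunner(np=num_procs)
    return hr.run(train)
```

On the cluster, `launch()` would replace the direct `hr.run(train)` call. Calling `global_names(train)` on the driver shows whether the function still depends on notebook-level globals such as `hvd`.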

### How to enable Horovod Timeline? ###
HorovodRunner has the ability to record the timeline of its activity with Horovod Timeline. To
record a Horovod Timeline, set the `HOROVOD_TIMELINE` environment variable to the location of the
timeline file to be created. You can then open the timeline file using the chrome://tracing
facility of the Chrome browser.
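
The timeline hint in the log could help show where the job stalls. The following is a minimal sketch, assuming the variable is set in the driver notebook before `hr.run` is called. The helper `enable_timeline`, the directory name, and the file name `horovod_timeline.json` are hypothetical choices, not values taken from the log.

```python
import os
import tempfile


def enable_timeline(directory):
    """Create the directory and point HOROVOD_TIMELINE at a file inside it."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "horovod_timeline.json")  # hypothetical file name
    os.environ["HOROVOD_TIMELINE"] = path
    return path


# Hypothetical location under the system temp directory; on Databricks a
# path that stays readable after the job ends may be a better choice.
timeline_path = enable_timeline(os.path.join(tempfile.gettempdir(), "horovod-timeline"))
```

If a timeline file is written, it can be opened through chrome://tracing as the log describes.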

Am I missing something, or do I need to configure something to make this work?

Thanks