TensorFlow cannot use RDMA on an InfiniBand network
Running:
python tf_cnn_benchmarks.py \
--local_parameter_device=gpu \
--num_gpus=1 \
--batch_size=2 \
--model=alexnet \
--variable_update=distributed_replicated \
--job_name=ps \
--ps_hosts=192.168.230.107:50000 \
--worker_hosts=192.168.230.107:60000,192.168.230.108:60000 \
--task_index=0 \
--server_protocol=grpc+verbs
fails with the following message:
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 1258, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 1248, in main
    bench = BenchmarkCNN()
  File "tf_cnn_benchmarks.py", line 525, in __init__
    protocol=FLAGS.server_protocol)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 145, in __init__
    self._server_def.SerializeToString(), status)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: No server factory registered for the given ServerDef: cluster {
  job {
    name: "ps"
    tasks {
      key: 0
      value: "192.168.230.107:50000"
    }
  }
  job {
    name: "worker"
    tasks {
      key: 0
      value: "192.168.230.107:60000"
    }
    tasks {
      key: 1
      value: "192.168.230.108:60000"
    }
  }
}
job_name: "ps"
default_session_config {
  intra_op_parallelism_threads: 1
  gpu_options {
    force_gpu_compatible: true
  }
  allow_soft_placement: true
}
protocol: "grpc+verbs"
My TensorFlow installation is tensorflow-gpu, version 1.4.1.
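Can TensorFlow use RDMA on an InfiniBand card?
Are you sure your TensorFlow build supports RDMA? To use the grpc+verbs protocol, you have to build TensorFlow from source yourself and select RDMA support during the configure step.
You're right, I need to compile it from source myself instead of using the official release. Thanks!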
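As an illustration of the point above, here is a minimal sketch of trying `grpc+verbs` first and falling back to plain `grpc` when the installed build has no server factory for it. The helper `start_server_with_fallback` and its parameters are hypothetical names made up for this example, not part of TensorFlow. The commented usage assumes the TensorFlow 1.x `tf.train.Server` API and that an unsupported protocol raises `tf.errors.NotFoundError`, as in the traceback above.

```python
def start_server_with_fallback(make_server,
                               protocols=("grpc+verbs", "grpc"),
                               unsupported_errors=(Exception,)):
    """Try each protocol in order; return (server, protocol) for the first that works.

    make_server: callable taking a protocol string and returning a server object.
    unsupported_errors: exception types that mean "this build lacks the protocol".
    """
    last_error = None
    for protocol in protocols:
        try:
            return make_server(protocol), protocol
        except unsupported_errors as err:
            # Remember the failure and move on to the next candidate protocol.
            last_error = err
    raise RuntimeError("no usable protocol among %r" % (protocols,)) from last_error


# Hypothetical usage with TensorFlow 1.x (not executed here):
#
# import tensorflow as tf
# cluster = tf.train.ClusterSpec({
#     "ps": ["192.168.230.107:50000"],
#     "worker": ["192.168.230.107:60000", "192.168.230.108:60000"],
# })
# server, used = start_server_with_fallback(
#     lambda p: tf.train.Server(cluster, job_name="ps", task_index=0, protocol=p),
#     unsupported_errors=(tf.errors.NotFoundError,),
# )
# print("started with protocol:", used)
```

Falling back silently hides the fact that RDMA is not in use, so logging the chosen protocol is advisable. To actually get RDMA, the build-from-source step is still required.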
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.40.7000
    node_guid: 248a:0703:00df:3ad0
    sys_image_guid: 248a:0703:00df:3ad3
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x1
    board_id: MT_1090120019
    phys_port_cnt: 2
    Device ports:
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 5
            port_lmc: 0x00
            link_layer: InfiniBand
        port: 2
            state: PORT_DOWN (1)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: InfiniBand
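
The device listing above shows port 1 as PORT_ACTIVE and port 2 as PORT_DOWN. As a rough sketch, the same check can be done programmatically before launching a distributed job. This assumes the text is `ibv_devinfo`-style output with `port:` and `state:` lines like those shown. The function name `active_ports` and the shortened `SAMPLE` text are made up for illustration.

```python
import re


def active_ports(devinfo_text):
    """Return the port numbers whose state line reads PORT_ACTIVE."""
    active = []
    current_port = None
    for raw in devinfo_text.splitlines():
        line = raw.strip()
        # A "port: N" line starts a new port section ("port_lid:" does not match).
        m = re.match(r"port:\s*(\d+)$", line)
        if m:
            current_port = int(m.group(1))
            continue
        # A "state: NAME (code)" line belongs to the most recent port section.
        m = re.match(r"state:\s*(\w+)", line)
        if m and current_port is not None and m.group(1) == "PORT_ACTIVE":
            active.append(current_port)
    return active


# Shortened copy of the listing above, used only as sample input.
SAMPLE = """\
hca_id: mlx4_0
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
port_lid: 5
link_layer: InfiniBand
port: 2
state: PORT_DOWN (1)
port_lid: 0
link_layer: InfiniBand
"""

print(active_ports(SAMPLE))  # [1]
```

An active port only confirms that the InfiniBand link is up. It says nothing about whether the TensorFlow build itself includes verbs support.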