Keras distributed training with multiple GPUs in Python - Segmentation fault (core dumped)


I am trying to run the official distributed Keras example on two nodes, each with one GPU. On the first node I run
TF_CONFIG='{"cluster": {"worker": ["ip1:2222", "ip2:2222"]}, "task": {"index": 0, "type": "worker"}}' python3 test.py
and on the second node I run
TF_CONFIG='{"cluster": {"worker": ["ip1:2222", "ip2:2222"]}, "task": {"index": 1, "type": "worker"}}' python3 test.py
When I print
device_lib.list_local_devices()
both nodes detect their GPU, but I get the errors shown below. When I run the script on each node individually, without
TF_CONFIG
, everything works fine. Any idea what is going wrong?
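As an aside, the nested quotes in TF_CONFIG are easy to get wrong on the shell command line (a missing bracket or quote produces invalid JSON that TensorFlow may reject or misparse). A minimal sketch of building the same value programmatically with json.dumps, using the worker addresses from the commands above as placeholders:

```python
import json
import os

def make_tf_config(workers, index):
    """Build the TF_CONFIG JSON string for one worker in the cluster."""
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"index": index, "type": "worker"},
    })

# Configuration for the first of the two workers (task index 0).
cfg = make_tf_config(["ip1:2222", "ip2:2222"], index=0)
os.environ["TF_CONFIG"] = cfg  # must be set before TensorFlow reads it
print(cfg)
```

Setting os.environ["TF_CONFIG"] inside the script (before the strategy is created) is equivalent to prefixing the variable on the command line, and guarantees the JSON is well-formed.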

Node 1:

2019-11-13 18:20:00.974896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 7658 MB memory) -> physical GPU (device: 0, pci bus id: 0000:84:00.0, compute capability: 3.5)
2019-11-13 18:20:00.977161: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:250] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222, 1 -> ip2:2222}
2019-11-13 18:20:00.981865: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:2222
Segmentation fault (core dumped)
Node 2:

2019-11-13 18:20:04.121540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:worker/replica:0/task:1/device:GPU:0 with 7659 MB memory) -> physical GPU (device: 0, pci bus id: 0000:84:00.0, compute capability: 3.5)
2019-11-13 18:20:04.123868: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:250] Initialize GrpcChannelCache for job worker -> {0 -> ip1:2222, 1 -> localhost:2222}
2019-11-13 18:20:04.129259: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:2222
Segmentation fault (core dumped)