TensorFlow: UnavailableError: OS Error when adding worker processes (distributed GPU mode)


I am running TensorFlowOnSpark with 1 NameNode and 4 DataNodes; each DataNode has 4 TITAN Xp GPUs (16 GPUs in total).

On each DataNode, a "Hello, TensorFlow" sanity check runs as follows:

Python 3.5.2 (default, Feb  7 2018, 11:42:44) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-02-07 20:54:37.894085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Found device 0 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:03:00.0
totalMemory: 11.91GiB freeMemory: 11.71GiB
2018-02-07 20:54:38.239744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Found device 1 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-02-07 20:54:38.587283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Found device 2 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:83:00.0
totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-02-07 20:54:38.922914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Found device 3 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:84:00.0
totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-02-07 20:54:38.927574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1221] Device peer to peer matrix
2018-02-07 20:54:38.927719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1227] DMA: 0 1 2 3 
2018-02-07 20:54:38.927739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1237] 0:   Y Y N N 
2018-02-07 20:54:38.927750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1237] 1:   Y Y N N 
2018-02-07 20:54:38.927760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1237] 2:   N N Y Y 
2018-02-07 20:54:38.927774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1237] 3:   N N Y Y 
2018-02-07 20:54:38.927791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1300] Adding visible gpu device 0
2018-02-07 20:54:38.927804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1300] Adding visible gpu device 1
2018-02-07 20:54:38.927816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1300] Adding visible gpu device 2
2018-02-07 20:54:38.927827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1300] Adding visible gpu device 3
2018-02-07 20:54:40.194789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:987] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11341 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:03:00.0, compute capability: 6.1)
2018-02-07 20:54:40.376314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:987] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11374 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-02-07 20:54:40.556361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:987] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11374 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
2018-02-07 20:54:40.740179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:987] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11374 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:84:00.0, compute capability: 6.1)
>>> print(sess.run(hello))
b'Hello, TensorFlow!'
With this hardware setup, the MNIST example with 8 executors (so two executors land on each DataNode) runs: some workers may hit the "OS Error" below, but after an automatic retry they proceed into training normally and eventually write out the model as expected. (The launch is sketched right after this paragraph, followed by the error trace.)
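
For reference, the job is launched essentially like the stock TensorFlowOnSpark mnist_spark.py. The sketch below follows that example (the argument names, mnist_dist.map_fun and the TFCluster.run call are taken from it, not copied from my attached run script):

from pyspark.context import SparkContext
from pyspark.conf import SparkConf
from tensorflowonspark import TFCluster
import argparse
import mnist_dist  # the worker module from the stock MNIST example

sc = SparkContext(conf=SparkConf().setAppName("mnist_spark"))

parser = argparse.ArgumentParser()
parser.add_argument("--images", help="HDFS path to the MNIST images (CSV)")
parser.add_argument("--labels", help="HDFS path to the MNIST labels (CSV)")
parser.add_argument("--epochs", type=int, default=1)
parser.add_argument("--cluster_size", type=int, default=8)  # 8 executors -> 2 per DataNode
parser.add_argument("--rdma", action="store_true")
args = parser.parse_args()

num_ps = 1  # one parameter server, the remaining executors are workers

# Feed the data through Spark (InputMode.SPARK), as in the stock example.
images = sc.textFile(args.images).map(lambda ln: [float(x) for x in ln.split(",")])
labels = sc.textFile(args.labels).map(lambda ln: [float(x) for x in ln.split(",")])
dataRDD = images.zip(labels)

cluster = TFCluster.run(sc, mnist_dist.map_fun, args, args.cluster_size,
                        num_ps, False, TFCluster.InputMode.SPARK)
cluster.train(dataRDD, args.epochs)
cluster.shutdown()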

18/02/07 21:05:36 ERROR Executor: Exception in task 13.0 in stage 0.0 (TID 13)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
    return fn(*args)
  File "/usr/local/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1320, in _run_fn
    self._extend_graph()
  File "/usr/local/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1381, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "/usr/local/python3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000016/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000016/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 346, in func
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 794, in func
However, when the number of executors is raised to 16, the error above shows up far more often, no worker ever gets into training, and the job eventually fails and exits. Since we have 16 GPUs, the cluster should be able to land each executor on its own GPU, and in a single-node TensorFlow test all GPUs are accessible. (On the PCI bus, peer-to-peer access only works within the 0-1 and 2-3 pairs, as the DMA matrix above shows, but I don't think that should affect executor communication.) The kind of per-executor GPU assignment I have in mind is sketched below.
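
To make the "one executor per GPU" expectation concrete, here is roughly the per-executor pinning I have in mind inside the worker function. The explicit CUDA_VISIBLE_DEVICES setting and the modulo-4 mapping are only my illustration (the stock mnist_dist.map_fun instead lets TFNode.start_cluster_server pick a GPU for the worker):

import os

def map_fun(args, ctx):
    # Hypothetical pinning: ctx.worker_num is the TensorFlowOnSpark worker
    # index; mapping it modulo 4 onto the four local TITAN Xp cards is an
    # illustration only, not the stock example's logic.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(ctx.worker_num % 4)

    # Import TensorFlow only after CUDA_VISIBLE_DEVICES is set, so the
    # process never touches the other GPUs on the node.
    import tensorflow as tf
    from tensorflowonspark import TFNode

    # Ask for exactly one GPU per worker, as the stock example does.
    cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma)

    if ctx.job_name == "ps":
        server.join()
    else:
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % ctx.task_index,
                cluster=cluster)):
            pass  # model graph goes here, as in mnist_dist.py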

Another trial: after I start one job with 8 executors and every GPU has begun training, I launch a second job, also with 8 executors. The second job still fails, in the same way as the 16-executor run.

The logs of the 16-executor run, together with the run script, are attached. Please take a look and help me figure out what is going on. Thanks.