tensorflow.python.framework.errors.InvalidArgumentError: WhereOp: Race condition between counting the number of true elements and writing them

I am running the distributed TensorFlow version of deep MNIST (asynchronous data parallelism, running on different machines that use GPUs). The batch size I take for one epoch is 1000 images, and I am running 1000 epochs. My architecture has one ps (on machine 1) and two workers: worker task_index=0, which is the chief, is on machine 2, and worker task_index=1 is on machine 1.

I really do not understand why this error sometimes appears when I run worker task_index=1 on machine 1. However, if I try running it a few more times, the error does not persist.
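
To make the topology concrete, the cluster is wired up roughly as sketched below (the addresses are the ones that appear in the gRPC log further down; the flag names are illustrative rather than my exact script):

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("job_name", "worker", "'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "index of the task within its job")
FLAGS = flags.FLAGS

# Addresses taken from the gRPC channel log lines below:
#   ps     -> 172.25.1.127:2223   (machine 1)
#   worker -> 172.25.1.108:2222   (machine 2, task_index=0, chief)
#             172.25.1.127:2222   (machine 1, task_index=1)
cluster = tf.train.ClusterSpec({
    "ps":     ["172.25.1.127:2223"],
    "worker": ["172.25.1.108:2222", "172.25.1.127:2222"],
})

server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    server.join()   # the parameter server only serves variables

The output below is from worker task_index=1 on machine 1 (the run that fails):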

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so.7.5 locally
cannot import name hashtable
cannot import name hashtable
cannot import name hashtable
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:118] Found device 0 with properties: 
name: GeForce GTX 750 Ti
major: 5 minor: 0 memoryClockRate (GHz) 1.124
pciBusID 0000:02:00.0
Total memory: 2.00GiB
Free memory: 125.53MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:138] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:148] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:868] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:02:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 125.53M (131629056 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {172.25.1.127:2223}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {172.25.1.108:2222, localhost:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:203] Started server with target: grpc://localhost:2222
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Traceback (most recent call last):
  File "dist_trainer_deepMnist_v2.py", line 205, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "dist_trainer_deepMnist_v2.py", line 157, in main
    with sv.managed_session(server.target) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 969, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 797, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 958, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 722, in prepare_or_wait_for_session
    max_wait_secs=max_wait_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 351, in wait_for_session
    is_ready, not_ready_msg = self._model_ready(sess)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 437, in _model_ready
    return self._ready(self._ready_op, sess, "Model not ready")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in _ready
    ready_value = sess.run(op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 710, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 908, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 958, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 978, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: WhereOp: Race condition between counting the number of true elements and writing them.  When counting, saw 2402 elements; but when writing their indices, saw 19 elements.
     [[Node: report_uninitialized_variables/Where = Where[_device="/job:ps/replica:0/task:0/cpu:0"](report_uninitialized_variables/Reshape_1)]]
Caused by op u'report_uninitialized_variables/Where', defined at:
  File "dist_trainer_deepMnist_v2.py", line 205, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "dist_trainer_deepMnist_v2.py", line 153, in main
    save_model_secs=600)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 310, in __init__
    ready_op=ready_op, ready_for_local_init_op=ready_for_local_init_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 399, in _init_ready_op
    ready_op = variables.report_uninitialized_variables()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1031, in report_uninitialized_variables
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 888, in boolean_mask
    return _apply_mask_1d(tensor, mask)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 863, in _apply_mask_1d
    indices = squeeze(where(mask), squeeze_dims=[1])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2663, in where
    result = _op_def_lib.apply_op("Where", input=input, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2333, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1252, in __init__
    self._traceback = _extract_stack()
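
For reference, lines 153 and 157 of dist_trainer_deepMnist_v2.py that the traceback points at correspond roughly to the following (continuing the sketch above; the deep MNIST conv net is replaced here by a stand-in softmax model, and variable names are illustrative):

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data", one_hot=True)
is_chief = (FLAGS.task_index == 0)

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):
    # Stand-in model; the real script builds the deep MNIST conv net here.
    x  = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])
    W  = tf.Variable(tf.zeros([784, 10]))
    b  = tf.Variable(tf.zeros([10]))
    y  = tf.nn.softmax(tf.matmul(x, W) + b)
    cross_entropy = -tf.reduce_sum(y_ * tf.log(y + 1e-10))
    global_step = tf.Variable(0, name="global_step", trainable=False)
    train_op = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy,
                                                     global_step=global_step)
    init_op = tf.initialize_all_variables()

sv = tf.train.Supervisor(is_chief=is_chief,
                         init_op=init_op,
                         global_step=global_step,
                         save_model_secs=600)          # line 153 in the traceback

# Line 157 in the traceback: a non-chief worker blocks here, repeatedly running
# the ready_op (report_uninitialized_variables) against the ps until the chief
# has initialized all variables -- that ready_op contains the Where op named
# in the error message.
with sv.managed_session(server.target) as sess:
    for step in range(1000):                           # 1000 epochs of 1000 images
        batch_xs, batch_ys = mnist.train.next_batch(1000)
        sess.run(train_op, feed_dict={x: batch_xs, y_: batch_ys})

sv.stop()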
Another clue is:

E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 125.53M (131629056 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

This is what happens when I run the ps on a machine that is also using the GPU. It would be great if someone could explain why this is happening.
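
I have not done anything to keep the ps process off that GPU. Something like the snippet below would presumably hide the card from the ps (this is a guess, not part of my current script), but I do not know whether it is related to either error:

import os

# Hypothetical tweak: hide the GPU from the ps process so it does not compete
# with worker task_index=1 for the 2 GB card on machine 1. This has to run
# before the ps process creates its tf.train.Server (i.e. before CUDA is
# initialized in that process).
if FLAGS.job_name == "ps":
    os.environ["CUDA_VISIBLE_DEVICES"] = ""   # ps sees no GPU, falls back to CPU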

Looks like a real bug, and a nasty one. Would you mind filing an issue on GitHub? OK, I will. Thank you very much.