Docker: GPU out-of-memory error just from declaring a TF Keras metric

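I recently moved my code from my local machine to a GPU-enabled server and ran into a strange OOM error. By process of elimination, the problem seems to be the TF Keras metrics. My code is now reduced to: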

import tensorflow as tf

METRICS = [
      tf.keras.metrics.Precision(name='precision')
]
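…but I still get the error. No other processes are running. I am doing this inside a Docker container (tensorflow/tensorflow:latest-gpu-py3), by the way, which may be the problem, but I can't find the right parameter to change.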

Any help is greatly appreciated.

Versions: Docker 17.12.1-ce, TF 2.1.0, Keras 2.3.1

Docker command:

docker run --runtime=nvidia -it --rm -v tensorflow/tensorflow:latest-gpu-py3 bash
nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:17:00.0 Off |                  N/A |
| 54%   68C    P2   137W / 200W |   7931MiB /  8119MiB |     81%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:65:00.0 Off |                  N/A |
| 33%   34C    P8    10W / 200W |    115MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
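The table above can be read programmatically to see how much memory each GPU has left. The following is only a sketch: the helper name `free_memory_mib` and the `SAMPLE` string (copied from the two memory rows above) are illustrative, and the regular expression assumes the `used MiB / total MiB` layout shown in this particular nvidia-smi output.

```python
import re

# Matches the memory column of an nvidia-smi row, e.g. "7931MiB /  8119MiB".
# The "137W / 200W" power column is not matched because it lacks "MiB".
MEMORY_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def free_memory_mib(smi_text):
    """Return the free memory in MiB for each GPU row, in order of appearance."""
    return [int(total) - int(used) for used, total in MEMORY_RE.findall(smi_text)]


# The two memory rows from the nvidia-smi output in the question.
SAMPLE = """
| 54%   68C    P2   137W / 200W |   7931MiB /  8119MiB |     81%      Default |
| 33%   34C    P8    10W / 200W |    115MiB /  8119MiB |      0%      Default |
"""

free = free_memory_mib(SAMPLE)
print(free)  # [188, 8004]
```

By this reading, GPU 0 has under 200 MiB free while GPU 1 is almost empty, even though the process list printed inside the container is blank.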
The full error output is as follows:

2020-04-20 11:05:08.088874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-04-20 11:05:08.090195: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-04-20 11:05:08.747745: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-20 11:05:08.751503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.751905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.751936: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-20 11:05:08.751963: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-20 11:05:08.753290: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-20 11:05:08.753551: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-20 11:05:08.754983: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-20 11:05:08.755747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-20 11:05:08.755786: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-20 11:05:08.757022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-04-20 11:05:08.757267: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-20 11:05:08.782042: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2020-04-20 11:05:08.783237: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5dd24e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-20 11:05:08.783274: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-20 11:05:08.996600: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5e37ca0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-20 11:05:08.996653: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2020-04-20 11:05:08.996670: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce GTX 1080, Compute Capability 6.1
2020-04-20 11:05:08.998089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.999119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.999175: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-20 11:05:08.999200: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-20 11:05:08.999241: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-20 11:05:08.999270: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-20 11:05:08.999298: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-20 11:05:08.999327: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-20 11:05:08.999359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-20 11:05:09.004066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-04-20 11:05:09.004172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-20 11:05:09.561399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-20 11:05:09.561437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 1
2020-04-20 11:05:09.561442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N Y
2020-04-20 11:05:09.561446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1:   Y N
2020-04-20 11:05:09.562474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:17:00.0, compute capability: 6.1)
2020-04-20 11:05:09.563399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7460 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080, pci bus id: 0000:65:00.0, compute capability: 6.1)
2020-04-20 11:05:09.570968: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 37.56M (39387136 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-04-20 11:05:09.572125: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 33.81M (35448576 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Traceback (most recent call last):
  File "sample.py", line 4, in <module>
    tf.keras.metrics.Precision(name='precision')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 1186, in __init__
    initializer=init_ops.zeros_initializer)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 276, in add_weight
    aggregation=aggregation)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 446, in add_weight
    caching_device=caching_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/tracking/base.py", line 744, in _add_variable_with_custom_getter
    **kwargs_for_getter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer_utils.py", line 142, in make_variable
    shape=variable_shape if variable_shape else None)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 258, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 219, in _variable_v1_call
    shape=shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 197, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 2596, in default_variable_creator
    shape=shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 262, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1411, in __init__
    distribute_strategy=distribute_strategy)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1557, in _init_from_args
    graph_mode=self._in_graph_mode)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 232, in eager_safe_variable_handle
    shape, dtype, shared_name, name, graph_mode, initial_value)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 164, in _variable_handle_from_shape_and_dtype
    math_ops.logical_not(exists), [exists], name="EagerVariableNameReuse")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_logging_ops.py", line 55, in _assert
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [0] [Op:Assert] name: EagerVariableNameReuse
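In the log above, TensorFlow creates GPU:0 with only 37 MB of memory and GPU:1 with 7460 MB, and the failed allocations follow immediately. One thing that might be worth trying, assuming the failure comes from TensorFlow placing the metric's variables on the nearly full GPU 0, is to hide that GPU from the process before TensorFlow is imported. The helper name `restrict_to_gpu` is hypothetical; `CUDA_VISIBLE_DEVICES` and `CUDA_DEVICE_ORDER` are standard CUDA environment variables. This is an untested sketch, not a confirmed fix for this setup.

```python
import os


def restrict_to_gpu(index):
    """Expose a single physical GPU to CUDA-based libraries in this process."""
    # Enumerate GPUs in PCI bus order so the index matches nvidia-smi's numbering.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Only the listed device is visible; inside the process it becomes device 0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


# Assumption: GPU 1 is the idle card, as in the nvidia-smi output of the question.
restrict_to_gpu(1)

# TensorFlow must be imported only after the variables are set, for example:
# import tensorflow as tf
# METRICS = [tf.keras.metrics.Precision(name='precision')]
```

The variables have to be set before the CUDA runtime is initialized, which is why the call comes ahead of the TensorFlow import.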