Docker: GPU out-of-memory error just from declaring a TF Keras metric
I recently moved my code from my local machine to a GPU-enabled server and ran into a strange OOM error. By process of elimination, the problem seems to be the TF Keras metrics. My code is now reduced to:
import tensorflow as tf
METRICS = [
    tf.keras.metrics.Precision(name='precision')
]
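…but I still get an error. No other processes are running. I am doing this inside a Docker container (tensorflow/tensorflow:latest-gpu-py3), by the way, which may be the problem, but I can't find the right parameter to change.
Thanks a lot for your help.
Versions: Docker 17.12.1-ce, TF 2.1.0, Keras 2.3.1
Docker command: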
docker run --runtime=nvidia -it --rm -v tensorflow/tensorflow:latest-gpu-py3 bash
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:17:00.0 Off |                  N/A |
| 54%   68C    P2   137W / 200W |   7931MiB /  8119MiB |     81%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:65:00.0 Off |                  N/A |
| 33%   34C    P8    10W / 200W |    115MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
The full error output is below:
2020-04-20 11:05:08.088874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-04-20 11:05:08.090195: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-04-20 11:05:08.747745: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-20 11:05:08.751503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.751905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.751936: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-20 11:05:08.751963: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-20 11:05:08.753290: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-20 11:05:08.753551: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-20 11:05:08.754983: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-20 11:05:08.755747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-20 11:05:08.755786: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-20 11:05:08.757022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-04-20 11:05:08.757267: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-20 11:05:08.782042: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2020-04-20 11:05:08.783237: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5dd24e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-20 11:05:08.783274: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-04-20 11:05:08.996600: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5e37ca0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-20 11:05:08.996653: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2020-04-20 11:05:08.996670: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): GeForce GTX 1080, Compute Capability 6.1
2020-04-20 11:05:08.998089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.999119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.999175: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-20 11:05:08.999200: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-20 11:05:08.999241: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-20 11:05:08.999270: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-20 11:05:08.999298: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-20 11:05:08.999327: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-20 11:05:08.999359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-20 11:05:09.004066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-04-20 11:05:09.004172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-20 11:05:09.561399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-20 11:05:09.561437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0 1
2020-04-20 11:05:09.561442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N Y
2020-04-20 11:05:09.561446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1: Y N
2020-04-20 11:05:09.562474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:17:00.0, compute capability: 6.1)
2020-04-20 11:05:09.563399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7460 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080, pci bus id: 0000:65:00.0, compute capability: 6.1)
2020-04-20 11:05:09.570968: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 37.56M (39387136 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-04-20 11:05:09.572125: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 33.81M (35448576 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Traceback (most recent call last):
  File "sample.py", line 4, in <module>
    tf.keras.metrics.Precision(name='precision')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 1186, in __init__
    initializer=init_ops.zeros_initializer)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 276, in add_weight
    aggregation=aggregation)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 446, in add_weight
    caching_device=caching_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/tracking/base.py", line 744, in _add_variable_with_custom_getter
    **kwargs_for_getter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer_utils.py", line 142, in make_variable
    shape=variable_shape if variable_shape else None)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 258, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 219, in _variable_v1_call
    shape=shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 197, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 2596, in default_variable_creator
    shape=shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 262, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1411, in __init__
    distribute_strategy=distribute_strategy)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1557, in _init_from_args
    graph_mode=self._in_graph_mode)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 232, in eager_safe_variable_handle
    shape, dtype, shared_name, name, graph_mode, initial_value)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 164, in _variable_handle_from_shape_and_dtype
    math_ops.logical_not(exists), [exists], name="EagerVariableNameReuse")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_logging_ops.py", line 55, in _assert
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [0] [Op:Assert] name: EagerVariableNameReuse
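One thing that stands out when comparing the two outputs: nvidia-smi reports GPU 0 at 7931MiB / 8119MiB, and the log says TensorFlow created GPU:0 with only 37 MB of memory, while GPU:1 got 7460 MB. One plausible reading (an assumption, not confirmed anywhere in this post) is that something outside the container is already occupying GPU 0, so the metric's variable fails to allocate there. Under that assumption, a possible workaround is to hide the busy GPU before TensorFlow starts. The sketch below only uses the standard library; `free_mib_per_gpu` and `pick_gpu` are hypothetical helper names, and the two rows are copied from the nvidia-smi output in the question.

```python
import os
import re

# Memory rows copied from the nvidia-smi output in the question.
SMI_ROWS = """\
| 54% 68C P2 137W / 200W | 7931MiB / 8119MiB | 81% Default |
| 33% 34C P8 10W / 200W | 115MiB / 8119MiB | 0% Default |
"""


def free_mib_per_gpu(smi_text):
    """Return free memory in MiB for each 'used MiB / total MiB' pair, in row order."""
    pairs = re.findall(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_text)
    return [int(total) - int(used) for used, total in pairs]


def pick_gpu(smi_text):
    """Return the index of the GPU with the most free memory (hypothetical helper)."""
    free = free_mib_per_gpu(smi_text)
    return max(range(len(free)), key=free.__getitem__)


free = free_mib_per_gpu(SMI_ROWS)
best = pick_gpu(SMI_ROWS)

# Make CUDA number devices the same way nvidia-smi does (by PCI bus ID),
# then expose only the chosen GPU. Both variables have to be set before
# TensorFlow initializes CUDA; setting them before the import is the safe option.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = str(best)

# import tensorflow as tf   # would now see a single GPU, assuming the reading above holds
```

With the numbers from the question this selects GPU 1. If the diagnosis is wrong (for example, if the memory on GPU 0 is held by this same process), hiding the device will not help, so treat this as something to try rather than a confirmed fix.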