Keras/TensorFlow does not detect the GPU on a SageMaker-managed AWS ml.p2.xlarge instance


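I'm using a custom Docker container with SageMaker on an `ml.p2.xlarge` instance. The base image is one that normally ships with the required CUDA toolkit. The Python packages are installed via conda using the following minimal `environment.yaml`: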

dependencies:
  - boto3
  - joblib
  - keras
  - numpy
  - pandas
  - scikit-learn
  - scipy
  - tensorflow=2.0

However, when I run a training job for a small `lenet5` CNN, I see no GPU activity in the logs (training takes as long as it does on a non-GPU instance).

More worryingly, `len(tf.config.experimental.list_physical_devices('GPU'))` is 0 and `K.tensorflow_backend._get_available_gpus()` returns an empty list. Finally, if I check device placement (using `tf.debugging.set_log_device_placement(True)`) for a basic operation such as:

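import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # log the device each op runs on
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)  # 2x3 times 3x2 gives a 2x2 result
print(c)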

I get:

Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0

which confirms that the operation ran on the CPU.
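When a training job emits many placement lines like the one above, it can help to tally them per device rather than read them one by one. The following is a minimal sketch that assumes the log format shown in the question; `parse_placement` and `ops_by_device` are hypothetical helper names, not part of TensorFlow.

```python
import re

# Matches placement log lines such as:
# "Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0"
PLACEMENT_RE = re.compile(
    r"Executing op (?P<op>\S+) in device \S*/device:(?P<kind>[A-Za-z_]+):(?P<index>\d+)"
)


def parse_placement(line):
    """Return (op, device_kind, device_index), or None if the line does not match."""
    match = PLACEMENT_RE.search(line)
    if match is None:
        return None
    return match.group("op"), match.group("kind"), int(match.group("index"))


def ops_by_device(lines):
    """Count how many logged ops ran on each device kind (e.g. CPU, GPU)."""
    counts = {}
    for line in lines:
        parsed = parse_placement(line)
        if parsed is not None:
            kind = parsed[1]
            counts[kind] = counts.get(kind, 0) + 1
    return counts
```

If the resulting dictionary has no `GPU` key at all, no logged op was placed on a GPU, which matches the symptom described here.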

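At first I thought my use case was too light to trigger GPU usage, but it seems the GPU is not being detected at all! Am I missing a step or component needed for this to work?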
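To narrow down where detection fails, the checks from the question can be gathered into one report that is run inside the container. This is a sketch under assumptions: `gpu_report` and `nvidia_smi_listing` are hypothetical helper names, and the script assumes `nvidia-smi` is on the `PATH` whenever the NVIDIA driver is exposed to the container.

```python
import shutil
import subprocess


def nvidia_smi_listing():
    """Return the output of `nvidia-smi -L`, or None if the tool is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return (result.stdout or result.stderr).strip()


def gpu_report():
    """Collect driver-level and TensorFlow-level GPU visibility in one dict."""
    report = {
        "nvidia_smi": nvidia_smi_listing(),
        "tf_version": None,
        "built_with_cuda": None,
        "gpus": None,
    }
    try:
        import tensorflow as tf
    except ImportError:
        return report  # TensorFlow not installed; only driver info is available
    report["tf_version"] = tf.__version__
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    report["gpus"] = [
        d.name for d in tf.config.experimental.list_physical_devices("GPU")
    ]
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

As a rough reading of the result: if `nvidia_smi` lists a device but `built_with_cuda` is `False`, the installed TensorFlow package is likely a CPU-only build; if `built_with_cuda` is `True` but `gpus` is empty, the CUDA or cuDNN libraries in the image are the more likely suspect.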

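I recommend starting from the environments SageMaker provides, so that you have a tested, up-to-date, production-ready setup. For TensorFlow and Keras specifically, that means: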

- On SageMaker notebooks, the `conda_tensorflow_p*` Jupyter kernels
- For SageMaker training and inference jobs, the TensorFlow framework containers

RUN apt-get update \
 && apt-get install -y --no-install-recommends --allow-unauthenticated \
    python3-dev \
    python3-pip \
    python3-setuptools \
    python3-dev \
    ca-certificates \
    cuda-command-line-tools-10-0 \
    cuda-cublas-dev-10-0 \
    cuda-cudart-dev-10-0 \
    cuda-cufft-dev-10-0 \
    cuda-curand-dev-10-0 \
    cuda-cusolver-dev-10-0 \
    cuda-cusparse-dev-10-0 \
    curl \
    libcudnn7=7.5.1.10-1+cuda10.0 \
    # TensorFlow doesn't require libnccl anymore but Open MPI still depends on it
    libnccl2=2.4.7-1+cuda10.0 \
    libgomp1 \
    libnccl-dev=2.4.7-1+cuda10.0 \
    libfreetype6-dev \
    libhdf5-serial-dev \
    libpng-dev \
    libzmq3-dev \
    git \
    wget \
    vim \
    build-essential \
    openssh-client \
    openssh-server \
    zlib1g-dev \
    # The 'apt-get install' of nvinfer-runtime-trt-repo-ubuntu1804-5.0.2-ga-cuda10.0
    # adds a new list which contains libnvinfer library, so it needs another
    # 'apt-get update' to retrieve that list before it can actually install the
    # library.
    # We don't install libnvinfer-dev since we don't need to build against TensorRT,
    # and libnvinfer4 doesn't contain libnvinfer.a static library.
 && apt-get update && apt-get install -y --no-install-recommends --allow-unauthenticated  \
    nvinfer-runtime-trt-repo-ubuntu1804-5.0.2-ga-cuda10.0 \
 && apt-get update && apt-get install -y --no-install-recommends --allow-unauthenticated  \
    libnvinfer5=5.0.2-1+cuda10.0 \
 && rm /usr/lib/x86_64-linux-gnu/libnvinfer_plugin* \
 && rm /usr/lib/x86_64-linux-gnu/libnvcaffe_parser* \
 && rm /usr/lib/x86_64-linux-gnu/libnvparsers* \
 && rm -rf /var/lib/apt/lists/* \
 && mkdir -p /var/run/sshd

SageMaker TensorFlow container

TensorFlow GPU 2.1.0
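To compare a custom image against the package list above, one option is to check whether the corresponding CUDA shared libraries can be located from Python. This is only a sketch: the short library names below are assumed from the `cuda-*` and `libcudnn7` packages in the excerpt, and `find_cuda_libs` and `missing_cuda_libs` are hypothetical helper names.

```python
import ctypes.util

# Short names assumed from the packages in the Dockerfile excerpt above.
CUDA_LIBS = ("cudart", "cublas", "cufft", "curand", "cusolver", "cusparse", "cudnn")


def find_cuda_libs(names=CUDA_LIBS):
    """Map each library name to the file name the system linker finds, or None."""
    return {name: ctypes.util.find_library(name) for name in names}


def missing_cuda_libs(names=CUDA_LIBS):
    """Return the sorted names of libraries that could not be located."""
    found = find_cuda_libs(names)
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    for name, path in find_cuda_libs().items():
        print(f"{name}: {path if path else 'NOT FOUND'}")
```

Note that `ctypes.util.find_library` only searches the standard linker locations, so a library installed under a directory that is not registered with the dynamic linker may be reported as missing even though it exists on disk.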