Python: Unable to train a TensorFlow model container in SageMaker

I'm new to SageMaker and Docker. I'm trying to train my own custom object detection algorithm in SageMaker using an ECS container. I'm using the files from this repo:

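I followed the instructions exactly and was able to run the image in a container on my local machine without any problems. But when I push the image to ECS and run it in SageMaker, I get the following message in CloudWatch: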

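I understand that, for some reason, the image suddenly can't find python once it is deployed to ECS. The top of my training script contains the line #!/usr/bin/env python. I tried running the *which python* command and changing that line to point to #!/usr/local/bin python, but that only produced additional errors. I don't understand why this image works on my local machine (tested with Docker on Windows and with Docker CE under WSL). Here is a snippet of the Dockerfile: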

ARG ARCHITECTURE=1.15.0-gpu
FROM tensorflow/tensorflow:${ARCHITECTURE}-py3

RUN apt-get update && apt-get install -y --no-install-recommends \
        wget zip unzip git ca-certificates curl nginx python

# We need to install Protocol Buffers (Protobuf). Protobuf is Google's language and platform-neutral,  
# extensible mechanism for serializing structured data. To make sure you are using the most updated code,
# replace the linked release below with the latest version available on the Git repository.
RUN curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.10.1/protoc-3.10.1-linux-x86_64.zip
RUN unzip protoc-3.10.1-linux-x86_64.zip -d protoc3
RUN mv protoc3/bin/* /usr/local/bin/
RUN mv protoc3/include/* /usr/local/include/

# Let's add the folder that we are going to be using to install all of our machine learning-related code
# to the PATH. This is the folder used by SageMaker to find and run our code.
ENV PATH="/opt/ml/code:${PATH}"
RUN mkdir -p /opt/ml/code
WORKDIR /opt/ml/code

RUN pip install --upgrade pip
RUN pip install cython
RUN pip install contextlib2
RUN pip install pillow
RUN pip install lxml
RUN pip install matplotlib
RUN pip install flask
RUN pip install gevent
RUN pip install gunicorn
RUN pip install pycocotools

# Let's now download Tensorflow from the official Git repository and install Tensorflow Slim from
# its folder.
RUN git clone https://github.com/tensorflow/models/ tensorflow-models
RUN pip install -e tensorflow-models/research/slim

# We can now install the Object Detection API, also part of the Tensorflow repository. We are going to change
# the working directory for a minute so we can do this easily.
WORKDIR /opt/ml/code/tensorflow-models/research
RUN protoc object_detection/protos/*.proto --python_out=.
RUN python setup.py build
RUN python setup.py install

# If you are interested in using COCO evaluation metrics, you can run the following commands to add the
# necessary resources to your Tensorflow installation.
RUN git clone https://github.com/cocodataset/cocoapi.git
WORKDIR /opt/ml/code/tensorflow-models/research/cocoapi/PythonAPI
RUN make 
RUN cp -r pycocotools /opt/ml/code/tensorflow-models/research/

# Let's put the working directory back to where it needs to be, copy all of our code, and update the PYTHONPATH
# to include the newly installed Tensorflow libraries.
WORKDIR /opt/ml/code
COPY /code /opt/ml/code

ENV PYTHONPATH=${PYTHONPATH}:tensorflow-models/research:tensorflow-models/research/slim:tensorflow-models/research/object_detection

RUN chmod +x /opt/ml/code/train
CMD ["/bin/bash","-c","chmod +x /opt/ml/code/train && /opt/ml/code/train"]
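
The question turns on which interpreter the container resolves at run time. As a minimal sketch, assuming it is run inside the built image (for example with `docker run`), the following lists what `python`, `python3`, and `train` resolve to on the PATH. The helper name `resolve_on_path` is hypothetical and is not part of the repo.

```python
import shutil
import sys


def resolve_on_path(names, path=None):
    """Map each command name to the executable found on PATH, or None."""
    # shutil.which mirrors the shell's lookup: it walks the PATH entries in order.
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    # Show the interpreter actually running this check, then the PATH lookups.
    print("running under:", sys.executable, sys.version.split()[0])
    for name, location in resolve_on_path(["python", "python3", "train"]).items():
        print(f"{name}: {location or 'not found on PATH'}")
```

Comparing this output between the local run and the SageMaker run may show whether the two environments resolve `python` differently. This is only a diagnostic aid, not a fix.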

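The FROM is at the top; it uses the official TensorFlow Docker image as its base. In this case that would be tensorflow/tensorflow:1.15.0-gpu-py3

This may help you

I have already tried that, without success. I also tried building and deploying from Ubuntu.

Can you determine which step fails? It seems that, for some reason, python is not on the PATH.

It happens at the very first step. SageMaker automatically runs whatever script is named train, but train is an executable, so it must have the shebang at the top.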
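
Because `train` is executed directly, its first line decides which interpreter is launched. The sketch below is a hypothetical helper (`inspect_shebang` is not part of the repo): it reads that first line, reports the command it names, and checks whether the command resolves inside the container. It also flags Windows-style CRLF line endings. That last check is an assumption: the thread does not confirm that line endings are the cause here, but a trailing carriage return on the shebang line can make the interpreter name unresolvable on Linux.

```python
import os
import shutil


def inspect_shebang(script_path):
    """Return basic facts about the first line of an executable script."""
    with open(script_path, "rb") as handle:
        first_line = handle.readline()

    # Assumption: files edited on Windows may carry CRLF line endings.
    has_crlf = first_line.endswith(b"\r\n")
    text = first_line.decode("utf-8", errors="replace").strip()

    info = {"has_shebang": text.startswith("#!"), "command": [],
            "has_crlf": has_crlf, "resolved": None}
    if not info["has_shebang"]:
        return info

    command = text[2:].split()
    info["command"] = command
    if command:
        # "#!/usr/bin/env python" asks env to look up "python" on PATH;
        # a direct path such as "#!/usr/local/bin/python" is used as given.
        uses_env = os.path.basename(command[0]) == "env" and len(command) > 1
        target = command[1] if uses_env else command[0]
        info["resolved"] = shutil.which(target)
    return info


if __name__ == "__main__":
    # /opt/ml/code/train is the path used in the Dockerfile above.
    if os.path.exists("/opt/ml/code/train"):
        print(inspect_shebang("/opt/ml/code/train"))
```

If `has_crlf` is true or `resolved` is `None`, the shebang line itself would be the first thing to look at. The simplified handling of `env` ignores flags such as `-S`.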