Amazon Web Services: ping healthcheck failing after deploying a TF 2.1 model with the TF Serving container on AWS SageMaker


We want to deploy a trained TensorFlow model to AWS SageMaker for inference with the TensorFlow Serving container. The TensorFlow version is 2.1. Following the guide, the following steps have been taken:

  • Built the TF 2.1 image and published it to AWS ECR after successful local testing
  • Set up the SageMaker execution role permissions for S3 and ECR
  • Packed the saved TF model folder (saved_model.pb, assets, variables) into model.tar.gz (a packing sketch follows this list)
  • Created an endpoint with a real-time predictor (code below)
  • Created a batch transform job (code below)
  • Steps 4 and 5 both run, and in the AWS CloudWatch logs we can see the instance starting successfully, the model being loaded, and TF Serving entering its event loop – see below:
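
A minimal sketch of the packing in step 3 (assumptions: the SavedModel was exported locally to a folder here called export/1, which is hypothetical; TF Serving scans for numeric version subdirectories under the model base path, so the SavedModel is placed under 1/ inside the archive):

    import tarfile

    # Pack the SavedModel folder (saved_model.pb, assets/, variables/)
    # into model.tar.gz under a numeric version directory, since TF
    # Serving looks for servable versions in <base_path>/<version>/.
    # 'export/1' is a hypothetical local export path.
    with tarfile.open('model.tar.gz', 'w:gz') as tar:
        tar.add('export/1', arcname='1')

The CloudWatch log excerpt from step 6: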

    2020-07-08T17:07:16.156+02:00 INFO:main:starting services
    2020-07-08T17:07:16.156+02:00 INFO:main:nginx config:
    2020-07-08T17:07:16.156+02:00 load_module modules/ngx_http_js_module.so;
    2020-07-08T17:07:16.156+02:00 worker_processes auto;
    2020-07-08T17:07:16.156+02:00 daemon off;
    2020-07-08T17:07:16.156+02:00 pid /tmp/nginx.pid;
    2020-07-08T17:07:16.157+02:00 error_log /dev/stderr;
    2020-07-08T17:07:16.157+02:00 worker_rlimit_nofile 4096;
    2020-07-08T17:07:16.157+02:00 events { worker_connections 2048;
    2020-07-08T17:07:16.157+02:00 }
    2020-07-08T17:07:16.162+02:00 http { include /etc/nginx/mime.types; default_type application/json; access_log /dev/stdout combined; js_include tensorflow-serving.js; upstream tfs_upstream { server localhost:10001; } upstream gunicorn_upstream { server unix:/tmp/gunicorn.sock fail_timeout=1; } server { listen 8080 deferred; client_max_body_size 0; client_body_buffer_size 100m; subrequest_output_buffer_size 100m; set $tfs_version 2.1; set $default_tfs_model None; location /tfs { rewrite ^/tfs/(.*) /$1 break; proxy_redirect off; proxy_pass_request_headers off; proxy_set_header Content-Type 'application/json'; proxy_set_header Accept 'application/json'; proxy_pass http://tfs_upstream; } location /ping { js_content ping; } location /invocations { js_content invocations; } location /models { proxy_pass http://gunicorn_upstream/models; } location / { return 404 '{"error": "not found"}'; } keepalive_timeout 3; }
    2020-07-08T17:07:16.162+02:00 }
    2020-07-08T17:07:16.162+02:00 INFO:tfs_utils:using default model name: model
    2020-07-08T17:07:16.162+02:00 INFO:tfs_utils:tensorflow serving model config:
    2020-07-08T17:07:16.162+02:00 model_config_list: { config: { name: "model", base_path: "/opt/ml/model", model_platform: "tensorflow" }
    2020-07-08T17:07:16.162+02:00 }
    2020-07-08T17:07:16.162+02:00 INFO:main:using default model name: model
    2020-07-08T17:07:16.162+02:00 INFO:main:tensorflow serving model config:
    2020-07-08T17:07:16.163+02:00 model_config_list: { config: { name: "model", base_path: "/opt/ml/model", model_platform: "tensorflow" }
    2020-07-08T17:07:16.163+02:00 }
    2020-07-08T17:07:16.163+02:00 INFO:main:tensorflow version info:
    2020-07-08T17:07:16.163+02:00 TensorFlow ModelServer: 2.1.0-rc1+dev.sha.075ffcf
    2020-07-08T17:07:16.163+02:00 TensorFlow Library: 2.1.0
    2020-07-08T17:07:16.163+02:00 INFO:main:tensorflow serving command: tensorflow_model_server --port=10000 --rest_api_port=10001 --model_config_file=/sagemaker/model-config.cfg --max_num_load_retries=0
    2020-07-08T17:07:16.163+02:00 INFO:main:started tensorflow serving (pid: 13)
    2020-07-08T17:07:16.163+02:00 INFO:main:nginx version info:
    2020-07-08T17:07:16.163+02:00 nginx version: nginx/1.18.0
    2020-07-08T17:07:16.163+02:00 built by gcc 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
    2020-07-08T17:07:16.163+02:00 built with OpenSSL 1.1.1  11 Sep 2018
    2020-07-08T17:07:16.163+02:00 TLS SNI support enabled
    2020-07-08T17:07:16.163+02:00 configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --user=nginx --group=nginx --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fdebug-prefix-map=/data/builder/debuild/nginx-1.18.0/debian/debuild-base/nginx-1.18.0=. -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'
    2020-07-08T17:07:16.163+02:00 INFO:main:starting nginx (pid: 15)
    2020-07-08T17:07:16.163+02:00 2020-07-08 15:07:15.075708: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
    2020-07-08T17:07:16.163+02:00 2020-07-08 15:07:15.075760: I tensorflow_serving/model_servers/server_core.cc:573] (Re-)adding model: model
    2020-07-08T17:07:16.163+02:00 2020-07-08 15:07:1
The code used for steps 4 and 5 (account-specific values are masked with XXXX):

    import os
    import sagemaker
    from sagemaker.tensorflow.serving import Model
    from sagemaker.tensorflow.model import TensorFlowModel
    from sagemaker.predictor import json_deserializer, json_serializer, RealTimePredictor
    from sagemaker.content_types import CONTENT_TYPE_JSON
    
    def create_tfs_sagemaker_model():
        sagemaker_session = sagemaker.Session()
        role = 'arn:aws:iam::XXXXXXXXX:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXX'
        bucket = 'tf-serving'
        prefix = 'sagemaker/tfs-test'
        s3_path = 's3://{}/{}'.format(bucket, prefix)
        image = 'XXXXXXXX.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-tensorflow-serving:2.1.0-cpu'
        model_data = sagemaker_session.upload_data('model.tar.gz', bucket, os.path.join(prefix, 'model'))
        endpoint_name = 'tf-serving-ep-test-1'
        tensorflow_serving_model = Model(model_data=model_data, role=role, sagemaker_session=sagemaker_session, image=image, framework_version='2.1')
        tensorflow_serving_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)
        rt_predictor = RealTimePredictor(endpoint=endpoint_name, sagemaker_session=sagemaker_session, serializer=json_serializer, content_type=CONTENT_TYPE_JSON, accept=CONTENT_TYPE_JSON)
    
    def create_tfs_sagemaker_batch_transform():
        sagemaker_session = sagemaker.Session()
        print(sagemaker_session.boto_region_name)
        role = 'arn:aws:iam::XXXXXXXXXXX:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXXX'
        bucket = 'XXXXXXX-tf-serving'
        prefix = 'sagemaker/tfs-test'
        image = 'XXXXXXXXXX.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-tensorflow-serving:2.1.0-cpu'
        s3_path = 's3://{}/{}'.format(bucket, prefix)
        model_data = sagemaker_session.upload_data('model.tar.gz', bucket, os.path.join(prefix, 'model'))
        tensorflow_serving_model = Model(model_data=model_data, role=role, sagemaker_session=sagemaker_session, image=image, name='deep-net-0', framework_version='2.1')
        print(tensorflow_serving_model.model_data)
        out_path = 's3://XXXXXX-serving-out/'
        input_path = "s3://XXXXXX-serving-in/"    
        tensorflow_serving_transformer = tensorflow_serving_model.transformer(instance_count=1, instance_type='ml.c4.xlarge', accept='application/json', output_path=out_path)
        tensorflow_serving_transformer.transform(input_path, content_type='application/json')
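
For reference, the endpoint's state can be checked while it deploys with a small boto3 snippet (describe_endpoint is the standard SageMaker API; the endpoint name is the one used above):

    import boto3

    # The ping healthcheck failure surfaces in FailureReason once the
    # endpoint leaves the 'Creating' state.
    sm = boto3.client('sagemaker')
    desc = sm.describe_endpoint(EndpointName='tf-serving-ep-test-1')
    print(desc['EndpointStatus'], desc.get('FailureReason'))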
    
However, the ping healthcheck then fails for both. The startup logs report the default model name as "model":

    2020-07-08T17:07:16.162+02:00 INFO:main:using default model name: model
    2020-07-08T17:07:16.162+02:00 INFO:main:tensorflow serving model config:

yet the healthcheck error shows TF Serving being asked for a model named "None":

    Could not find any versions of model None
    
Looking at the serving container's source (class PythonServiceResource in the sagemaker-tensorflow-serving-container project), the default model name falls back to the string "None" when TFS_DEFAULT_MODEL_NAME is not set in the environment:

    class PythonServiceResource:
    
        def __init__(self):
            if SAGEMAKER_MULTI_MODEL_ENABLED:
                self._model_tfs_rest_port = {}
                self._model_tfs_grpc_port = {}
                self._model_tfs_pid = {}
                self._tfs_ports = self._parse_sagemaker_port_range(SAGEMAKER_TFS_PORT_RANGE)
            else:
                self._tfs_grpc_port = TFS_GRPC_PORT
                self._tfs_rest_port = TFS_REST_PORT
    
            self._tfs_enable_batching = SAGEMAKER_BATCHING_ENABLED == 'true'
            self._tfs_default_model_name = os.environ.get('TFS_DEFAULT_MODEL_NAME', "None")
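
A minimal sketch of the failure mode this causes (the URL follows TF Serving's REST API, which the container's /ping route appears to query via the nginx tfs_upstream; localhost:10001 is the rest_api_port from the serving command in the logs above):

    import os

    # With TFS_DEFAULT_MODEL_NAME unset, the *string* "None" becomes
    # the model name, so the healthcheck asks TF Serving for a model
    # that was never loaded.
    model_name = os.environ.get('TFS_DEFAULT_MODEL_NAME', 'None')
    health_url = 'http://localhost:10001/v1/models/{}'.format(model_name)
    print(health_url)  # -> http://localhost:10001/v1/models/None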
    
The code was therefore changed to pass the default model name explicitly through the container environment (SAGEMAKER_TFS_DEFAULT_MODEL_NAME) and to name the model 'model':

    import os
    import sagemaker
    from sagemaker.tensorflow.serving import Model
    from sagemaker.tensorflow.model import TensorFlowModel
    from sagemaker.predictor import json_deserializer, json_serializer, RealTimePredictor
    from sagemaker.content_types import CONTENT_TYPE_JSON
    
    def create_tfs_sagemaker_model():
        sagemaker_session = sagemaker.Session()
        role = 'arn:aws:iam::XXXXXXXXX:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXX'
        bucket = 'tf-serving'
        prefix = 'sagemaker/tfs-test'
        s3_path = 's3://{}/{}'.format(bucket, prefix)
        image = 'XXXXXXXX.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-tensorflow-serving:2.1.0-cpu'
        model_data = sagemaker_session.upload_data('model.tar.gz', bucket, os.path.join(prefix, 'model'))
        endpoint_name = 'tf-serving-ep-test-1'
        env = {"SAGEMAKER_TFS_DEFAULT_MODEL_NAME": "model"}
        tensorflow_serving_model = Model(model_data=model_data, role=role, sagemaker_session=sagemaker_session, image=image, name='model', framework_version='2.1', env=env)
        tensorflow_serving_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)
        rt_predictor = RealTimePredictor(endpoint=endpoint_name, sagemaker_session=sagemaker_session, serializer=json_serializer, content_type=CONTENT_TYPE_JSON, accept=CONTENT_TYPE_JSON)
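
Once the endpoint is InService, a quick smoke test against the real-time predictor might look like this (sketch only: the payload is hypothetical, the 'instances' key is TF Serving's standard REST input format, and the actual shape must match the model's serving signature):

    # rt_predictor as constructed above; json_serializer encodes the dict
    # and the raw JSON response bytes are returned.
    result = rt_predictor.predict({'instances': [[0.0, 1.0, 2.0]]})
    print(result)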