PyTorch BERT model loading does not work with the PyTorch 1.3.1-eia container


I have a custom Python file for inference that implements the functions `model_fn`, `input_fn`, `predict_fn`, and `output_fn`. I save the model as TorchScript using `torch.jit.trace` and `torch.jit.save`, and load it with `torch.jit.load`. `model_fn` is implemented as follows:

import torch
import os
import logging

logger = logging.getLogger()

# The SageMaker PyTorch serving container sets this variable to "true"
# when an Elastic Inference accelerator is attached to the endpoint.
is_ei = os.getenv("SAGEMAKER_INFERENCE_ACCELERATOR_PRESENT") == "true"
logger.warn(f"Elastic Inference enabled: {is_ei}")

def model_fn(model_dir):
    # Load the TorchScript artifact saved with torch.jit.save.
    model_path = os.path.join(model_dir, "model_best.pt")
    try:
        loaded_model = torch.jit.load(model_path, map_location=torch.device('cpu'))
        loaded_model.eval()
        return loaded_model
    except Exception as e:
        logger.exception(f"Exception in model fn {e}")
        return None
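
For reference, the trace-and-save step can look like the minimal sketch below; the Hugging Face classifier class, dummy input shapes, and output file name here are assumptions, not the exact training code:

    # Sketch of tracing and saving the model (assumed BERT classifier and shapes).
    import torch
    from transformers import BertForSequenceClassification

    # torchscript=True makes the model return tuples so it can be traced.
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", torchscript=True
    )
    model.eval()

    # Dummy inputs with the shapes expected at inference time (assumed seq_len=128).
    dummy_input_ids = torch.zeros(1, 128, dtype=torch.long)
    dummy_attention_mask = torch.ones(1, 128, dtype=torch.long)

    traced = torch.jit.trace(model, (dummy_input_ids, dummy_attention_mask))
    torch.jit.save(traced, "model_best.pt")
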
This implementation works fine with the PyTorch 1.5 container. But with the PyTorch 1.3.1 container, torch exits abruptly while loading the pretrained model, without any logs. The only lines I see in the logs are:

algo-1-nvqf7_1  | 2020-11-30 07:17:15,392 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died.
algo-1-nvqf7_1  | 2020-11-30 07:17:15,393 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9000-44f1cd64 in 1 seconds.
The worker dies and is retried, and this repeats until I stop the container.
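
Since the worker dies without a Python traceback, one diagnostic sketch (not part of the original setup) is to enable the standard library's `faulthandler` at the top of the inference script, so a native crash during `torch.jit.load` at least dumps a stack trace to stderr:

    # Diagnostic sketch (assumed addition, not in the original inference script):
    # print a native stack trace to stderr if the worker crashes outside Python.
    import faulthandler
    import sys

    faulthandler.enable(file=sys.stderr, all_threads=True)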

The model I am using was trained with PyTorch 1.5. But since Elastic Inference (EI) support only goes up to 1.3.1, I am using that container.

Things I have tried:

  • The same code with the same model also works outside the container with PyTorch 1.3.1, so I don't think PyTorch version compatibility is the issue.
  • Tried the `debug` and `NOTSET` log levels. Got no additional information about why model loading fails.
  • Tried loading the original model instead of the traced one. Again, it works with 1.5 but not with 1.3.1; it fails at the same point while loading the BERT pretrained model.
  • Tried this setup on a SageMaker notebook instance with a GPU accelerator attached, using SageMaker `PyTorchModel`'s `deploy()` function with `framework_version` 1.3.1 (see the sketch after this list). Also tried the 1.3.1 container without `eia`. Same behavior everywhere.
  • Am I doing something wrong, or am I missing something important from the documentation? Any help would be appreciated.
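
The notebook-side deployment mentioned in the list above can be sketched as follows; the S3 path, IAM role, entry point name, instance type, and accelerator type are placeholders/assumptions:

    # Sketch of the SageMaker deployment described above (placeholder values).
    from sagemaker.pytorch import PyTorchModel

    pytorch_model = PyTorchModel(
        model_data="s3://my-bucket/model.tar.gz",              # placeholder S3 artifact
        role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder IAM role
        entry_point="inference.py",                            # custom inference script
        framework_version="1.3.1",
        py_version="py3",
    )

    predictor = pytorch_model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",       # placeholder instance type
        accelerator_type="ml.eia2.medium",  # attaches an Elastic Inference accelerator
    )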

    **PyTorch container logs, 1.3.1-eia**

    algo-1-nvqf7_1  | 2020-11-30 07:17:14,333 [INFO ] main com.amazonaws.ml.mms.ModelServer - 
    algo-1-nvqf7_1  | MMS Home: /opt/conda/lib/python3.6/site-packages
    algo-1-nvqf7_1  | Current directory: /
    algo-1-nvqf7_1  | Temp directory: /home/model-server/tmp
    algo-1-nvqf7_1  | Number of GPUs: 0
    algo-1-nvqf7_1  | Number of CPUs: 8
    algo-1-nvqf7_1  | Max heap size: 6972 M
    algo-1-nvqf7_1  | Python executable: /opt/conda/bin/python
    algo-1-nvqf7_1  | Config file: /etc/sagemaker-mms.properties
    algo-1-nvqf7_1  | Inference address: http://0.0.0.0:8080
    algo-1-nvqf7_1  | Management address: http://0.0.0.0:8080
    algo-1-nvqf7_1  | Model Store: /.sagemaker/mms/models
    algo-1-nvqf7_1  | Initial Models: ALL
    algo-1-nvqf7_1  | Log dir: /logs
    algo-1-nvqf7_1  | Metrics dir: /logs
    algo-1-nvqf7_1  | Netty threads: 0
    algo-1-nvqf7_1  | Netty client threads: 0
    algo-1-nvqf7_1  | Default workers per model: 1
    algo-1-nvqf7_1  | Blacklist Regex: N/A
    algo-1-nvqf7_1  | Maximum Response Size: 6553500
    algo-1-nvqf7_1  | Maximum Request Size: 6553500
    algo-1-nvqf7_1  | Preload model: false
    algo-1-nvqf7_1  | Prefer direct buffer: false
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,391 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-model
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,481 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /home/model-server/tmp/.mms.sock.9000 --handler sagemaker_pytorch_serving_container.handler_service --model-path /.sagemaker/mms/models/model --model-name model --preload-model false --tmp-dir /home/model-server/tmp
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,482 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.mms.sock.9000
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,482 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 51
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,482 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,483 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.6.6
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,483 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model model loaded.
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,487 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,496 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,544 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,545 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
    algo-1-nvqf7_1  | Model server started.
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,547 [WARN ] pool-2-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
    algo-1-nvqf7_1  | 2020-11-30 07:17:14,962 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - PyTorch version 1.3.1 available.
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,314 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Lock 140580224398952 acquired on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,315 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpcln39mxo
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,344 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,349 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,349 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - storing https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt in cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,349 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - creating metadata file for /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,349 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Lock 140580224398952 released on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,350 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,378 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Created tokenizer
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,378 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Elastic Inference enabled: True
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,378 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - inside model fn
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,379 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - /.sagemaker/mms/models/model
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,379 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - /.sagemaker/mms/models/model/model.pt
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,379 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ['model.pt', 'model.tar.gz', 'code', 'model_tn_best.pth', 'MAR-INF']
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,379 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Loading torch script
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,392 [INFO ] epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - 9000-44f1cd64 Worker disconnected. WORKER_STARTED
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,392 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died.
    algo-1-nvqf7_1  | 2020-11-30 07:17:15,393 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9000-44f1cd64 in 1 seconds.
    algo-1-nvqf7_1  | 2020-11-30 07:17:16,065 [INFO ] W-9000-model ACCESS_LOG - /172.18.0.1:45110 "GET /ping HTTP/1.1" 200 8
    