PyTorch BERT model loading fails on the PyTorch 1.3.1-eia container
I have a custom Python file for inference that implements the functions model_fn, input_fn, predict_fn, and output_fn. I save the model as TorchScript using torch.jit.trace and torch.jit.save, and load it with torch.jit.load. model_fn is implemented as follows:
import torch
import os
import logging

logger = logging.getLogger()

is_ei = os.getenv("SAGEMAKER_INFERENCE_ACCELERATOR_PRESENT") == "true"
logger.warning(f"Elastic Inference enabled: {is_ei}")

def model_fn(model_dir):
    model_path = os.path.join(model_dir, "model_best.pt")
    try:
        loaded_model = torch.jit.load(model_path, map_location=torch.device('cpu'))
        loaded_model.eval()
        return loaded_model
    except Exception as e:
        logger.exception(f"Exception in model_fn: {e}")
        return None
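For context, the trace-and-save side mentioned above can be sketched roughly as follows. A tiny stand-in module is used here instead of the actual BERT model, and the input shape is an illustrative assumption, not from the question; only the file name matches what model_fn expects:

```python
import torch

# Stand-in for the real model; the question traces a BERT model instead.
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)

model = TinyModel().eval()
example_input = torch.randn(1, 4)

# torch.jit.trace records the ops executed on the example input and yields
# a TorchScript module that can later be loaded without the Python class.
traced = torch.jit.trace(model, example_input)
torch.jit.save(traced, "model_best.pt")

# Loading mirrors model_fn above.
loaded = torch.jit.load("model_best.pt", map_location=torch.device("cpu"))
loaded.eval()
```

The save/load pair only round-trips reliably when the saving and loading PyTorch versions are compatible.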
This implementation works fine on the container with PyTorch 1.5. But on the container with PyTorch 1.3.1, torch exits abruptly while loading the pretrained model, without any traceback. The only lines I see in the logs are
algo-1-nvqf7_1 | 2020-11-30 07:17:15,392 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died.
algo-1-nvqf7_1 | 2020-11-30 07:17:15,393 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9000-44f1cd64 in 1 seconds.
The worker dies and is retried, and this repeats until I stop the container.

The model I am using was trained with PyTorch 1.5. But since Elastic Inference support only goes up to 1.3.1, I am using this container.
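One detail worth noting here: TorchScript archives are backward compatible (newer runtimes load older archives), but an archive saved under a newer torch such as 1.5 is not guaranteed to load under 1.3.1. A small, hypothetical sketch for making the runtime version visible in the worker logs before torch.jit.load is attempted (the helper name is mine, not from the question):

```python
import re
import torch

def torch_version_tuple():
    """Parse the leading numeric part of torch.__version__,
    e.g. "1.3.1" or "1.5.0+cpu" -> (1, 5, 0)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", torch.__version__)
    return tuple(int(g) for g in m.groups())

# Logged at worker start-up, this makes an archive/runtime mismatch
# (a 1.5-traced model on a 1.3.1 worker) easy to spot in the MMS logs.
print(f"torch runtime: {torch.__version__} -> {torch_version_tuple()}")
```

If the versions differ, re-tracing and re-saving the model in an environment pinned to the serving container's torch version is the safest way to rule this out.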
Things I have tried:

Enabled debug and notset log levels; got no additional information about why model loading fails.
Tried this setup with PyTorchModel's deploy() function and framework_version set to 1.3.1. Also tried the 1.3.1 container without eia.

Same behavior everywhere.
algo-1-nvqf7_1 | 2020-11-30 07:17:14,333 [INFO ] main com.amazonaws.ml.mms.ModelServer -
algo-1-nvqf7_1 | MMS Home: /opt/conda/lib/python3.6/site-packages
algo-1-nvqf7_1 | Current directory: /
algo-1-nvqf7_1 | Temp directory: /home/model-server/tmp
algo-1-nvqf7_1 | Number of GPUs: 0
algo-1-nvqf7_1 | Number of CPUs: 8
algo-1-nvqf7_1 | Max heap size: 6972 M
algo-1-nvqf7_1 | Python executable: /opt/conda/bin/python
algo-1-nvqf7_1 | Config file: /etc/sagemaker-mms.properties
algo-1-nvqf7_1 | Inference address: http://0.0.0.0:8080
algo-1-nvqf7_1 | Management address: http://0.0.0.0:8080
algo-1-nvqf7_1 | Model Store: /.sagemaker/mms/models
algo-1-nvqf7_1 | Initial Models: ALL
algo-1-nvqf7_1 | Log dir: /logs
algo-1-nvqf7_1 | Metrics dir: /logs
algo-1-nvqf7_1 | Netty threads: 0
algo-1-nvqf7_1 | Netty client threads: 0
algo-1-nvqf7_1 | Default workers per model: 1
algo-1-nvqf7_1 | Blacklist Regex: N/A
algo-1-nvqf7_1 | Maximum Response Size: 6553500
algo-1-nvqf7_1 | Maximum Request Size: 6553500
algo-1-nvqf7_1 | Preload model: false
algo-1-nvqf7_1 | Prefer direct buffer: false
algo-1-nvqf7_1 | 2020-11-30 07:17:14,391 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-model
algo-1-nvqf7_1 | 2020-11-30 07:17:14,481 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /home/model-server/tmp/.mms.sock.9000 --handler sagemaker_pytorch_serving_container.handler_service --model-path /.sagemaker/mms/models/model --model-name model --preload-model false --tmp-dir /home/model-server/tmp
algo-1-nvqf7_1 | 2020-11-30 07:17:14,482 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.mms.sock.9000
algo-1-nvqf7_1 | 2020-11-30 07:17:14,482 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 51
algo-1-nvqf7_1 | 2020-11-30 07:17:14,482 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
algo-1-nvqf7_1 | 2020-11-30 07:17:14,483 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.6.6
algo-1-nvqf7_1 | 2020-11-30 07:17:14,483 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model model loaded.
algo-1-nvqf7_1 | 2020-11-30 07:17:14,487 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
algo-1-nvqf7_1 | 2020-11-30 07:17:14,496 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
algo-1-nvqf7_1 | 2020-11-30 07:17:14,544 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
algo-1-nvqf7_1 | 2020-11-30 07:17:14,545 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
algo-1-nvqf7_1 | Model server started.
algo-1-nvqf7_1 | 2020-11-30 07:17:14,547 [WARN ] pool-2-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
algo-1-nvqf7_1 | 2020-11-30 07:17:14,962 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - PyTorch version 1.3.1 available.
algo-1-nvqf7_1 | 2020-11-30 07:17:15,314 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Lock 140580224398952 acquired on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock
algo-1-nvqf7_1 | 2020-11-30 07:17:15,315 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpcln39mxo
algo-1-nvqf7_1 | 2020-11-30 07:17:15,344 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
algo-1-nvqf7_1 | 2020-11-30 07:17:15,349 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Downloading: 0%| | 0.00/232k [00:00<?, ?B/s]
algo-1-nvqf7_1 | 2020-11-30 07:17:15,349 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - storing https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt in cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
algo-1-nvqf7_1 | 2020-11-30 07:17:15,349 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - creating metadata file for /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
algo-1-nvqf7_1 | 2020-11-30 07:17:15,349 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Lock 140580224398952 released on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock
algo-1-nvqf7_1 | 2020-11-30 07:17:15,350 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
algo-1-nvqf7_1 | 2020-11-30 07:17:15,378 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Created tokenizer
algo-1-nvqf7_1 | 2020-11-30 07:17:15,378 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Elastic Inference enabled: True
algo-1-nvqf7_1 | 2020-11-30 07:17:15,378 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - inside model fn
algo-1-nvqf7_1 | 2020-11-30 07:17:15,379 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - /.sagemaker/mms/models/model
algo-1-nvqf7_1 | 2020-11-30 07:17:15,379 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - /.sagemaker/mms/models/model/model.pt
algo-1-nvqf7_1 | 2020-11-30 07:17:15,379 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ['model.pt', 'model.tar.gz', 'code', 'model_tn_best.pth', 'MAR-INF']
algo-1-nvqf7_1 | 2020-11-30 07:17:15,379 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Loading torch script
algo-1-nvqf7_1 | 2020-11-30 07:17:15,392 [INFO ] epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - 9000-44f1cd64 Worker disconnected. WORKER_STARTED
algo-1-nvqf7_1 | 2020-11-30 07:17:15,392 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died.
algo-1-nvqf7_1 | 2020-11-30 07:17:15,393 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9000-44f1cd64 in 1 seconds.
algo-1-nvqf7_1 | 2020-11-30 07:17:16,065 [INFO ] W-9000-model ACCESS_LOG - /172.18.0.1:45110 "GET /ping HTTP/1.1" 200 8