Google Cloud Platform: Unable to deploy a trained model to Google Cloud AI Platform with a custom prediction routine: model requires more memory than allowed

Tags: google-cloud-platform, pytorch, google-cloud-ml, gcp-ai-platform-training

I am trying to deploy a pre-trained PyTorch model to AI Platform with a custom prediction routine. After following the instructions described there, the deployment fails with the following error:

ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to have error, please contact Cloud ML.
The contents of the model folder total 83.89 MB, which is below the 250 MB limit described in the documentation. The only files in the folder are the model's checkpoint file (.pth) and the tarball required for the custom prediction routine.
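For reference, the total size of the model directory can be verified with gsutil (using the bucket path from the create command below):

gsutil du -s -h gs://rcg-models/pytorch_pose_estimation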

Command used to create the model version:

gcloud beta ai-platform versions create pose_pytorch --model pose --runtime-version 1.15 --python-version 3.5 --origin gs://rcg-models/pytorch_pose_estimation --package-uris gs://rcg-models/pytorch_pose_estimation/my_custom_code-0.1.tar.gz --prediction-class predictor.MyPredictor
Changing the runtime version to 1.14 results in the same error. I have also tried changing the --machine-type argument to mls1-c4-m2, as Parth suggested, but I still get the same error.

The setup.py file that generates my_custom_code-0.1.tar.gz looks like this:

setup(
    name='my_custom_code',
    version='0.1',
    scripts=['predictor.py'],
    install_requires=["opencv-python", "torch"]
)
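For context, a setup.py like this is typically packaged and staged roughly as in the following sketch (the exact commands are not shown in the question; the bucket path matches the one used in the create command above):

python setup.py sdist --formats=gztar
gsutil cp dist/my_custom_code-0.1.tar.gz gs://rcg-models/pytorch_pose_estimation/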
Relevant code snippet from the predictor:

    def __init__(self, model):
        """Stores artifacts for prediction. Only initialized via `from_path`.
        """
        self._model = model
        self._client = storage.Client()

    @classmethod
    def from_path(cls, model_dir):
        """Creates an instance of MyPredictor using the given path.

        This loads artifacts that have been copied from your model directory in
        Cloud Storage. MyPredictor uses them during prediction.

        Args:
            model_dir: The local directory that contains the trained Keras
                model and the pickled preprocessor instance. These are copied
                from the Cloud Storage model directory you provide when you
                deploy a version resource.

        Returns:
            An instance of `MyPredictor`.
        """

        net = PoseEstimationWithMobileNet()
        checkpoint_path = os.path.join(model_dir, "checkpoint_iter_370000.pth")
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        load_state(net, checkpoint)

        return cls(net)
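As an aside, a custom prediction routine class on AI Platform also needs a predict(self, instances, **kwargs) method, which the snippet above omits. A minimal hypothetical sketch (the real input decoding and keypoint post-processing for this pose model are not shown in the question):

    def predict(self, instances, **kwargs):
        """Hypothetical sketch of the required predict() method."""
        results = []
        for instance in instances:
            # Assumes each instance is a nested list convertible to a tensor.
            tensor = torch.tensor(instance, dtype=torch.float32)
            with torch.no_grad():
                stages_output = self._model(tensor)
            # Convert tensors to plain lists so the response is JSON-serializable.
            results.append([stage.cpu().numpy().tolist() for stage in stages_output])
        return results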
Additionally, I enabled logging for the model in AI Platform and got the following output:

2019-12-17T09:28:06.208537Z OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k 
2019-12-17T09:28:13.474653Z WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/google/cloud/ml/prediction/frameworks/tf_prediction_lib.py:48: The name tf.saved_model.tag_constants.SERVING is deprecated. Please use tf.saved_model.SERVING instead. 
2019-12-17T09:28:13.474680Z {"textPayload":"","insertId":"5df89fad00073e383ced472a","resource":{"type":"cloudml_model_version","labels":{"project_id":"rcg-shopper","region":"","version_id":"lightweight_pose_pytorch","model_id":"pose"}},"timestamp":"2019-12-17T09:28:13.474680Z","logName":"projects/rcg-shopper/logs/ml.googleapis… 
2019-12-17T09:28:13.474807Z WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/google/cloud/ml/prediction/frameworks/tf_prediction_lib.py:50: The name tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY is deprecated. Please use tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY instead. 
2019-12-17T09:28:13.474829Z {"textPayload":"","insertId":"5df89fad00073ecd4836d6aa","resource":{"type":"cloudml_model_version","labels":{"project_id":"rcg-shopper","region":"","version_id":"lightweight_pose_pytorch","model_id":"pose"}},"timestamp":"2019-12-17T09:28:13.474829Z","logName":"projects/rcg-shopper/logs/ml.googleapis… 
2019-12-17T09:28:13.474918Z WARNING:tensorflow: 
2019-12-17T09:28:13.474927Z The TensorFlow contrib module will not be included in TensorFlow 2.0. 
2019-12-17T09:28:13.474934Z For more information, please see: 
2019-12-17T09:28:13.474941Z   * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md 
2019-12-17T09:28:13.474951Z   * https://github.com/tensorflow/addons 
2019-12-17T09:28:13.474958Z   * https://github.com/tensorflow/io (for I/O related ops) 
2019-12-17T09:28:13.474964Z If you depend on functionality not listed there, please file an issue. 
2019-12-17T09:28:13.474999Z {"textPayload":"","insertId":"5df89fad00073f778735d7c3","resource":{"type":"cloudml_model_version","labels":{"version_id":"lightweight_pose_pytorch","model_id":"pose","project_id":"rcg-shopper","region":""}},"timestamp":"2019-12-17T09:28:13.474999Z","logName":"projects/rcg-shopper/logs/ml.googleapis… 
2019-12-17T09:28:15.283483Z ERROR:root:Failed to import GA GRPC module. This is OK if the runtime version is 1.x 
2019-12-17T09:28:16.890923Z Copying gs://cml-489210249453-1560169483791188/models/pose/lightweight_pose_pytorch/15316451609316207868/user_code/my_custom_code-0.1.tar.gz... 
2019-12-17T09:28:16.891150Z / [0 files][    0.0 B/  8.4 KiB]                                                 
2019-12-17T09:28:17.007684Z / [1 files][  8.4 KiB/  8.4 KiB]                                                 
2019-12-17T09:28:17.009154Z Operation completed over 1 objects/8.4 KiB.                                       
2019-12-17T09:28:18.953923Z Processing /tmp/custom_code/my_custom_code-0.1.tar.gz 
2019-12-17T09:28:19.808897Z Collecting opencv-python 
2019-12-17T09:28:19.868579Z   Downloading https://files.pythonhosted.org/packages/d8/38/60de02a4c9013b14478a3f681a62e003c7489d207160a4d7df8705a682e7/opencv_python-4.1.2.30-cp37-cp37m-manylinux1_x86_64.whl (28.3MB) 
2019-12-17T09:28:21.537989Z Collecting torch 
2019-12-17T09:28:21.552871Z   Downloading https://files.pythonhosted.org/packages/f9/34/2107f342d4493b7107a600ee16005b2870b5a0a5a165bdf5c5e7168a16a6/torch-1.3.1-cp37-cp37m-manylinux1_x86_64.whl (734.6MB) 
2019-12-17T09:28:52.401619Z Collecting numpy>=1.14.5 
2019-12-17T09:28:52.412714Z   Downloading https://files.pythonhosted.org/packages/9b/af/4fc72f9d38e43b092e91e5b8cb9956d25b2e3ff8c75aed95df5569e4734e/numpy-1.17.4-cp37-cp37m-manylinux1_x86_64.whl (20.0MB) 
2019-12-17T09:28:53.550662Z Building wheels for collected packages: my-custom-code 
2019-12-17T09:28:53.550689Z   Building wheel for my-custom-code (setup.py): started 
2019-12-17T09:28:54.212558Z   Building wheel for my-custom-code (setup.py): finished with status 'done' 
2019-12-17T09:28:54.215365Z   Created wheel for my-custom-code: filename=my_custom_code-0.1-cp37-none-any.whl size=7791 sha256=fd9ecd472a6a24335fd24abe930a4e7d909e04bdc4cf770989143d92e7023f77 
2019-12-17T09:28:54.215482Z   Stored in directory: /tmp/pip-ephem-wheel-cache-i7sb0bmb/wheels/0d/6e/ba/bbee16521304fc5b017fa014665b9cae28da7943275a3e4b89 
2019-12-17T09:28:54.222017Z Successfully built my-custom-code 
2019-12-17T09:28:54.650218Z Installing collected packages: numpy, opencv-python, torch, my-custom-code 


I was able to get this to work by tweaking the setup.py. Basically, install_requires tries to fetch the PyPI-hosted torch package, which is a huge GPU wheel and exceeds the deployment quota. The setup.py below injects an install command that fetches a CPU build of torch from the official PyTorch index:

from setuptools import setup, find_packages
from setuptools.command.install import install as _install

INSTALL_REQUIRES = ['pillow']
CUSTOM_INSTALL_COMMANDS = [
    # Install torch here.
    [
        'python-default', '-m', 'pip', 'install', '--target=/tmp/custom_lib',
        '-b', '/tmp/pip_builds', 'torch==1.4.0+cpu', 'torchvision==0.5.0+cpu',
        '-f', 'https://download.pytorch.org/whl/torch_stable.html'
    ],
]


class Install(_install):
    def run(self):
        import sys
        if sys.platform == 'linux':
            import subprocess
            import logging
            for command in CUSTOM_INSTALL_COMMANDS:
                logging.info('Custom command: ' + ' '.join(command))
                result = subprocess.run(
                    command, check=True, stdout=subprocess.PIPE
                )
                logging.info(result.stdout.decode('utf-8', 'ignore'))
        _install.run(self)


setup(
    name='predictor',
    version='0.1',
    packages=find_packages(),
    install_requires=INSTALL_REQUIRES,
    cmdclass={'install': Install},
)
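One caveat worth noting: because the injected pip command installs torch with --target=/tmp/custom_lib, that directory is not on the default import path, so the predictor code would presumably need something like the following before importing torch (a sketch, assuming the --target path above is kept):

import sys

# Assumption: must match the --target path used in CUSTOM_INSTALL_COMMANDS above.
sys.path.insert(0, '/tmp/custom_lib')

import torch  # now resolvable from /tmp/custom_lib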

After some hours of good old trial and error, I came to the same conclusion as @kyamagu: "install_requires tries to fetch the PyPI-hosted torch package, which is a huge GPU wheel and exceeds the deployment quota."

However, his solution did not work for me. So, after several more hours of trial and error (due to missing and incorrect documentation), I came up with the following solution:

We need to get the CPU build wheels of PyTorch, which are around 100 MB, instead of the ~700 MB GPU build wheels hosted on PyPI by default. You can find them on the PyTorch wheel index at download.pytorch.org.
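For illustration, the wheel can be downloaded and staged in a bucket roughly like this (a sketch; the exact wheel URL is inferred from the PyTorch index referenced elsewhere in this thread, and gs://bucket is a placeholder):

curl -L -o torch-1.3.0+cpu-cp37-cp37m-linux_x86_64.whl \
    "https://download.pytorch.org/whl/cpu/torch-1.3.0%2Bcpu-cp37-cp37m-linux_x86_64.whl"
gsutil cp torch-1.3.0+cpu-cp37-cp37m-linux_x86_64.whl gs://bucket/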

Next, we need to put them in a GCS bucket and pass the path as part of --package-uris, as follows:

setup(
    name='my_custom_code',
    version='0.1',
    scripts=['predictor.py'],
    install_requires=["opencv-python", "torch"]
)
gcloud beta ai-platform versions create v17 \
    --model=newest \
    --origin=gs://bucket \
    --runtime-version=1.15 \
    --python-version=3.7 \
    --package-uris=gs://bucket/predictor-0.1.tar.gz,gs://bucket/torch-1.3.0+cpu-cp37-cp37m-linux_x86_64.whl \
    --prediction-class=predictor.MyPredictor \
    --machine-type=mls1-c4-m4
Also, note the order of the package URIs: the predictor package must come first, and there must be no space after the comma.


Hope this helps. Cheers!

This is a common problem, and we understand it is a pain point. Please do the following:

  • torchvision has torch as a dependency, and by default it pulls torch from PyPI.
  • When you deploy the model, even if you point it at a custom torchvision package on AI Platform, it will still do this, because torchvision, as built by the PyTorch team, is configured to use torch as a dependency. This torch dependency from PyPI is a ~720 MB file because it includes the GPU units.

  • To solve #1, you need to build torchvision from source and tell torchvision where you want to get torch from; you need to point it at the torch website, because that package is smaller. Rebuild the torchvision binary using Python's direct-reference feature.
  • In torchvision, update setup.py to use the direct-reference feature:

    requirements = [
         #'numpy',
         #'six',
         #pytorch_dep,
         'torch @ https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl'
    ]
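    After patching the requirements this way, the wheel would presumably be rebuilt from a torchvision source checkout with something like the following (a sketch; building torchvision from source also requires a working torch installation and a compiler toolchain):

    git clone --branch v0.5.0 https://github.com/pytorch/vision.git
    cd vision
    python setup.py bdist_wheel
    # produces a torchvision wheel under dist/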
    
    *I have done this for you*, so I built 3 wheel files you can use:

    gs://dpe-sandbox/torchvision-0.4.0-cp37-cp37m-linux_x86_64.whl (torch 1.2.0, vision 0.4.0)
    gs://dpe-sandbox/torchvision-0.4.2-cp37-cp37m-linux_x86_64.whl (torch 1.2.0, vision 0.4.2)
    gs://dpe-sandbox/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl (torch 1.4.0, vision 0.5.0)

    These torchvision packages will get torch from the torch site instead of PyPI. :)

  • Update your model's setup.py when deploying the model to AI Platform so that it does not include torch or torchvision, for example:
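    A minimal sketch, assuming everything else from the original setup.py in the question stays the same:

    from setuptools import setup

    setup(
        name='my_custom_code',
        version='0.1',
        scripts=['predictor.py'],
        # torch and torchvision are intentionally omitted here; they are
        # supplied through --package-uris instead of being pulled from PyPI.
        install_requires=['opencv-python'],
    )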

  • Redeploy the model as follows:
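    A sketch of what this redeploy command would presumably look like, reusing the flags and the prebuilt torchvision wheel mentioned above (the model name, bucket, and custom-code tarball here are placeholders):

    PYTORCH_VISION_PACKAGE=gs://dpe-sandbox/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl

    gcloud beta ai-platform versions create v1 \
        --model=pose \
        --origin=gs://bucket/model_dir \
        --runtime-version=1.15 \
        --python-version=3.7 \
        --machine-type=mls1-c4-m4 \
        --package-uris=gs://bucket/my_custom_code-0.1.tar.gz,${PYTORCH_VISION_PACKAGE} \
        --prediction-class=predictor.MyPredictor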


  • You can change the PYTORCH_VISION_PACKAGE to any of the options I mentioned in #2.

Comments:

  • Can you try a) --machine-type=mls-c4-m2, b) --runtime-version 1.14, c) a) and b) together?
  • I updated my question to include your suggestion.
  • Sorry for the typo, the machine type should be mls1-c4-m2. Please try all of the cases.
  • Still getting the same memory error.
  • Does local prediction work when you test the model? I will look at the details of the PyTorch model and see if I can reproduce this.
  • I have the same problem, but when I change setup.py according to your setup, I get the following: ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: "There was a problem processing the user code: predictor.MyPredictor cannot be found. Please make sure (1) prediction_class is the fully qualified function name." I also used a bigger machine: mls1-c4-m4.
  • Custom prediction will support the other Google Compute Engine instance types we offer in a few weeks.