TensorFlow: AKS HPA autoscaling with KEDA Prometheus metrics is unable to get metrics
We have successfully deployed the following setup on Azure (AKS) for our image-processing application:
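- An AKS cluster with 1 GPU node (it needs to scale out with incoming traffic)
- 1 pod running the TensorFlow model on the GPU node (at most 1 pod per node because of memory limits)
- Prometheus scraping the GPU utilization metrics (NVIDIA's DCGM exporter)
- A KEDA ScaledObject for the Horizontal Pod Autoscaler (HPA), in the same namespace as our deployment
- Query:
  ceil(avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace="myproject"}[2m]))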
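To make clear what we expect this query to return: avg_over_time averages every sample a series recorded inside the 2-minute window, and ceil rounds that average up to the next integer. Below is a minimal Python sketch of the same arithmetic for a single series. The sample values are made up for illustration and are not measurements from our cluster, and the function name is hypothetical.

```python
import math


def ceil_avg_over_time(samples):
    """Mimic ceil(avg_over_time(...)) for one series: mean of the window, rounded up."""
    if not samples:
        # Prometheus would simply return no series here; raising is a choice of this sketch.
        raise ValueError("no samples in the window")
    return math.ceil(sum(samples) / len(samples))


# Hypothetical GPU utilization samples (percent) from one 2-minute window.
window = [38.0, 40.5, 43.0, 41.5]
print(ceil_avg_over_time(window))  # mean is 40.75, so this prints 41
```

A value like this is what the HPA would compare against the target average value of the ScaledObject.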
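With the http_request_total metric we are able to get metrics from all nodes, so we suspect the error is in the DCGM part, which the GPU metrics depend on. We have installed DCGM in the dcgm-exporter namespace and configured it in the Prometheus additionalScrapeConfigs.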
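One check that might narrow this down is to run the query directly against the Prometheus HTTP API (the /api/v1/query endpoint) and compare the labels on the DCGM series with the namespace filter used in the query. The sketch below is only a rough outline under stated assumptions: the base URL, the helper names, and the sample response are all hypothetical, and the sample labels are invented to show the response shape rather than taken from our cluster.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def run_query(base_url, promql, timeout=10):
    """Send the query and decode the JSON body (needs network access to Prometheus)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)


def series_labels(response):
    """Return the label set of every series in an instant-query response."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    return [item["metric"] for item in response["data"]["result"]]


# Invented response in the documented instant-query format, used here so the
# sketch runs offline. A real one would come from run_query(...).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL", "gpu": "0",
                           "namespace": "dcgm-exporter"},
                "value": [1619091057, "41"],
            }
        ],
    },
}

for labels in series_labels(sample):
    print(labels.get("namespace"), labels)
```

Against a live server, calling run_query once with the bare metric name DCGM_FI_DEV_GPU_UTIL and once with the {namespace="myproject"} filter would show whether the filter is what empties the result.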
Please let me know if you need any additional information that could help.
Thanks in advance!
kubectl describe hpa keda-hpa-prometheus-scaled-object -n myproject
Name:               keda-hpa-prometheus-scaled-object
Namespace:          myproject
Labels:             app.kubernetes.io/managed-by=keda-operator
                    app.kubernetes.io/name=keda-hpa-prometheus-scaled-object
                    app.kubernetes.io/part-of=prometheus-scaled-object
                    app.kubernetes.io/version=2.0.0
                    deploymentName=myproject-deployment
                    scaledObjectName=prometheus-scaled-object
Annotations:        <none>
CreationTimestamp:  Thu, 22 Apr 2021 13:30:57 +0200
Reference:          Deployment/myproject-deployment
Metrics:            ( current / target )
  "prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL" (target average value):  41 / 60
Min replicas:       1
Max replicas:       2
Deployment pods:    2 current / 2 desired
Conditions:
  Type            Status  Reason                   Message
  ----            ------  ------                   -------
  AbleToScale     True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetExternalMetric  the HPA was unable to compute the replica count: unable to get external metric myproject/prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL/&LabelSelector{MatchLabels:map[string]string{scaledObjectName: prometheus-scaled-object,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for prometheus-http---XX-X-XXX-XXX-9090--dcgm_fi_dev_gpu_util
  ScalingLimited  False   DesiredWithinRange       the desired count is within the acceptable range
Events:
  Type     Reason                        Age                    From                       Message
  ----     ------                        ----                   ----                       -------
  Warning  FailedComputeMetricsReplicas  45m (x12 over 47m)     horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL external metric: unable to get external metric myproject/prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL/&LabelSelector{MatchLabels:map[string]string{scaledObjectName: prometheus-scaled-object,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for prometheus-http---XX-X-XXX-XXX-9090--dcgm_fi_dev_gpu_util
  Warning  FailedGetExternalMetric       2m55s (x178 over 47m)  horizontal-pod-autoscaler  unable to get external metric myproject/prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL/&LabelSelector{MatchLabels:map[string]string{scaledObjectName: prometheus-scaled-object,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for prometheus-http---XX-X-XXX-XXX-9090--dcgm_fi_dev_gpu_util
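Since the HPA reports that the external metrics API returned no matching metric, it may help to ask that API for the metric directly, bypassing the HPA. The sketch below only assembles the request path for the external.metrics.k8s.io/v1beta1 API and prints a kubectl get --raw command for it. The helper name is hypothetical, and the metric name and label selector are copied from the error message above (including the redacted XX-X-XXX-XXX address), so they would need to be replaced with the real values.

```python
import urllib.parse


def external_metric_path(namespace, metric_name, selector):
    """Build the external.metrics.k8s.io/v1beta1 path for one metric in one namespace."""
    label_selector = ",".join(f"{key}={value}" for key, value in sorted(selector.items()))
    query = urllib.parse.urlencode({"labelSelector": label_selector})
    return (
        "/apis/external.metrics.k8s.io/v1beta1"
        f"/namespaces/{namespace}/{metric_name}?{query}"
    )


path = external_metric_path(
    "myproject",
    # Lower-cased metric name exactly as it appears at the end of the HPA error.
    "prometheus-http---XX-X-XXX-XXX-9090--dcgm_fi_dev_gpu_util",
    {"scaledObjectName": "prometheus-scaled-object"},
)
print(f"kubectl get --raw '{path}'")
```

If this request also comes back empty, the problem would seem to sit between KEDA and Prometheus rather than in the HPA itself.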