Autoscaling AKS with KEDA Prometheus metrics: HPA unable to get metrics

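We have successfully deployed the following setup on Azure (AKS) for our image processing application:

  • AKS cluster with 1 GPU node (needs to scale based on incoming traffic)
  • 1 pod running a Tensorflow model on the GPU node (at most 1 pod per node, due to memory limits)
  • Prometheus scraping GPU utilization metrics (NVIDIA's DCGM exporter)
  • KEDA ScaledObject for the Horizontal Pod Autoscaler (HPA), in the same namespace as our deployment
  • Query:
    ceil(avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace="myproject"}[2m]))
The deployment is based on: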

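With this setup, the pods are autoscaled horizontally from 1 to 2 based on the DCGM GPU utilization metric. The cluster autoscaler follows and scales the number of GPU nodes in the cluster from 1 to 2. The newly required pod is successfully scheduled on the new node, and the average GPU utilization goes down. However, once the new node has been added and the second pod has been assigned to it, the KEDA HPA object can no longer fetch the external GPU metric. As a result the HPA stops working and cannot scale the pods back in, so the number of pods (and nodes) stays at 2.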
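For context on the scale-in step, the Kubernetes documentation gives the HPA's basic rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with no change while the ratio stays within a tolerance (0.1 by default). The sketch below is a simplified Python rendering of that rule only. The function name is hypothetical, and it ignores min/max replicas, stabilization windows, and the handling of missing metrics, so it is not a model of the full controller.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Simplified HPA rule: scale the replica count by current/target."""
    ratio = current_value / target_value
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(2, 41, 60))  # 2
print(desired_replicas(2, 30, 60))  # 1
```

If the 41 / 60 reading in the describe output further down were current, this simplified rule would still ask for 2 replicas. Under the same assumptions, the average would have to fall to 30 or below before 1 replica is enough.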

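All pods and services on both nodes look healthy. The DCGM exporter is also running on the new node, so it should be possible to get metrics from that node.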

Does anyone have experience with this, or know how to debug it? The output I get when describing the HPA is included below.

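If we use a metric other than DCGM, for example http_request_total, we are able to get the metric from all nodes, so we expect the fault to be in the DCGM part, which is what provides the GPU metrics. We have installed DCGM in the dcgm-exporter namespace and configured it in the Prometheus additionalScrapeConfigs.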
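One thing that may be worth checking is how many series the query returns once the second node is up. As far as I know, the KEDA Prometheus scaler documentation asks for a query that returns a single-element vector or scalar, and an unaggregated query over DCGM_FI_DEV_GPU_UTIL could return one series per GPU. This is a hypothesis, not a confirmed diagnosis. The Python sketch below counts the series in the JSON body that Prometheus returns from its /api/v1/query endpoint. The helper name and the sample response (including its label values) are made up for illustration.

```python
import json


def count_series(response_text: str) -> int:
    """Count the series in the JSON body of a Prometheus instant query."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] == "scalar":
        return 1  # a scalar is always a single value
    if data["resultType"] == "vector":
        return len(data["result"])  # one entry per distinct label set
    raise ValueError(f"unexpected result type: {data['resultType']}")


# Illustrative response only (label values are made up): with two GPU nodes,
# an unaggregated query yields one series per node.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"Hostname": "gpu-node-1"}, "value": [1619092257, "41"]},
            {"metric": {"Hostname": "gpu-node-2"}, "value": [1619092257, "12"]},
        ],
    },
})

print(count_series(SAMPLE))  # 2
```

If the real response contains more than one series, wrapping the query in an aggregation such as avg(...) would be one way to collapse it to a single value. Whether that matches the intended scaling behavior is a separate decision.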

Please also let me know if you need any additional information that would help. Thanks in advance!

 kubectl describe hpa keda-hpa-prometheus-scaled-object -n myproject

Name:                                                                                  keda-hpa-prometheus-scaled-object
Namespace:                                                                             myproject
Labels:                                                                                app.kubernetes.io/managed-by=keda-operator
                                                                                       app.kubernetes.io/name=keda-hpa-prometheus-scaled-object
                                                                                       app.kubernetes.io/part-of=prometheus-scaled-object
                                                                                       app.kubernetes.io/version=2.0.0
                                                                                       deploymentName=myproject-deployment
                                                                                       scaledObjectName=prometheus-scaled-object
Annotations:                                                                           <none>
CreationTimestamp:                                                                     Thu, 22 Apr 2021 13:30:57 +0200
Reference:                                                                             Deployment/myproject-deployment
Metrics:                                                                               ( current / target )
  "prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL" (target average value):  41 / 60
Min replicas:                                                                          1
Max replicas:                                                                          2
Deployment pods:                                                                       2 current / 2 desired
Conditions:
  Type            Status  Reason                   Message
  ----            ------  ------                   -------
  AbleToScale     True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetExternalMetric  the HPA was unable to compute the replica count: unable to get external metric myproject/prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL/&LabelSelector{MatchLabels:map[string]string{scaledObjectName: prometheus-scaled-object,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for prometheus-http---XX-X-XXX-XXX-9090--dcgm_fi_dev_gpu_util
  ScalingLimited  False   DesiredWithinRange       the desired count is within the acceptable range
Events:
  Type     Reason                        Age                    From                       Message
  ----     ------                        ----                   ----                       -------
  Warning  FailedComputeMetricsReplicas  45m (x12 over 47m)     horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL external metric: unable to get external metric myproject/prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL/&LabelSelector{MatchLabels:map[string]string{scaledObjectName: prometheus-scaled-object,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for prometheus-http---XX-X-XXX-XXX-9090--dcgm_fi_dev_gpu_util
  Warning  FailedGetExternalMetric       2m55s (x178 over 47m)  horizontal-pod-autoscaler  unable to get external metric myproject/prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL/&LabelSelector{MatchLabels:map[string]string{scaledObjectName: prometheus-scaled-object,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for prometheus-http---XX-X-XXX-XXX-9090--dcgm_fi_dev_gpu_util