Prometheus: creating HTTP error alerts with the Telegraf http_response plugin


I am using Telegraf and Prometheus to monitor my local services, such as OpenHab and my Grafana instance.

The http_response plugin can produce results like the following:

http_response_http_response_code{host="master-pi",instance="192.168.2.15:9126",job="telegraf-master-pi",method="GET",result="success",result_type="success",server="http://www.grafana.local",status_code="200"}    200
http_response_http_response_code{host="master-pi",instance="192.168.2.15:9126",job="telegraf-master-pi",method="GET",result="success",result_type="success",server="http://www.grafana.local",status_code="502"}    502
http_response_http_response_code{host="master-pi",instance="192.168.2.15:9126",job="telegraf-master-pi",method="GET",result="success",result_type="success",server="http://www.thuis.local/start/index",status_code="200"} 200

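Now I want an alert that notifies me whenever the count of non-200 status codes over the last 30 minutes is higher than the count of 200 status codes.

I started simple: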

alert: service_down_external
expr: http_response_http_response_code{status_code!~"200|302"}
for: 35m
labels:
  severity: high

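This works fine in general, but it does not work for those services that I check only every 5 to 30 minutes instead of every 10 seconds (I want to reduce the load on some APIs).

So I thought, let's try it another way: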

expr: count_over_time(http_response_http_response_code{status_code!~"200|302"}[30m]) > on(job, instance, method, server) count_over_time(http_response_http_response_code{status_code=~"200|302"}[30m])

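This looked promising, but unfortunately it does not work if there are no 200/302 responses at all. In that case the query returns "no data".

So I thought, let's relate it to the overall count instead: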
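The "no data" outcome follows from how PromQL matches vectors: a comparison only produces output for groups that exist on both sides, so a server with errors but no 200/302 samples in the window is dropped. A minimal sketch of one possible workaround is shown here. It aggregates both sides to the same four labels and falls back to a count of zero when no successful samples exist. The group name is hypothetical and the rule is untested against this setup.

```yaml
groups:
  - name: http_checks_sketch        # hypothetical group name
    rules:
      - alert: service_down_external
        expr: |
          # Error samples per check over the last 30 minutes.
          sum by (job, instance, method, server) (
            count_over_time(http_response_http_response_code{status_code!~"200|302"}[30m])
          )
          >
          (
            # Successful samples per check ...
            sum by (job, instance, method, server) (
              count_over_time(http_response_http_response_code{status_code=~"200|302"}[30m])
            )
            # ... or 0 for checks that only produced error samples.
            or
            sum by (job, instance, method, server) (
              count_over_time(http_response_http_response_code{status_code!~"200|302"}[30m])
            ) * 0
          )
        labels:
          severity: high
```

Because both sides carry exactly the same label set after aggregation, no on(...) modifier is needed. The price is that labels such as status_code are no longer available in the alert.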

count_over_time(http_response_http_response_code{status_code!~"200|302"}[300m]) > on(job, instance, method, server) count_over_time(http_response_http_response_code[300m])

However, this results in:

Error executing query: found duplicate series for the match group {instance="192.168.2.15:9126", job="telegraf-master-pi", method="GET", server="http://www.grafana.local/series"} on the right hand-side of the operation: [{host="master-pi", instance="192.168.2.15:9126", job="telegraf-master-pi", method="GET", result="success", result_type="success", server="http://www.grafana.local/series", status_code="502"}, {host="master-pi", instance="192.168.2.15:9126", job="telegraf-master-pi", method="GET", result="success", result_type="success", server="http://www.grafana.local/series", status_code="200"}];many-to-many matching not allowed: matching labels must be unique on one side

The same happens when I try ignoring instead:

count_over_time(http_response_http_response_code{status_code!~"200|302"}[30m]) >ignoring(status_code) count_over_time(http_response_http_response_code[30m])

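The same error occurs.

Is there another way to get alerted when an HTTP check has returned only 5xx errors during the last 30 minutes?

Six months later, after another attempt at this problem, I finally came up with a query that gives the expected result: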
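The error appears because the right-hand side still holds one series per status_code for the same job/instance/method/server group, and neither on(...) nor ignoring(status_code) removes that duplication. It only controls which labels are compared. One way around it, sketched here under the assumption that a failure ratio is acceptable instead of a raw count comparison, is to aggregate both sides before dividing. The alert name and the 0.5 threshold are illustrative choices, not values from the question.

```yaml
groups:
  - name: http_checks_sketch          # hypothetical group name
    rules:
      - alert: http_mostly_failing    # hypothetical alert name
        expr: |
          # Error samples per check, summed over all error status codes.
          sum by (job, instance, method, server) (
            count_over_time(http_response_http_response_code{status_code!~"200|302"}[30m])
          )
          /
          # All samples per check, summed over every status code.
          sum by (job, instance, method, server) (
            count_over_time(http_response_http_response_code[30m])
          )
          > 0.5
        labels:
          severity: high
```

A check that returned only errors yields a ratio of 1 and fires. A check without any error samples yields no result on the left-hand side and stays silent.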

count_over_time(http_response_result_code{result!~"success"}[2h]) / on(job, instance, method, server, type) group_left() sum by(job, instance, method, server, type) (count_over_time(http_response_result_code[2h])) >= 0.5

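The sum by part solves the "found duplicate series for the match group" problem, because it sums over all the duplicate series (for example, all results with "response_string_mismatch" and with "success").

group_left keeps the labels of the left-hand side of the query, so I can still use the result type label in the alert. The right-hand side only contains the 5 labels listed in sum by.

In the end, the query gives the share of calls that were not successful during the last 2 hours (as a ratio, so >= 0.5 means at least half of them), which is exactly what I needed.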
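For completeness, the final query could be wrapped in an alerting rule roughly as follows. The expression is taken verbatim from above, including its label list. The group name, alert name, severity, and annotation texts are assumptions made for illustration.

```yaml
groups:
  - name: http_checks_sketch             # hypothetical group name
    rules:
      - alert: http_check_failing        # hypothetical alert name
        expr: |
          count_over_time(http_response_result_code{result!~"success"}[2h])
          / on(job, instance, method, server, type) group_left()
          sum by(job, instance, method, server, type) (
            count_over_time(http_response_result_code[2h])
          )
          >= 0.5
        labels:
          severity: high
        annotations:
          # group_left() keeps the left-hand labels, so result is available here.
          summary: "HTTP check failing for {{ $labels.server }}"
          description: "{{ $value | humanizePercentage }} of checks in the last 2h ended with result {{ $labels.result }}"
```

No for: clause is set, since the 2h range already smooths over the infrequent checks. Adding one would delay the alert further.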