Defining shared Prometheus alerts with different alert thresholds per service

I have defined some alerts with expressions like the following:

sum(rate(some_error_metric[1m])) BY (namespace,application) > 10
sum(rate(some_other_error_metric[1m])) BY (namespace,application) > 10
...
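Currently, the above alerts fire when any of our applications emits these metrics at a rate above 10 per minute.

I would like to specify a different threshold for each application instead of hard-coding the threshold of 10,

e.g. application_1 should alert at a rate of 10 per minute, application_2 should alert at a rate of 20 per minute, and so on.

Is it possible to do this without duplicating the alerts for each application?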

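This Stack Overflow question suggests that recording rules may make what I want possible, but following the pattern suggested in the only answer to that question produces recording rules that Prometheus does not seem able to parse: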

  - record: application_1_warning_threshold
    expr: warning_threshold{application="application_1"} 10
  - record: application_2_warning_threshold
    expr: warning_threshold{application="application_2"} 20
  ...
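
For reference, the expr of a recording rule has to be a complete PromQL expression, so a selector followed by a bare number will not parse. A sketch of a form that should parse, assuming the static label goes in the rule's labels field and the constant goes in expr (the group name below is made up for illustration):

```yaml
groups:
- name: thresholds.rules        # hypothetical group name
  rules:
  # One series per application, all under the same metric name.
  - record: warning_threshold
    labels:
      application: application_1
    expr: 10
  - record: warning_threshold
    labels:
      application: application_2
    expr: 20
```

Using a single metric name with an application label, rather than one metric name per application, is what lets an alert expression join against the threshold on that label.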

Here is my configuration for the TasksMissing alert, with a different threshold per job:

groups:
- name: availability.rules
  rules:

  # Expected number of tasks per job and environment.
  - record: job_env:up:count
    expr: count(up) without (instance)

  # Actually up and running tasks per job and environment.
  - record: job_env:up:sum
    expr: sum(up) without (instance)

  # Ratio of up and running to expected tasks per job and environment.
  - record: job_env:up:ratio
    expr: job_env:up:sum / job_env:up:count

  # Global warning and critical availability ratio thresholds.
  - record: job:up:ratio_warning_threshold
    expr: 0.7
  - record: job:up:ratio_critical_threshold
    expr: 0.5


  # Job-specific warning and critical availability ratio thresholds.

  # Always alert if one Prometheus instance is down.
  - record: job:up:ratio_critical_threshold
    labels:
      job: prometheus
    expr: 0.99

  # Never alert for some-batch-job instances down:
  - record: job:up:ratio_warning_threshold
    labels:
      job: some-batch-job
    expr: 0
  - record: job:up:ratio_critical_threshold
    labels:
      job: some-batch-job
    expr: 0


  # TasksMissing is fired when a certain percentage of tasks belonging to a job are down. Namely:
  #
  #     job_env:up:ratio < job:up:ratio_(warning|critical)_threshold
  #
  # with a job-specific warning/critical threshold when defined, or the global default otherwise.

  - alert: TasksMissing
    expr: |
      # Default warning threshold is < 70%
        job_env:up:ratio
      < on(job) group_left()
        (
            job:up:ratio_warning_threshold
          or on(job)
              count by(job) (job_env:up:ratio) * 0
            + on() group_left()
              job:up:ratio_warning_threshold{job=""}
        )
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Tasks missing for {{ $labels.job }} in {{ $labels.env }}
      description:
       '...'

  - alert: TasksMissing
    expr: |
      # Default critical threshold is < 50%
        job_env:up:ratio
      < on(job) group_left()
        (
            job:up:ratio_critical_threshold
          or on(job)
              count by(job) (job_env:up:ratio) * 0
            + on() group_left()
              job:up:ratio_critical_threshold{job=""}
        )
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Tasks missing for {{ $labels.job }} in {{ $labels.env }}
      description:
       '...'
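
The same pattern could probably be carried over to the error-rate alerts from the question. The following is only a sketch under assumptions: the group name, the alert name and the application:error_rate:threshold metric name are invented here, and the threshold values are simply the ones mentioned in the question. Note that rate() returns a per-second value, so the thresholds are compared against errors per second.

```yaml
groups:
- name: error_rate.rules                     # hypothetical group name
  rules:

  # Default threshold: no application label.
  - record: application:error_rate:threshold # hypothetical metric name
    expr: 10

  # Application-specific override.
  - record: application:error_rate:threshold
    labels:
      application: application_2
    expr: 20

  - alert: HighErrorRate                     # hypothetical alert name
    expr: |
      # Use the application-specific threshold when defined, else the default.
        sum by (namespace, application) (rate(some_error_metric[1m]))
      > on(application) group_left()
        (
            application:error_rate:threshold
          or on(application)
              count by (application) (some_error_metric) * 0
            + on() group_left()
              application:error_rate:threshold{application=""}
        )
    for: 2m
    labels:
      severity: warning
```

The right-hand side of the or builds one zero-valued series per application and adds the default threshold to it, so every application without an override still has a threshold to be compared against.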

I had the same problem:

This is application_10000, application_10001, application_10