Prometheus Alertmanager not triggering any alerts

Tags: prometheus, prometheus-alertmanager

Prometheus is firing the alerts, but nothing is being sent to Slack. Alertmanager shows no alerts. I am attaching the configuration files for Alertmanager and the Prometheus rules.

Need some immediate help, as this is a production issue. Prometheus rules:

apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: prometheus-rules-conf
  namespace: monitoring
data:
  kubernetes_alerts.yml: |
    groups:
      - name: kubernetes_alerts
        rules:
        - alert: DeploymentGenerationOff
          expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
          for: 5m
          labels:
            severity: warning
          annotations:
            description: Deployment generation does not match expected generation {{ $labels.namespace }}/{{ $labels.deployment }}
            summary: Deployment is outdated
        - alert: DeploymentReplicasNotUpdated
          expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
            or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
            unless (kube_deployment_spec_paused == 1)
          for: 5m
          labels:
            severity: warning
          annotations:
            description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }}
            summary: Deployment replicas are outdated
        - alert: PodzFrequentlyRestarting
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            description: Pod {{ $labels.namespace }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour
            summary: Pod is restarting frequently
        - alert: KubeNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 1h
          labels:
            severity: warning
          annotations:
            description: The Kubelet on {{ $labels.node }} has not checked in with the API,
              or has set itself to NotReady, for more than an hour
            summary: Node status is NotReady
        - alert: KubeManyNodezNotReady
          expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0)
            > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} ==
            0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
          for: 1m
          labels:
            severity: critical
          annotations:
            description: '{{ $value }}% of Kubernetes nodes are not ready'
        - alert: APIHighLatency
          expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
          for: 10m
          labels:
            severity: critical
          annotations:
            description: the API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}
        - alert: APIServerErrorsHigh
          expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
          for: 10m
          labels:
            severity: critical
          annotations:
            description: API server returns errors for {{ $value }}% of requests
        - alert: KubernetesAPIServerDown
          expr: up{job="kubernetes-apiservers"} == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: Apiserver {{ $labels.instance }} is down!
        - alert: KubernetesAPIServersGone
          expr: absent(up{job="kubernetes-apiservers"})
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: No Kubernetes apiservers are reporting!
            description: Werner Heisenberg says - OMG Where are my apiserverz?
  prometheus_alerts.yml: |
    groups:
    - name: prometheus_alerts
      rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Reloading Prometheus configuration has failed on {{$labels.instance}}.
      - alert: PrometheusNotConnectedToAlertmanagers
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          description: Prometheus {{ $labels.instance}} is not connected to any Alertmanagers
  node_alerts.yml: |
    groups:
    - name: node_alerts
      rules:
      - alert: HighNodeCPU
        expr: instance:node_cpu:avg_rate5m > 80
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: High Node CPU of {{ humanize $value}}% for 1 hour
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.
      - alert: KubernetesServiceDown
        expr: up{job="kubernetes-service-endpoints"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Pod {{ $labels.instance }} is down!
      - alert: KubernetesServicesGone
        expr: absent(up{job="kubernetes-service-endpoints"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No Kubernetes services are reporting!
          description: Werner Heisenberg says - OMG Where are my servicez?
      - alert: CriticalServiceDown
        expr: node_systemd_unit_state{state="active"} != 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Service {{ $labels.name }} failed to start.
          description: Service {{ $labels.instance }} failed to (re)start service {{ $labels.name }}.
  proxy_alert.yml: |
    groups:
    - name: proxy_alert
      rules:
      - alert: Proxy_Down
        expr: probe_success{instance="http://ip",job="blackbox"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Proxy Server {{ $labels.instance }} is down!
  kubernetes_rules.yml: |
    groups:
      - name: kubernetes_rules
        rules:
        - record: apiserver_latency_seconds:quantile
          expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
          labels:
            quantile: "0.99"
        - record: apiserver_latency_seconds:quantile
          expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
          labels:
            quantile: "0.9"
        - record: apiserver_latency_seconds:quantile
          expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
          labels:
            quantile: "0.5"


Even though the endpoints show up in Prometheus, the alerts are still not firing.
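One way to narrow this down is to ask Alertmanager directly which alerts it has received, independently of what the Prometheus UI shows. Assuming Alertmanager is reachable on its default port 9093:

    curl http://localhost:9093/api/v2/alerts

If localhost:9090/alerts shows firing alerts but this returns an empty list, the alerts are not leaving Prometheus (check the alerting: section of prometheus.yml); if they show up here but not in Slack, the problem is in the Alertmanager route/receiver configuration.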

The problem: you have set up alerts in Prometheus, but they never trigger an event. Here are some rules of thumb I have collected for verifying that your alert setup is correct, up to date, and working from the Prometheus dashboard (a minimal receiver sketch follows the list):

  • Make sure the updated code is actually deployed on the EC2 machine (take a quick look at the source)
  • Verify that your yml files are written with correct indentation
  • Make sure your new alert file is targeted by the rule_files key in prometheus.yml, with the correct in-container path
  • Make sure the Alertmanager service was restarted in Docker and the Prometheus scraper reloaded (verify this in the logs); note that the /-/reload endpoint only works if Prometheus was started with --web.enable-lifecycle
  • (for that, use:)

    docker restart <alert-manager-service-name>
    curl -X POST localhost:9090/-/reload

  • Find your new alert on localhost:9090/alerts and verify that its condition expression is valid and should trigger an event (event -> route -> receiver)
  • The receiver should connect to the third-party service (Slack / PagerDuty / Jira, etc.) with a correct, valid authentication token
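Since the rules above only make alerts fire inside Prometheus, delivery to Slack has to be configured on the Alertmanager side. A minimal alertmanager.yml sketch, assuming a hypothetical Slack incoming-webhook URL and channel (replace both with your own values):

    route:
      receiver: slack-notifications
    receivers:
      - name: slack-notifications
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/T000/B000/XXXX'  # hypothetical webhook URL
            channel: '#alerts'                                          # hypothetical channel
            send_resolved: true

If this receiver block is missing, or the webhook token is invalid, Prometheus will show the alerts as firing while nothing ever reaches Slack. You can validate the file with amtool check-config alertmanager.yml before reloading.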


  • Are these alerts firing in Prometheus but just not being sent to Slack? Or are the alerts simply not firing at all? I don't see anything in your Prometheus config that would forward these alerts to Slack.
  • Hey, the alerts are being triggered, but the custom configuration I added is not reflected in Prometheus. I need help with the proxy_alert.yml: | groups: section of the YAML file attached above. The alerting and rule_files sections from prometheus.yml:
    alerting:
      alertmanagers:
      - kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_name]
          regex: alertmanager
          action: keep
        - source_labels: [__meta_kubernetes_namespace]
          regex: monitoring
          action: keep
        - source_labels: [__meta_kubernetes_pod_container_port_number]
          action: keep
          regex: 9093
    rule_files:
      - "/var/prometheus/rules/*_rules.yml"
      - "/var/prometheus/rules/*_alerts.yml"

    docker restart <alert-manager-service-name>
    curl -X POST localhost:9090/-/reload
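One detail worth checking against the rule_files globs above: the ConfigMap key proxy_alert.yml (singular "alert") matches neither /var/prometheus/rules/*_rules.yml nor /var/prometheus/rules/*_alerts.yml, so Prometheus would never load that group even after a successful reload. A sketch of two possible fixes, assuming the ConfigMap keys are mounted as files under /var/prometheus/rules/:

    # Option 1: rename the ConfigMap key so it matches the existing glob
    proxy_alerts.yml: |
      groups:
      - name: proxy_alert
        ...

    # Option 2: widen the globs in prometheus.yml
    rule_files:
      - "/var/prometheus/rules/*_rules.yml"
      - "/var/prometheus/rules/*_alerts.yml"
      - "/var/prometheus/rules/*_alert.yml"

Either way, confirm the group appears under Status -> Rules in the Prometheus UI after the reload.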