AWS CloudWatch metric alarm not triggering after the first time

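I have an alarm that looks for ERROR messages in the logs, and it does go into the ALARM state. But it never resets and stays in the ALARM state. I have an SNS topic as the alarm action, which in turn triggers an email. So basically, after the first error, I don't see any subsequent emails. What is wrong with the template configuration below?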

"AppErrorMetric": {
  "Type": "AWS::Logs::MetricFilter",
  "Properties": {
    "LogGroupName": {
      "Ref": "AppServerLG"
    },
    "FilterPattern": "[error]",
    "MetricTransformations": [
      {
        "MetricValue": "1",
        "MetricNamespace": {
          "Fn::Join": [
            "",
            [
              {
                "Ref": "ApplicationEndpoint"
              },
              "/metrics/AppError"
            ]
          ]
        },
        "MetricName": "AppError"
      }
    ]
  }
},
"AppErrorAlarm": {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "ActionsEnabled": "true",
            "AlarmName": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "AppId"
                        },
                        ",",
                        {
                            "Ref": "AppServerAG"
                        },
                        ":",
                        "AppError",
                        ",",
                        "MINOR"
                    ]
                ]
            },
            "AlarmDescription": {
                "Fn::Join": [
                    "",
                    [
                        "service is throwing error. Please check logs.",
                        {
                            "Ref": "AppServerAG"
                        },
                        "-",
                        {
                            "Ref": "AppId"
                        }
                    ]
                ]
            },
            "MetricName": "AppError",
            "Namespace": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "ApplicationEndpoint"
                        },
                        "metrics/AppError"
                    ]
                ]
            },
            "Statistic": "Sum",
            "Period": "300",
            "EvaluationPeriods": "1",
            "Threshold": "1",
            "AlarmActions": [{
              "Fn::GetAtt": [
                "VPCInfo",
                "SNSTopic"
              ]
            }],
            "ComparisonOperator": "GreaterThanOrEqualToThreshold"
        }
}

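Your problem is a combination of two factors:

  • Your metric is only emitted when an error is found. It is a sparse metric: a 1 is emitted when an error occurs, but no 0 is emitted when there are no errors.
  • By default, a CloudWatch alarm is configured with TreatMissingData set to missing.
  • The CloudWatch documentation says:

    For each alarm, you can specify CloudWatch to treat missing data points as any of the following:

    • notBreaching – Missing data points are treated as "good" and within the threshold
    • breaching – Missing data points are treated as "bad" and breaching the threshold
    • ignore – The current alarm state is maintained
    • missing – The alarm does not consider missing data points when evaluating whether to change state
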
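    Adding the "TreatMissingData": "notBreaching" parameter to the alarm configuration causes CloudWatch to treat missing data points as not breaching, so the alarm transitions back to OK: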

    "AppErrorAlarm": {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "ActionsEnabled": "true",
                "AlarmName": {
                    "Fn::Join": [
                        "",
                        [
                            {
                                "Ref": "AppId"
                            },
                            ",",
                            {
                                "Ref": "AppServerAG"
                            },
                            ":",
                            "AppError",
                            ",",
                            "MINOR"
                        ]
                    ]
                },
                "AlarmDescription": {
                    "Fn::Join": [
                        "",
                        [
                            "service is throwing error. Please check logs.",
                            {
                                "Ref": "AppServerAG"
                            },
                            "-",
                            {
                                "Ref": "AppId"
                            }
                        ]
                    ]
                },
                "MetricName": "AppError",
                "Namespace": {
                    "Fn::Join": [
                        "",
                        [
                            {
                                "Ref": "ApplicationEndpoint"
                            },
                            "metrics/AppError"
                        ]
                    ]
                },
                "Statistic": "Sum",
                "Period": "300",
                "EvaluationPeriods": "1",
                "Threshold": "1",
                "TreatMissingData": "notBreaching",
                "AlarmActions": [{
                  "Fn::GetAtt": [
                    "VPCInfo",
                    "SNSTopic"
                  ]
                }],
                "ComparisonOperator": "GreaterThanOrEqualToThreshold"
            }
    }
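
A complementary option is to address the first factor, the sparse metric, on the metric filter side. The sketch below reuses the AppErrorMetric resource from the question and adds a DefaultValue of 0 to the metric transformation; choosing 0 is an assumption about what "no error" should look like for this metric. As far as I know, the default value is only reported for periods in which log events are ingested but none of them match the filter, so if the log group can go completely silent, the metric is still missing for those periods and the TreatMissingData setting above remains relevant.

```json
"AppErrorMetric": {
  "Type": "AWS::Logs::MetricFilter",
  "Properties": {
    "LogGroupName": {
      "Ref": "AppServerLG"
    },
    "FilterPattern": "[error]",
    "MetricTransformations": [
      {
        "MetricValue": "1",
        "DefaultValue": 0,
        "MetricNamespace": {
          "Fn::Join": [
            "",
            [
              {
                "Ref": "ApplicationEndpoint"
              },
              "/metrics/AppError"
            ]
          ]
        },
        "MetricName": "AppError"
      }
    ]
  }
}
```

With both changes in place, quiet periods produce either an explicit 0 or a missing data point that is treated as not breaching, so the alarm can return to OK and notify again on the next error.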