Stackdriver Monitoring Automation Part 2: Alerting Policies

Charles
Google Cloud - Community
Sep 28, 2018


This post is part 2 in the Stackdriver Automation series. In part 1, I covered automating the management of Stackdriver Groups. In this post, I will walk through the steps that you can use to automate the management of Alerting Policies. Head over to part 1 for the background and prerequisites.

Alerting Policies

Alerting Policies let you define a set of conditions that generate a notification whenever those conditions are met. The notification can optionally include documentation and metadata to give the person receiving the alert additional context. Alerting policies can stand alone or can be attached to Stackdriver Groups.

There are several notification channels, including email and SMS, as well as integrations with third parties such as PagerDuty, HipChat and Slack. Alternatively, you can use a webhook to call a service that you define whenever a notification is generated. The available alerting conditions, together with the monitoring filters, provide a rich set of options to tailor your notifications.
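
For reference, a notification channel is itself a small resource in the Monitoring v3 API. Here is a minimal sketch of an email channel (the display name and address are placeholders); the same fields appear in the Deployment Manager template later in this post:

{
  "type": "email",
  "displayName": "Website Oncall",
  "labels": {
    "email_address": "website-oncall@example.com"
  },
  "enabled": true
}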

As an example, I created alerting based on the Apache infrastructure that I described in part 1. The application itself is a simple website served by Apache servers behind a load balancer. Selecting rational alerting requires consideration of the application’s users and the application architecture. I turned to Site Reliability Engineering (SRE) methods to select the metrics used for alerting.

SRE has a concept of service level indicators (SLIs) as a way to track your application’s performance metrics as measured by user impact. Good SLIs should make it easy to know when the performance of your application is making your users frustrated and when your users are having a good experience using the app. The main SLI categories are latency, availability, volume and quality. This article uses these SLIs to drive the alerting. For a production application, the selection of SLIs should be done with more rigor. This was a demo app after all!

SLI Alerting Metrics

The main Stackdriver Alerting conditions, notifications and documentation that I selected were the following:

Conditions

For Availability & Quality:

  • L7 load balancer error request count > 0/sec

For Latency:

  • L7 load balancer total latency > 100ms
  • L7 load balancer Frontend RTT latency > 50ms
  • L7 load balancer Backend RTT latency > 50ms

For Volume:

  • L7 load balancer request count > 100/sec

Notification channels

  • Email notifications to the website on-call and website support addresses

Documentation

  • Specific text describing each alerting policy
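
Each of these conditions boils down to a Stackdriver Monitoring metric filter plus a comparison and a threshold. For example, the total latency condition translates into roughly the following (the same values appear in the full yaml later in this post):

filter: metric.type="loadbalancing.googleapis.com/https/total_latencies" resource.type="https_lb_rule" resource.label.url_map_name="web-map"
comparison: COMPARISON_GT
thresholdValue: 100
duration: 60s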

Now that I had determined the specific metrics to monitor and the notification method, I needed to translate those rules into the format accepted by the Stackdriver AlertingPolicies API.

The projects.alertPolicies.create API lists the following fields that define an Alerting Policy.

{
  "name": string,
  "displayName": string,
  "documentation": {
    object(Documentation)
  },
  "userLabels": {
    string: string,
    ...
  },
  "conditions": [
    {
      object(Condition)
    }
  ],
  "combiner": enum(ConditionCombinerType),
  "enabled": boolean,
  "notificationChannels": [
    string
  ],
  "creationRecord": {
    object(MutationRecord)
  },
  "mutationRecord": {
    object(MutationRecord)
  }
}

The alerting policies are quite flexible and offer the ability to filter, group and aggregate metrics, and then measure how they change over time. I experimented with creating alerting policies in the UI and then used the “Try this API” sidebar in the projects.alertPolicies.create docs to view the resulting configuration. That approach, along with the Stackdriver Monitoring docs, helped me translate the 5 alerting conditions that I had selected into 5 alerting policies. I split the configuration into a jinja template and a yaml file so that I could reuse the jinja template for any other Alerting Policies.
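
To make the schema concrete, here is a sketch of a single threshold condition in the JSON form that the API expects, using the values from the error-count policy defined later in this post:

{
  "displayName": "Google Cloud HTTP/S Load Balancing Rule - Request count (filtered) [COUNT]",
  "conditionThreshold": {
    "filter": "metric.type=\"loadbalancing.googleapis.com/https/request_count\" resource.type=\"https_lb_rule\" metric.label.response_code!=\"200\"",
    "comparison": "COMPARISON_GT",
    "thresholdValue": 1,
    "duration": "60s",
    "trigger": {
      "count": 1
    },
    "aggregations": [
      {
        "alignmentPeriod": "60s",
        "perSeriesAligner": "ALIGN_RATE",
        "crossSeriesReducer": "REDUCE_COUNT"
      }
    ]
  }
}

The filter selects the time series, the aggregations align and reduce them, and the comparison, threshold and duration decide when the condition fires.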

stackdriver_alertingpolicies.jinja

{% set PREFIX = env["deployment"] %}
{% set NOTIFICATION_EMAILS = properties["notificationEmails"] %}
{% set POLICIES = properties["policies"] %}
{% set PROJECT = env["project"] %}
{% set DEFAULT_MIME_TYPE = "text/markdown" %}
resources:
{% for email in NOTIFICATION_EMAILS %}
- name: {{ PREFIX }}-email-{{ loop.index }}
  type: gcp-types/monitoring-v3:projects.notificationChannels
  properties:
    name: projects/{{ PROJECT }}
    type: email
    displayName: {{ email.displayName }}
    labels:
      email_address: {{ email.emailAddress }}
    enabled: true
{% endfor %}
{% for policy in POLICIES %}
- name: {{ PREFIX }}-alertingpolicy-{{ loop.index }}
  type: gcp-types/monitoring-v3:projects.alertPolicies
  properties:
    displayName: {{ PREFIX }}-{{ policy.name }}
    documentation:
      content: {{ policy.documentation.content }}
      mimeType: {{ DEFAULT_MIME_TYPE }}
    combiner: OR
    conditions:
    {% for condition in policy.conditions %}
    - displayName: {{ condition.displayName }}
      conditionThreshold:
        filter: {{ condition.filter }}
        comparison: {{ condition.comparison }}
        duration: {{ condition.duration }}
        thresholdValue: {{ condition.thresholdValue }}
        trigger: {{ condition.trigger }}
        aggregations: {{ condition.aggregations }}
    {% endfor %}
    notificationChannels:
    {% for notification in NOTIFICATION_EMAILS %}
    - $(ref.{{ PREFIX }}-email-{{ loop.index }}.name)
    {% endfor %}
    enabled: true
{% endfor %}

Notice that the yaml defines multiple notification channels and policies. In the jinja template, I separated the email notification channel creation from the policies and attached the same email notifications to each policy.
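
Each policy therefore ends up referencing both channels. After Deployment Manager expands the template, the notificationChannels block of every policy looks like this (assuming the deployment is named stackdriver-alertingpolicies-apache, as in the example below):

notificationChannels:
- $(ref.stackdriver-alertingpolicies-apache-email-1.name)
- $(ref.stackdriver-alertingpolicies-apache-email-2.name)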

stackdriver_alertingpolicies.yaml

Please note that I’ve only included two alerting policies here for the sake of readability. See the GitHub repo for the full yaml file.

imports:
- path: stackdriver_alertingpolicies.jinja

resources:
- name: create_alertingpolicies
  type: stackdriver_alertingpolicies.jinja
  properties:
    notificationEmails:
    - emailAddress: website-oncall@example.com
      displayName: "Website Oncall"
    - emailAddress: support-website@example.com
      displayName: "Website Support"
    policies:
    - name: "1 - Availability - Google Cloud HTTP/S Load Balancing Rule - Request count (filtered) [COUNT]"
      conditions:
      - filter: "metric.type=\"loadbalancing.googleapis.com/https/request_count\" resource.type=\"https_lb_rule\" metric.label.response_code!=\"200\""
        comparison: "COMPARISON_GT"
        duration: "60s"
        thresholdValue: 1
        trigger:
          count: 1
        aggregations:
        - alignmentPeriod: "60s"
          perSeriesAligner: "ALIGN_RATE"
          crossSeriesReducer: "REDUCE_COUNT"
        displayName: "Google Cloud HTTP/S Load Balancing Rule - Request count (filtered) [COUNT]"
      documentation:
        content: "The load balancer rule ${condition.display_name} has generated this alert for the ${metric.display_name}."
    - name: "2 - Latency - Google Cloud HTTP/S Load Balancing Rule - Total Latency (filtered) [99 percentile]"
      conditions:
      - filter: "metric.type=\"loadbalancing.googleapis.com/https/total_latencies\" resource.type=\"https_lb_rule\" resource.label.url_map_name=\"web-map\""
        comparison: "COMPARISON_GT"
        duration: "60s"
        thresholdValue: 100
        trigger:
          count: 1
        aggregations:
        - alignmentPeriod: "60s"
          perSeriesAligner: "ALIGN_PERCENTILE_99"
        displayName: "Google Cloud HTTP/S Load Balancing Rule - Total Latency (filtered) [99 percentile]"
      documentation:
        content: "The load balancer rule ${condition.display_name} has generated this alert for the ${metric.display_name}."

You can find the jinja and yaml files in the GitHub repo.
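
As an optional aside, Deployment Manager also supports previewing a deployment before executing it, which is a useful dry run for a new set of policies; a sketch of that flow (the exact behavior may vary with your gcloud SDK version), whereas the straight create shown next skips the preview:

$ gcloud deployment-manager deployments create stackdriver-alertingpolicies-apache --config stackdriver_alertingpolicies.yaml --preview
$ gcloud deployment-manager deployments update stackdriver-alertingpolicies-apache          # apply the previewed changes
$ gcloud deployment-manager deployments cancel-preview stackdriver-alertingpolicies-apache  # or discard them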

The last step was to use the gcloud command line below to actually create the Stackdriver Alerting Policies.

$ gcloud deployment-manager deployments create stackdriver-alertingpolicies-apache --config stackdriver_alertingpolicies.yaml
Create operation operation-1537809007185-576a10f9b0c68-751256f4-45889de0 completed successfully.
NAME TYPE STATE ERRORS INTENT
stackdriver-alertingpolicies-apache-alertingpolicy-1 gcp-types/monitoring-v3:projects.alertPolicies COMPLETED []
stackdriver-alertingpolicies-apache-alertingpolicy-2 gcp-types/monitoring-v3:projects.alertPolicies COMPLETED []
stackdriver-alertingpolicies-apache-alertingpolicy-3 gcp-types/monitoring-v3:projects.alertPolicies COMPLETED []
stackdriver-alertingpolicies-apache-alertingpolicy-4 gcp-types/monitoring-v3:projects.alertPolicies COMPLETED []
stackdriver-alertingpolicies-apache-alertingpolicy-5 gcp-types/monitoring-v3:projects.alertPolicies COMPLETED []
stackdriver-alertingpolicies-apache-email-1 gcp-types/monitoring-v3:projects.notificationChannels COMPLETED []
stackdriver-alertingpolicies-apache-email-2 gcp-types/monitoring-v3:projects.notificationChannels COMPLETED []

Once the deployment completed, I used the Stackdriver Monitoring console to verify that the alerting policies had been created successfully.
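
You can also verify from the command line. Depending on your Cloud SDK version, the Monitoring commands live under the alpha or beta surface; something like the following lists the policies and channels that the deployment created:

$ gcloud alpha monitoring policies list --format="value(displayName)"
$ gcloud alpha monitoring channels list --format="value(displayName,type)"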

Testing the alerting

I intentionally set the threshold values low to make the alerting easy to test. Picking a low latency threshold, for example, generates alerts that you can use to verify that notifications arrive successfully and include the details that you intended. Here’s a sample of an alert that I received.
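
To generate alerts on demand, a short burst of traffic against the load balancer is enough to trip the volume and latency conditions. A quick sketch using Apache Bench (the IP address is a placeholder for your load balancer’s frontend):

$ ab -n 10000 -c 50 http://203.0.113.10/
$ while true; do curl -s -o /dev/null http://203.0.113.10/; done   # alternative using curl

Stopping the traffic lets the conditions clear again, which exercises the resolution behavior described next.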

Whenever the condition that caused an alert cleared, the alert was automatically resolved. Here’s a sample of the email resolution notice.

This concludes part 2 of the series. Read more about Stackdriver Monitoring Automation in the other posts in the series and references below.

References:
