---
draft: false
title: "Never forget is_alert_recovery"
aliases: ["Never forget is_alert_recovery"]
series: []
date: "2023-03-05"
author: "Nick Dumas"
cover: ""
keywords: ["", ""]
description: "Making sure PagerDuty leaves you alone"
showFullContent: false
tags:
- pagerduty
- datadog
- devops
---

## Synthetics and You

In the context of monitoring, a synthetic test is one that simulates an actual user. This is a useful and important part of the visibility triad: logs, **metrics**, and traces. Synthetics let you take (approximate) measurements of what a real user might experience, which can help maintain SLAs or act as health checks for your connection between origins and CDNs.

## Hands Off Metrics

The system we have is working great. The synthetics are provisioned into Datadog by a very clever system that pulls from a YAML file, sparing us from having to hard-code every single monitor.

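As a purely illustrative sketch (the field names below are hypothetical, not the actual schema our tooling uses), a definitions file along those lines might look like:

```yaml
# Hypothetical monitor-definitions file; the schema shown here is
# illustrative only, not the one our provisioning actually uses.
synthetics:
  - name: "checkout-page-load"
    type: browser
    url: "https://example.com/checkout"
    locations: ["aws:us-east-1"]
    assertions:
      - type: statusCode
        operator: is
        target: 200
    notify:
      pagerduty: "Orgname-teamname"
      teams: "Orgname-teamname"
```

In a setup like this, covering a new page is one more entry in the list instead of another hand-built monitor.
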
Alerts are handled via PagerDuty, which is a pretty good enterprise paging solution.

Together, these monitor internal (micro)services as well as perform synthetic testing by loading a headless browser instance to navigate the site. This gives us great visibility into what's healthy and what's not after deployments or maintenance.

## This alert will retrigger every 10 minutes

Around 0230 Sunday morning, I got an alert. A synthetic targeting one of our key user-facing pages had triggered. The first step was to open the incident in PagerDuty.

Nothing looked out of the ordinary, and when I followed the links, the monitor showed the page had returned a 300 error for about 3 minutes and then resumed with the 200s. I thought nothing of it and went to sleep after a long and productive weekend.

I woke up to find out the incident had been paging all night. What happened?

I loaded the monitor and it had been green for hours. Not a great sign.

## is_alert_recovery

After a bit of investigation and an assist from a good friend, we tracked down the root cause.

Your Datadog monitors have a field you can use to define the string that will be used as the message for pages. Confusingly, this same field is also where you configure where the message is sent.

You'd think an enterprise application would let you send different messages to different destinations. Oh well.

The monitor message was the culprit here. It turns out that there's a very important variable: `is_alert_recovery`. If you don't use this, Datadog will not send PagerDuty the "stop triggering this incident" signal, even when the monitor resolves.

```
{{#is_alert_recovery}} Customer facing page failed to return an HTTP 200 response within 5 seconds. @pagerduty-Orgname-teamname @teams-Orgname-teamname
{{/is_alert_recovery}}
```

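To make the idea concrete, here's a rough sketch of what a message with both branches could look like, so the alert and the recovery each notify PagerDuty; the wording and handles are placeholders rather than our real configuration:

```
{{#is_alert}}
Customer facing page failed to return an HTTP 200 response within 5 seconds. @pagerduty-Orgname-teamname @teams-Orgname-teamname
{{/is_alert}}

{{#is_alert_recovery}}
Customer facing page is returning HTTP 200 again. @pagerduty-Orgname-teamname @teams-Orgname-teamname
{{/is_alert_recovery}}
```

The important part is that the `@pagerduty` handle appears inside the recovery branch; that notification is what lets PagerDuty resolve the incident once the monitor recovers.
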
This was a real pain in the ass. The monitor was retriggering **every ten minutes**. Luckily, I have a good team to work with, and I was familiar with the monitors since I created them. The solution? Manually resolve the incident. Fixed. It didn't retrigger.

## A good night's sleep

I didn't read the documentation when creating my monitor, or check for best practices. This one's fully on me. Hopefully I'll remember next time.