---
draft: false
title: "Never forget is_alert_recovery"
aliases: []
series: []
date: 2023-03-05
author: Nick Dumas
cover: ""
keywords: []
summary: "Making sure PagerDuty leaves you alone"
showFullContent: false
tags: []
---
## Synthetics and You
In the context of monitoring, a synthetic test is one that simulates an actual user. It's a useful and important complement to the visibility triad of logs, metrics, and traces. Synthetics let you take (approximate) measurements of what a real user might experience, which can help you maintain SLAs or act as health checks for the connection between your origins and CDNs.
## Hands Off Metrics
The system we have works great. The synthetics are provisioned into Datadog by a very clever system that pulls from a YAML file, sparing us from having to hard-code every single monitor.
Alerts are handled via PagerDuty, which is a pretty good enterprise paging solution.
Together, these monitor internal (micro)services and run synthetic tests that drive a headless browser through the site. This gives us great visibility into what's healthy and what's not after deployments or maintenance.
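To give a flavor of the YAML file mentioned above, here's a minimal sketch of what one entry might look like. The schema is entirely hypothetical (our internal tooling has its own format); the point is that each check is declared once and the provisioning system turns it into a Datadog synthetic.

```yaml
# Hypothetical schema, for illustration only; these key names are not our tooling's real format.
synthetics:
  - name: checkout-page-load
    type: browser             # drive a headless browser, like the tests described above
    url: https://www.example.com/checkout
    locations: [aws:us-east-1, aws:eu-west-1]
    frequency_seconds: 300
    notify:
      - "@pagerduty-Orgname-teamname"
      - "@teams-Orgname-teamname"
```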
## This alert will retrigger every 10 minutes
Around 0230 Sunday morning, I got an alert: a synthetic targeting one of our key user-facing pages had triggered. The first step was to open the incident in PagerDuty.
Nothing looked out of the ordinary. When I followed the links, the monitor showed the page had returned a 300 for about three minutes and then gone back to returning 200s. I thought nothing of it and went to sleep after a long and productive weekend.
I woke up to find out the incident had been paging all night. What happened?
I loaded the monitor and it had been green for hours. Not a great sign.
## is_alert_recovery
After a bit of investigation and an assist from a good friend, we tracked down the root cause.
Your Datadog monitors have a message field where you define the string that gets sent out when the monitor pages. Confusingly, that same string is also where you configure where the message is sent: the notification handles live right in the message body.
You'd think an enterprise application would let you send different messages to different destinations. Oh well.
The monitor message was the culprit here. It turns out there's a very important conditional variable: is_alert_recovery. If you don't use it, Datadog will never send PagerDuty the "stop triggering this incident" signal, even when the monitor itself resolves.
```
{{#is_alert_recovery}}
Customer facing page failed to return an HTTP 200 response within 5 seconds. @pagerduty-Orgname-teamname @teams-Orgname-teamname
{{/is_alert_recovery}}
```
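More generally, the pattern pairs an {{#is_alert}} block with an {{#is_alert_recovery}} block, with the PagerDuty handle inside both, so PagerDuty receives the trigger event and the resolve event. The sketch below is illustrative wording, not our production message.

```
{{#is_alert}}
Customer facing page failed to return an HTTP 200 response within 5 seconds. @pagerduty-Orgname-teamname @teams-Orgname-teamname
{{/is_alert}}

{{#is_alert_recovery}}
The page is returning 200s again. @pagerduty-Orgname-teamname @teams-Orgname-teamname
{{/is_alert_recovery}}
```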
This was a real pain in the ass: the monitor kept re-triggering every ten minutes. Luckily I have a good team to work with, and I was familiar with the monitors since I created them. The solution? Manually resolve the incident. Once I did, it stopped re-triggering.
## A good night's sleep
I didn't read the documentation when creating my monitor, or check for best practices. This one's fully on me. Hopefully I'll remember next time.