Feb 19, 2024
Posted by Avthar Sewrathan
We finished part two of our Grafana 101 webinar series, a set of technical sessions designed to get you up and running with Grafana’s various capabilities—complete with step-by-step demos, tips, and resources to recreate what we cover.
In this one, we focus on database monitoring with Getting Started With Alerts, where I go through what alerting in Grafana entails, show you how to select and set up three common alerts for key metrics, and create triggers to send notifications through popular channels. And, to make sure you leave ready to set up your own alerting and monitoring systems, I share various best practices and things I’ve learned along the way.
We know not everyone could make it live, so we’ve published the recording and slides for anyone and everyone to access at any time.
Alerting is a crucial part of any monitoring setup. But getting alerts set up is often tricky and time-consuming, especially if you’re dealing with multiple data sources.
Thankfully, Grafana’s alerting functionality lets you configure your visualizations and alerts for the metrics you care about in the same place!
While Grafana may be best known for its visualization capabilities, it’s also a powerful alerting tool. Personally, I like using it to notify me about anomalies, because it saves me the overhead of adding another piece of software to my stack—and I know many community members feel the same.
I break the session into four parts:
Alerts tell us when things go wrong and get humans to take action.
When you implement alerts in any scenario, there are two important universal best practices:
In this section, I cover how alerts work in Grafana and their two constituent parts: alert rules and notification channels. Note: Grafana’s alerting functionality only works for graph panels with time-series output. At first, this may seem limiting, but thankfully, it isn’t for two reasons:
You’ll also learn about the anatomy of alert rules and conditions, see how the FOR parameter works, and understand the various states alerts can take, depending on whether their associated alert rules are TRUE or FALSE.
After seeing the basics, we jump into the fun part: creating and testing alerts!
Using the scenario of monitoring a production TimescaleDB database, we set up different types of alerts for common monitoring metrics and connect our alert monitoring to popular notification channels:
Alert Type: Alerts using FOR
Alert Type: Alerts without FOR
Alert Type: Alerts with NO DATA
Alerts using FOR
This part of the demo shows how to define an alert for sustained high memory usage on the database using the Grafana alerting parameter FOR. The FOR parameter specifies the amount of time for which an alert rule must be true before the ALERTING state is triggered and an alert is sent via a notification channel.
Using FOR is common in many alerting scenarios, as you often want to wait for your alert rule to be true for a period of time in order to avoid false positives (and waking people up in the middle of the night without cause).
Once we’ve defined our alert rule in Grafana, I show you how to set up Slack as a notification channel so alert messages reach you (or the right team members) in a timely manner. You’ll see how to customize the message body with pertinent info, send notifications to specific channels, and mention team members.
I also take you through how FOR alerts work in Grafana, using a state transition diagram to give you a mental model for when, how, and why to use them.
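To make the FOR semantics concrete, here’s a minimal Python sketch of the state machine (a simplification for illustration, not Grafana’s actual implementation): the rule must stay true for the whole FOR period before the alert moves from PENDING to ALERTING, while a false evaluation drops back to OK immediately.

```python
# Sketch (not Grafana's code) of how FOR gates the PENDING -> ALERTING
# transition. States: OK, PENDING, ALERTING.

FOR_SECONDS = 300  # e.g., FOR = 5m

def next_state(state, rule_is_true, pending_since, now):
    """Return (new_state, new_pending_since) after one evaluation."""
    if not rule_is_true:
        # Leaving ALERTING (or PENDING) happens immediately;
        # FOR only delays *entry* into ALERTING.
        return "OK", None
    if state == "OK":
        return "PENDING", now
    if state == "PENDING" and now - pending_since >= FOR_SECONDS:
        return "ALERTING", pending_since
    return state, pending_since

# Simulate evaluations every 60s: rule turns true at t=0, false at t=420.
state, since = "OK", None
history = []
for t in range(0, 600, 60):
    state, since = next_state(state, rule_is_true=(t < 420),
                              pending_since=since, now=t)
    history.append((t, state))
# history shows PENDING until t=300, ALERTING from 300-360, OK from 420 on.
```

Note how the transition back to OK at t=420 is immediate, with no FOR delay, which also answers one of the Q&A questions below.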
Alerts without FOR
This second part of the demo shows how to use a simple threshold condition to define an alert for high disk volume; this alert rule doesn’t use the FOR parameter, so alerts are sent as soon as the rule is triggered. In this example, there’s no need to use the FOR parameter, since disk usage (the metric we’re alerting on) doesn’t fluctuate up and down over time and usually only increases. Therefore, we can send out an alert as soon as our alert condition is TRUE.
Once we’ve defined our alert rule, I show you how to connect Grafana to PagerDuty, so that alerts in Grafana create cases in PagerDuty and notify teams via phone, email, or text, based on whatever rules and notification methods you’ve set up in PagerDuty.
As in the first example, I use a state transition diagram to help you visualize how this works and when you’d use it versus other types.
Alerts with NO DATA
The final part of the demo shows how to define an “aliveness” alert on our database so we know if it’s up or down. Like our high disk volume alert, this one uses a threshold condition. To show you how NO DATA alerts trigger, I turn my demo database off to simulate an outage.
This immediately triggers NO DATA alerts for the other alert rules on metrics from the database, namely sustained high memory usage and high disk usage, the two alert rules we set up earlier in the demo.
NO DATA alert rules are useful for distinguishing between an alert rule being true and there being no data with which to evaluate the alert rule. You would use ALERTING as the state for the former condition and NO DATA as the state for the latter. Usually, NO DATA alerts indicate that there is a problem with the data source (e.g., the data source is down or has lost its connection to Grafana).
From there, I show you how to connect Grafana to OpsGenie, so that alerts in Grafana create cases and alerts in OpsGenie, which can then notify various teams, using methods like email, text, and phone call. I also show how to send notifications via our existing Slack notification channel.
And, of course, I share a diagram for this one too :).
Want to recreate the sample monitoring setup shown in the demo? Or perhaps you want to modify it to use your data sources and notification channels to set up your own monitoring system? No worries, we have you covered!
We link to several resources and tutorials to get you on your way to monitoring mastery:
Here’s a selection of questions we received (and answered) during the session:
Q: Does alerting work with templating or parameterized queries?
A: Template variables in Grafana do not work when used in alerts. As a workaround, you could set up a custom graph that only displays the output of a specific value that you’re templating against (e.g., a specific server name) and then define alerts for that graph.
If you use Prometheus, Prometheus Alertmanager might be worth exploring as well.
Alternatively, you can explore using wildcards in your queries to mimic the effect of templating. See the following Grafana issues for more information: Issue 1, Issue 2
Q: If we set FOR to 5 minutes, we wait 5 minutes to go from PENDING to ALERTING. Do we wait for the same period of time to transition back from ALERTING to OK?
A: No. The FOR parameter defines the amount of time the alert rule must be TRUE before an alert is sent, and it applies to transitions from PENDING to ALERTING only. You don’t need to wait for the FOR period of time to go from ALERTING to OK.
As soon as the alert rule evaluates to FALSE (i.e., the issue is resolved), the alert state will change from ALERTING to OK and be re-evaluated the next time your alert job runs.
Q: When specifying the condition to use in your alert rule, what does “now” mean? Why do you select to alert on Query A from 5 minutes ago till now? Isn't it always now?
A: Grafana requires a period of time over which to evaluate whether alerts need to be sent. This is the window over which you want to do calculations, which is why many conditions use aggregate functions like AVG, MAX, DIFF, LAST, etc.
So, when we say Query(A, 5m, now), it means that we evaluate the alert rule on Query A from 5 minutes ago until now. You can also set alert rules like Query(A, 6m, now-1min) to account for a lag in data being inserted into your data source or for network latency.
This is important because if an alert condition uses Query(A, 5m, now) and there is no data available in that window, NO DATA alerts will fire (i.e., there isn’t any data with which to evaluate the rule).
You’ll want to select a time interval to evaluate your aggregate function that’s greater (longer) than the time period in which you add new data to the graph that’s triggering your alerts.
For example, in my demo, we scrape our database every 10 seconds. If we set an alert rule like Query(A, 5s, now), we’d constantly end up in the NO DATA alert state, because our data is not that fine-grained.
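A short Python sketch makes this concrete (timestamps are illustrative, matching the demo’s 10-second scrape interval; this mimics the window selection, not Grafana’s query engine): a 5-second window usually contains no samples, while a 5-minute window contains plenty.

```python
# Sketch: with a 10s scrape interval, a 5s evaluation window will often
# contain no samples, so the rule evaluates to NO DATA.

def samples_in_window(sample_times, window_start, window_end):
    """Return the sample timestamps that fall inside the window."""
    return [t for t in sample_times if window_start <= t <= window_end]

scrapes = list(range(0, 61, 10))  # samples at t = 0, 10, ..., 60 seconds

now = 66
narrow = samples_in_window(scrapes, now - 5, now)    # like Query(A, 5s, now)
wide = samples_in_window(scrapes, now - 300, now)    # like Query(A, 5m, now)

print(narrow)  # [] -> nothing to evaluate, so NO DATA
print(wide)    # seven samples -> the rule can be evaluated
```

This is why the evaluation window should be longer than the interval at which new data lands on the graph.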
Thanks to those who joined me live! To those who couldn’t make it, I and the rest of Team Timescale are here to help at any time. Reach out via our Slack and we’ll happily assist!
For easy reference, here are the recording and slides for you to check out, re-watch, and share with teammates.
Want to level up your Grafana skills even more? Check out Guide to Grafana 101: Getting Started with Interactivity, Templating and Sharing.
To learn about future sessions and get updates about new content, releases, and other technical content, subscribe to our Biweekly Newsletter.
Excited to see you at the next session!