Grafana 101: Getting Started with Alerting [Recap]

We finished part two of our Grafana 101 webinar series, a set of technical sessions designed to get you up and running with Grafana’s various capabilities—complete with step-by-step demos, tips, and resources to recreate what we cover.

In this one, we focus on database monitoring with Getting Started With Alerts, where I go through what alerting in Grafana entails, show you how to select and set up three common alerts for key metrics, and create triggers to send notifications through popular channels. And, to make sure you leave ready to set up your own alerting and monitoring systems, I share various best practices and things I’ve learned along the way.

We know not everyone could make it live, so we’ve published the recording and slides for anyone and everyone to access at any time.

Missed the session? Don't worry, here's the recording!

What you’ll learn about alerting

Alerting is a crucial part of any monitoring setup. But getting alerts set up is often tricky and time-consuming, especially if you’re dealing with multiple data sources.

Thankfully, you can configure your visualizations and alerts for the metrics you care about in the same place, thanks to Grafana’s alerting functionality!

While Grafana may be best known for its visualization capabilities, it’s also a powerful alerting tool. Personally, I like using it to notify me about anomalies, because it saves me the overhead of adding another piece of software to my stack—and I know many community members feel the same.

I break the session into four parts:

Alerting Principles

Alerts tell us when things go wrong and get humans to take action.

When you implement alerts in any scenario, there are two important universal best practices:

  • Avoid over-alerting: If engineers get alerts too frequently, the alerts cease to be useful or serve their purpose (i.e., instead of responding quickly, people quickly tune them out as noise).
  • Select use-case-specific alerts: Different scenarios require monitoring different metrics, so the alerts for monitoring a SaaS platform (site uptime, latency, etc.) differ from those for monitoring infrastructure (cluster health, disk usage, CPU/memory use).

Alerts in Grafana

In this section, I cover how alerts work in Grafana and their two constituent parts: alert rules and notification channels. Note: Grafana’s alerting functionality only works for graph panels with time-series output. At first, this may seem limiting, but thankfully, it isn’t for two reasons:

  • You can convert any panel type (e.g., gauge, single stat, etc.) into a graph panel and then set up your alerts accordingly.
  • The ‘time-series’ output requirement is a reasonable constraint: you want to monitor how certain metrics change over time, so your data is inherently time-series data.

You’ll also learn about the anatomy of alert rules and conditions, see how the FOR parameter works, and understand the various states alerts can take, depending on whether their associated alert rules are TRUE or FALSE.

Alerts only work on graph panels with time-series output in Grafana

Let’s Code: 3 Alerts and 3 Notification Channels

After seeing the basics, we jump into the fun part: creating and testing alerts!

Using the scenario of monitoring a production TimescaleDB database, we set up different types of alerts for common monitoring metrics and connect our alert monitoring to popular notification channels:

Alert Type: Alerts using FOR

  • Metric: Sustained high memory usage
  • Notification channel: sent via Slack, where we have a channel to notify our DevOps team about new alerts from our Grafana setup.

Alert Type: Alerts without FOR

  • Metric: Disk usage
  • Notification channel: sent via PagerDuty, where an incident is automatically created and relevant DevOps teams and support personnel are notified (according to our pre-configured PagerDuty escalation policies).

Alert Type: Alerts with NO DATA

  • Metric: Database aliveness
  • Notification channel: sent via OpsGenie, where an alert is created and sent to the DevOps team and other support personnel (according to the notification policies we’ve configured for our team in OpsGenie). We'll use Slack as an additional notification channel as well.

Alerts using FOR

This part of the demo shows how to define an alert for sustained high memory usage on the database using the Grafana alerting parameter FOR. The parameter FOR specifies the amount of time for which an alert rule must be true before the ALERTING state is triggered and an alert is sent via a notification channel.

Using FOR is common for many alerting scenarios, as you often want to wait for your alert rule to be true for a period of time in order to avoid false positives (and waking people up in the middle of the night without cause).
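To make the FOR behavior concrete, here’s a toy Python model of the state transitions described above. This is a sketch for building intuition, not Grafana’s actual implementation; the class and method names are hypothetical.

```python
from enum import Enum


class State(Enum):
    OK = "ok"
    PENDING = "pending"
    ALERTING = "alerting"


class ForAlert:
    """Toy model of Grafana's FOR parameter: the alert rule must stay
    true for `for_seconds` before the state moves from PENDING to
    ALERTING. The transition back to OK is immediate (FOR does not
    apply on the way down)."""

    def __init__(self, for_seconds):
        self.for_seconds = for_seconds
        self.state = State.OK
        self.true_since = None

    def evaluate(self, rule_is_true, now_seconds):
        if not rule_is_true:
            # Rule is false: return to OK immediately, regardless of FOR.
            self.state = State.OK
            self.true_since = None
        elif self.true_since is None:
            # Rule just became true: enter PENDING and start the FOR timer.
            self.state = State.PENDING
            self.true_since = now_seconds
        elif now_seconds - self.true_since >= self.for_seconds:
            # Rule has been true for the full FOR window: fire the alert.
            self.state = State.ALERTING
        return self.state
```

With `FOR` set to 5 minutes (300 seconds), the rule evaluating true at t=0 puts the alert in PENDING; it only reaches ALERTING once the rule has stayed true through t=300, and it drops straight back to OK on the first false evaluation.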

Once we’ve defined our alert rule in Grafana, I show you how to set up Slack as a notification channel so alert messages reach you (or the right team members) in a timely manner. You’ll see how to customize the message body with pertinent info, send notifications to specific channels, and mention team members.

We set up Slack as a notification channel to show alerts from our Grafana dashboard

I also take you through how FOR alerts work in Grafana, using a state transition diagram to give you a mental model for when, how, and why to use them.

See step-by-step lifecycles of alert states that an alert using FOR could take

Alerts without FOR

This second part of the demo shows how to use a simple threshold condition to define an alert for high disk volume; this alert rule doesn’t use the FOR parameter, so alerts are sent as soon as the rule is triggered. In this example, there’s no need to use the FOR parameter, since disk usage (the metric we’re alerting on) doesn’t fluctuate up and down with time and usually only increases. Therefore, we can send out an alert as soon as our alert condition is TRUE.

Once we’ve defined our alert rule, I show you how to connect Grafana to PagerDuty, so that alerts in Grafana create cases in PagerDuty and notify teams via phone, email, or text, based on your PagerDuty configurations (e.g., whatever rules and notification methods you’ve set up in PagerDuty).

We set up PagerDuty as a notification channel to show alerts from our Grafana dashboard

As in the first example, I use a state transition diagram to help you visualize how this works and when you’d use it versus other types.

See step-by-step lifecycles of alert states that an alert without FOR could take

Alerts with NO DATA

The final part of the demo shows how to define an “aliveness” alert for our database so we know if it’s up or down. Like our high disk volume alert, this one uses a threshold condition. To show you how NO DATA alerts trigger, I turn my demo database off to simulate an outage/downtime.

This immediately triggers NO DATA alerts for other alert rules on metrics from the database, namely sustained high memory and disk usage, the two alert rules we set up earlier in the demo.

NO DATA alert rules are useful to distinguish between an alert rule being true and there being no data with which to evaluate the alert rule. You would use ALERTING as the state for the former condition and NO DATA as the state for the latter. Usually, a NO DATA state indicates a problem with the data source (e.g., the data source is down or has lost its connection to Grafana).
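The distinction between the three outcomes can be sketched in a few lines of Python. This is a simplified model of a single evaluation, with a hypothetical `evaluate_rule` helper, not Grafana internals:

```python
def evaluate_rule(samples, threshold):
    """Return "no_data", "alerting", or "ok" for one evaluation window.

    `samples` is the list of metric values inside the window. An empty
    window means the data source produced nothing to evaluate: that is
    a NO DATA state, not an ALERTING one, and usually points at the
    source being down rather than the metric being bad.
    """
    if not samples:
        return "no_data"
    if max(samples) > threshold:
        return "alerting"
    return "ok"
```

For example, with a 90% disk-usage threshold, an empty window yields `"no_data"`, a window containing a 95% reading yields `"alerting"`, and a window of readings below the threshold yields `"ok"`.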

We set up OpsGenie and Slack as notification channels to show alerts from our Grafana dashboard

From there, I show you how to connect Grafana to OpsGenie, so that alerts in Grafana create cases and alerts in OpsGenie, which can then notify various teams, using methods like email, text, and phone call. I also show how to send notifications via our existing Slack notification channel.

And, of course, I share a diagram for this one too :).

See step-by-step lifecycles of alert states that an alert with NO DATA could take

Resources and Q+A

Want to recreate the sample monitoring setup shown in the demo? Or perhaps you want to modify it to use your data sources and notification channels to set up your own monitoring system? No worries, we have you covered!

We link to several resources and tutorials to get you on your way to monitoring mastery:

  • Reason about Grafana Alert States with this handy reference chart
Summary of all Grafana alert state transitions

Community Questions

Here’s a selection of questions we received (and answered) during the session:

Q: Does alerting work with templating or parameterized queries?

A: Template variables in Grafana do not work when used in alerts. As a workaround, you could set up a custom graph that only displays the output of a specific value that you’re templating against (e.g., a specific server name) and then define alerts for that graph.

If you use Prometheus, Prometheus Alertmanager might be worth exploring as well.

Alternatively, you can explore using wildcards in your queries to mimic the effect of templating. See the following Grafana issues for more information: Issue 1, Issue 2

Q: If we set FOR to 5min, we wait for 5mins to go from PENDING to ALERTING. Do we wait for the same period of time to transition back from ALERTING to OK?

A: No, the FOR parameter defines the amount of time the alert rule must be in the TRUE state before an alert is sent.

The FOR parameter applies to transitions from PENDING to ALERTING only. You don’t need to wait for the FOR period of time to go from ALERTING to OK.

As soon as the alert rule evaluates FALSE (i.e., the issue is resolved), the alert state will change from ALERTING to OK and be re-evaluated the next time your alert job runs.

Q: When specifying the condition to use in your alert rule, what does “now” mean? Why do you select to alert on Query A from 5 minutes ago till now? Isn't it always now?

A: Grafana requires a period of time over which to evaluate whether alerts need to be sent. This is the window your calculations run over, which is why many conditions use aggregate functions like AVG, MAX, DIFF, LAST, etc.

So, when we say Query(A, 5m, now), it means that we evaluate the alert rule on Query A from 5m ago to now. You can also set alert rules that say Query(A, 6m, now-1min) to account for a lag in data being inserted in your data source or network latency.

This is important, because if an alert condition uses Query(A, 5m, now) and there is no data available for now, NO DATA alerts will fire (i.e., there isn’t any data with which to evaluate the rule).

You’ll want the time interval your aggregate function evaluates over to be greater (longer) than the interval at which new data arrives in the graph that’s triggering your alerts.

For example, in my demo, we scrape our database every 10 seconds. If we set an alert rule like Query(A, 5s, now), we’d end up getting NO DATA alert states constantly, because our data is not that fine-grained.
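You can see this effect with a small Python sketch of window selection. The `samples_in_window` helper is hypothetical (Grafana does this internally), but it mirrors how Query(A, window, now-offset) picks out samples:

```python
def samples_in_window(points, window_seconds, now, offset_seconds=0):
    """Select the values that fall inside Query(A, window, now-offset).

    `points` is a list of (timestamp_seconds, value) pairs, e.g. one
    sample every 10 seconds from a scraper. If the window is shorter
    than the scrape interval, it can easily come back empty, which is
    exactly when a NO DATA state fires.
    """
    end = now - offset_seconds
    start = end - window_seconds
    return [v for t, v in points if start <= t <= end]


# A database scraped every 10 seconds:
points = [(0, 41.0), (10, 43.5), (20, 42.2)]
```

At t=29s, a 5-second window (i.e., Query(A, 5s, now)) covers 24s–29s and catches no samples at all, so the rule has nothing to evaluate; a 5-minute window (Query(A, 5m, now)) covers every sample. The `offset_seconds` parameter models the `now-1min` form used to absorb ingest lag.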

Ready for More Grafana Goodness?

Thanks to those who joined me live! To those who couldn’t make it, I and the rest of Team Timescale are here to help at any time. Reach out via our Slack and we’ll happily assist!

For easy reference, here are the recording and slides for you to check out, re-watch, and share with teammates.

Want to level up your Grafana skills even more? Check out Guide to Grafana 101: Getting Started with Interactivity, Templating and Sharing.

To learn about future sessions and get updates about new content, releases, and other technical content, subscribe to our Biweekly Newsletter.

Excited to see you at the next session!