Mar 08, 2024
📚 Welcome to our series about metrics! In this first post, we deep-dived into the four types of Prometheus metrics; then, we examined how metrics work in OpenTelemetry; and finally, we put the two together—explaining the differences, similarities, and integration between the metrics in both systems.
Metrics measure performance, consumption, productivity, and many other software properties over time. They allow engineers to monitor the evolution of a series of measurements (like CPU or memory usage, request duration, latencies, and so on) via alerts and dashboards. Metrics have a long history in the world of IT monitoring and are widely used by engineers together with logs and traces to detect when systems don’t perform as expected.
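In its most basic form, a metric data point is made of a metric name, the timestamp when the data point was collected, and a measurement value represented by a number.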
In the last ten years, as systems have become increasingly complex, the concept of dimensional metrics, that is, metrics that also include a set of tags or labels (i.e., the dimensions) to provide additional context, emerged. Monitoring systems that support dimensional metrics allow engineers to easily aggregate and analyze a metric across multiple components and dimensions by querying for a specific metric name and filtering and grouping by label.
For modern dynamic systems comprising many components, Prometheus, a Cloud Native Computing Foundation (CNCF) project, has become the most popular open-source monitoring software and is effectively the industry standard for metrics monitoring.
Prometheus defines a metric exposition format and a remote write protocol that the community and many vendors have adopted to expose and collect metrics, making them a de facto standard. OpenMetrics is another CNCF project that builds upon the Prometheus exposition format to offer a vendor-agnostic, standardized model for the collection of metrics, with the aim of becoming an Internet Engineering Task Force (IETF) standard.
More recently, another CNCF project, OpenTelemetry, has emerged with the goal of providing a new standard that unifies the collection of metrics, traces, and logs, enabling easier instrumentation and correlation across telemetry signals.
With a few different options to pick from, you may be wondering which standard is best for you. To help you answer this question, we have prepared a three-part blog post series in which we will be diving deep into the metric standards hosted by the CNCF.
In this first post, we will cover Prometheus metrics; in the next one, we will review OpenTelemetry metrics; and in the final blog post, we will directly compare both formats—providing some recommendations for better interoperability.
Our hope is that after reading these blog posts, you will understand the differences between each standard, so you can decide which one would best address your current (and future) needs.
First things first. The four types of metrics collected by Prometheus as part of its exposition format include Counters, Gauges, Histograms, and Summaries. These metrics are collected using a pull model where Prometheus scrapes HTTP endpoints that expose these metrics.
Those endpoints can be natively exposed by the component being monitored or exposed via one of the hundreds of Prometheus exporters built by the community. Prometheus provides client libraries in different programming languages that you can use to instrument your code.
The pull model works great when monitoring a Kubernetes cluster, thanks to service discovery and shared network access within the cluster, but it’s harder to use to monitor a dynamic fleet of virtual machines, AWS Fargate containers, or Lambda functions with Prometheus. Why?
It’s difficult to identify the metrics endpoints to be scraped, and access to those endpoints may be limited by network security policies. To solve some of those problems, the community released the Prometheus Agent Mode at the end of 2021, which only collects metrics and sends them to a monitoring backend using the remote write protocol.
Prometheus can scrape metrics in both the Prometheus exposition and the OpenMetrics formats. In both cases, metrics are exposed via HTTP using a simple text-based format (more commonly used and widely supported) or a more efficient and robust protocol buffer format. One big advantage of the text format is that it is human-readable, which means you can open it in your browser or use a tool like curl to retrieve the current set of exposed metrics.
Prometheus uses a very simple metric model with four metric types that are only supported in the client libraries. In the exposition format, every metric type is represented by one or more instances of a single underlying data type. This data type includes a metric name, a set of labels, and a float value. The timestamp is added by the monitoring backend (Prometheus, for example) or an agent when they scrape the metrics.
Each unique combination of a metric name and set of labels defines a series, while each timestamp and float value defines a sample (i.e., a data point) within a series.
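As a rough illustration of that data model, the sketch below keeps one list of samples per series, where a series is identified by its metric name plus its label set. The names `Sample`, `series_key`, `store`, and `append_sample`, as well as the second scrape value and the `list_products` label, are made up for this example and are not part of any Prometheus library.

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    timestamp: float  # seconds, added when the metric is scraped
    value: float      # sample values are floats


def series_key(name, labels):
    """Identify a series by its metric name plus its sorted label pairs."""
    return (name, tuple(sorted(labels.items())))


# Hypothetical in-memory store: one list of samples per series.
store = defaultdict(list)


def append_sample(name, labels, timestamp, value):
    store[series_key(name, labels)].append(Sample(timestamp, float(value)))


# Two scrapes of the same series, then one scrape of a different series.
append_sample("http_requests_total", {"api": "add_product"}, 1000.0, 4633433)
append_sample("http_requests_total", {"api": "add_product"}, 1015.0, 4633501)
append_sample("http_requests_total", {"api": "list_products"}, 1000.0, 120)

print(len(store))  # number of distinct series
```

Changing any label value produces a new series, which is why label sets with many possible values multiply the number of series a backend has to store.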
Some conventions are used to represent the different metric types.
A very useful feature of the Prometheus exposition format is the ability to associate metadata to metrics to define their type and provide a description. For example, Prometheus makes that information available, and Grafana uses it to display additional context that helps users select the right metric and apply the right PromQL functions.
Example of a metric exposed using the Prometheus exposition format:
# HELP http_requests_total Total number of http api requests
# TYPE http_requests_total counter
http_requests_total{api="add_product"} 4633433
The # HELP line provides a description for the metric, and the # TYPE line declares its type.
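Because the text format is line-oriented, a small parser is enough to read an exposition like the one above. The following is a minimal sketch, not a complete implementation of the format: it assumes every HELP line carries a description, and it does not handle optional timestamps or unescape label values. The function name `parse_exposition` is an assumption made for this example.

```python
import re

EXPOSITION = """\
# HELP http_requests_total Total number of http api requests
# TYPE http_requests_total counter
http_requests_total{api="add_product"} 4633433
"""

# Matches `name{label="value",...} value`; the label block is optional.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return (metadata, samples) parsed from a small text exposition."""
    metadata = {}  # metric name -> {"help": ..., "type": ...}
    samples = []   # list of (name, labels dict, float value)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# HELP ") or line.startswith("# TYPE "):
            _, kind, name, rest = line.split(" ", 3)
            metadata.setdefault(name, {})[kind.lower()] = rest
        elif line.startswith("#"):
            continue  # any other comment line is ignored
        else:
            match = SAMPLE_RE.match(line)
            if match is None:
                raise ValueError(f"cannot parse line: {line!r}")
            labels = dict(LABEL_RE.findall(match.group("labels") or ""))
            samples.append(
                (match.group("name"), labels, float(match.group("value")))
            )
    return metadata, samples


metadata, samples = parse_exposition(EXPOSITION)
print(metadata)
print(samples)
```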
Now, let's get into more detail about each of the Prometheus metrics in the exposition format.
Counter metrics are used for measurements that only increase. Therefore they are always cumulative—their value can only go up. The only exception is when the counter is restarted, in which case its value is reset to zero.
The actual value of a counter is not typically very useful on its own. A counter value is often used to compute the delta between two timestamps or the rate of change over time.
For example, a typical use case for counters is measuring API calls, which is a measurement that will always increase:
# HELP http_requests_total Total number of http api requests
# TYPE http_requests_total counter
http_requests_total{api="add_product"} 4633433
The metric name is http_requests_total, it has one label named api with a value of add_product, and the counter’s value is 4633433. This means that the add_product API has been called 4,633,433 times since the last service start or counter reset. By convention, counter metrics are usually suffixed with _total.
The absolute number does not give us much information, but when used with PromQL’s rate function (or a similar function in another monitoring backend), it helps us understand the requests per second that API is receiving. The PromQL query below calculates the average requests per second over the last five minutes:
rate(http_requests_total{api="add_product"}[5m])
To calculate the absolute change over a time period, we would use a delta function which in PromQL is called increase():
increase(http_requests_total{api="add_product"}[5m])
This would return the total number of requests made in the last five minutes, and it would be the same as multiplying the per-second rate by the number of seconds in the interval (five minutes in our case):
rate(http_requests_total{api="add_product"}[5m]) * 5 * 60
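The relationship between the two functions can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the samples are hypothetical, they span exactly five minutes, and the functions ignore the extrapolation to the window boundaries and the counter-reset handling that Prometheus applies.

```python
def increase(samples):
    """Growth of a counter between the first and last sample in the window."""
    return samples[-1][1] - samples[0][1]


def rate(samples):
    """Average per-second growth across the time covered by the samples."""
    elapsed_seconds = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed_seconds


# Hypothetical (timestamp_seconds, counter_value) samples spanning five minutes.
window = [(0, 4633433), (150, 4634033), (300, 4634933)]

print(increase(window))       # requests added during the window
print(rate(window))           # average requests per second
print(rate(window) * 5 * 60)  # same number as the increase
```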
Other examples where you would want to use a counter metric would be to measure the number of orders in an e-commerce site, the number of bytes sent and received over a network interface, or the number of errors in an application. If it is a metric that will always go up, use a counter.
Below is an example of how to create and increase a counter metric using the Prometheus client library for Python:
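from prometheus_client import Counter
# Counter with a single label, api, to tell API endpoints apart.
api_requests_counter = Counter(
    'http_requests_total',
    'Total number of http api requests',
    ['api']
)
# Increase the counter for the add_product API by one.
api_requests_counter.labels(api='add_product').inc()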
Note that since counters can be reset to zero, you want to make sure that the backend you use to store and query your metrics will support that scenario and still provide accurate results in case of a counter restart.
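One common way to handle resets, sketched below, is to treat any decrease in the counter's value as a restart from zero and keep adding up the growth. The function name and the sample values are made up for illustration; a real backend may apply additional rules.

```python
def increase_with_resets(values):
    """Sum the growth of a counter, treating any drop as a restart from zero."""
    total = 0.0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The value went down, so assume the counter restarted at zero
            # and count everything accumulated since the restart.
            total += current
    return total


# Hypothetical counter values from consecutive scrapes; the process
# restarts between the third and fourth scrape.
values = [100, 130, 160, 20, 50]

print(increase_with_resets(values))  # growth with the reset accounted for
print(values[-1] - values[0])        # naive subtraction gives a negative number
```

A backend that simply subtracts the first value from the last one would report a negative increase here, which is the failure mode to watch for.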
Gauge metrics are used for measurements that can arbitrarily increase or decrease. This is the metric type you are likely most familiar with, since its value is meaningful on its own, with no additional processing. For example, metrics that measure temperature, CPU and memory usage, or the size of a queue are gauges.
For example, to measure the memory usage in a host, we could use a gauge metric like:
# HELP node_memory_used_bytes Total memory used in the node in bytes
# TYPE node_memory_used_bytes gauge
node_memory_used_bytes{hostname="host1.domain.com"} 943348382
The metric above indicates that the memory used in node host1.domain.com at the time of the measurement is around 900 megabytes. The value of the metric is meaningful without any additional calculation because it tells us how much memory is being consumed on that node.
Unlike with counters, the rate and increase functions don’t make sense with gauges. However, functions that compute the average, maximum, minimum, or percentiles for a specific series are often used with gauges. In Prometheus, the names of those functions are avg_over_time, max_over_time, min_over_time, and quantile_over_time. To compute the average of memory used on host1.domain.com in the last ten minutes, you could do this:
avg_over_time(node_memory_used_bytes{hostname="host1.domain.com"}[10m])
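Conceptually, this kind of function selects the samples of a series that fall inside the time window and averages their values. The sketch below shows the idea with hypothetical samples; the choice of a window that excludes its left edge is an assumption of this example.

```python
def avg_over_time(samples, now, window_seconds):
    """Average the values whose timestamps fall in (now - window, now]."""
    in_window = [
        value for timestamp, value in samples
        if now - window_seconds < timestamp <= now
    ]
    if not in_window:
        return None  # no samples in the window, so nothing to average
    return sum(in_window) / len(in_window)


# Hypothetical (timestamp_seconds, bytes) samples scraped every five minutes.
samples = [
    (0, 900_000_000),
    (300, 940_000_000),
    (600, 950_000_000),
    (900, 960_000_000),
]

result = avg_over_time(samples, now=900, window_seconds=600)
print(result)
```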
To create a gauge metric using the Prometheus client library for Python, you would do something like this:
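from prometheus_client import Gauge
# Gauge with a single label, hostname, to tell hosts apart.
memory_used = Gauge(
    'node_memory_used_bytes',
    'Total memory used in the node in bytes',
    ['hostname']
)
# Set the gauge to the current measurement.
memory_used.labels(hostname='host1.domain.com').set(943348382)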
Histogram metrics are useful to represent a distribution of measurements. They are often used to measure request duration or response size.
Histograms divide the entire range of measurements into a set of intervals—named buckets—and count how many measurements fall into each bucket.
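The counting step can be sketched as follows. The example assumes Prometheus-style buckets, which are cumulative and have inclusive upper bounds, plus a final +Inf bucket that catches every measurement. The function name, bucket bounds, and observations are hypothetical.

```python
import math


def build_histogram(observations, upper_bounds):
    """Return (cumulative bucket counts, sum, count) for the observations."""
    bounds = sorted(upper_bounds) + [math.inf]  # +Inf catches everything
    buckets = {bound: 0 for bound in bounds}
    for value in observations:
        for bound in bounds:
            if value <= bound:       # upper bounds are inclusive
                buckets[bound] += 1  # cumulative: count in every bucket that fits
    return buckets, sum(observations), len(observations)


# Hypothetical request durations in seconds.
observations = [0.02, 0.07, 0.1, 0.3, 0.3672, 1.8]
buckets, total, count = build_histogram(observations, [0.05, 0.1, 0.5, 1])

for bound, bucket_count in buckets.items():
    print(f"le={bound}: {bucket_count}")
print(total, count)
```

Because the buckets are cumulative, the count of the +Inf bucket always equals the total number of measurements.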
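A histogram metric includes a few items: a counter with the total number of measurements, exposed using the metric name with a _count suffix; a counter with the sum of the values of all measurements, exposed using the metric name with a _sum suffix; and the histogram buckets, exposed as counters using the metric name with a _bucket suffix and a le label indicating the bucket upper inclusive bound. Buckets in Prometheus are inclusive, that is, a bucket with an upper bound of N (i.e., le label) includes all data points with a value less than or equal to N.
For example, the histogram metric to measure the response time of the instance of the add_product API endpoint running on host1.domain.com could be represented as: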
# HELP http_request_duration_seconds Api requests response time in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_sum{api="add_product", instance="host1.domain.com"} 8953.332
http_request_duration_seconds_count{api="add_product", instance="host1.domain.com"} 27892
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0"} 0
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.01"} 0
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.025"} 8
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.05"} 1672
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.1"} 8954
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.25"} 14251
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.5"} 24101
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="1"} 26351
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="2.5"} 27534
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="5"} 27814
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="10"} 27881
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="25"} 27890
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="+Inf"} 27892
The example above includes the sum, the count, and 13 buckets (including the +Inf bucket). The sum and count can be used to compute the average of a measurement over time. In PromQL, the average duration for the last five minutes will be computed as follows:
rate(http_request_duration_seconds_sum{api="add_product", instance="host1.domain.com"}[5m]) / rate(http_request_duration_seconds_count{api="add_product", instance="host1.domain.com"}[5m])
The sum and count can also be used to compute averages across series. The following PromQL query would compute the average request duration in the last five minutes across all APIs and instances:
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
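These queries work because both rates are divided by the same number of seconds, so the time term cancels out and what remains is the growth of the sum divided by the growth of the count. A minimal sketch of that arithmetic, using the sum and count from the example above as the later scrape and hypothetical values for the earlier one:

```python
def average_between_scrapes(sum_start, count_start, sum_end, count_end):
    """Average observed value between two scrapes of _sum and _count."""
    observations = count_end - count_start
    if observations == 0:
        return None  # nothing was observed in the interval
    return (sum_end - sum_start) / observations


# The end values come from the example exposition; the start values
# (8900.0 and 27700) are hypothetical readings from an earlier scrape.
average = average_between_scrapes(8900.0, 27700, 8953.332, 27892)
print(average)  # average request duration, in seconds, for the interval
```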
With histograms, you can compute percentiles at query time for individual series as well as across series. In PromQL, we would use the histogram_quantile function. Prometheus uses quantiles instead of percentiles. They are essentially the same thing, but quantiles are represented on a scale of 0 to 1, while percentiles are represented on a scale of 0 to 100. To compute the 99th percentile (0.99 quantile) of response time for the add_product API running on host1.domain.com, you would use the following query:
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com"}[5m]))
One big advantage of histograms is that they can be aggregated. The following query returns the 99th percentile of response time across all APIs and instances:
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
In cloud-native environments, where there are typically many instances of the same component running, the ability to aggregate data across instances is key.
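To make the estimation and the aggregation concrete, here is a simplified sketch of how a quantile can be derived from cumulative buckets: find the bucket that contains the target rank, then interpolate linearly inside it. It is an approximation of the idea rather than a reimplementation of the PromQL function, and it works on raw cumulative counts instead of per-second rates. HOST1 holds the bucket counts from the example above, while HOST2 is a hypothetical second instance.

```python
import math


def histogram_quantile(q, bounds, cumulative_counts):
    """Estimate quantile q from bucket upper bounds and cumulative counts."""
    total = cumulative_counts[-1]  # the +Inf bucket holds the total count
    if total == 0:
        return None
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in zip(bounds, cumulative_counts):
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate inside +Inf
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return None


BOUNDS = [0, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 25, math.inf]
# Cumulative bucket counts from the example exposition.
HOST1 = [0, 0, 8, 1672, 8954, 14251, 24101, 26351, 27534, 27814, 27881, 27890, 27892]
# Hypothetical counts for a second instance with the same bucket bounds.
HOST2 = [0, 0, 2, 500, 3000, 6000, 9000, 9800, 9950, 9990, 10000, 10000, 10000]

p99_host1 = histogram_quantile(0.99, BOUNDS, HOST1)

# Aggregating is just adding the counts of buckets that share a bound,
# which is what grouping by the le label does.
merged = [a + b for a, b in zip(HOST1, HOST2)]
p99_all = histogram_quantile(0.99, BOUNDS, merged)

print(p99_host1, p99_all)
```

Both results land between 2.5 and 5 seconds because that is the bucket containing the 99th-percentile rank; the exact position inside the bucket is only an interpolated estimate.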
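Histograms have three main drawbacks:
First, buckets must be predefined, which requires some upfront design. If your buckets are poorly chosen, you may not be able to compute the percentiles you need, or you may consume resources unnecessarily. For example, if you have an API that always takes more than one second, buckets with an upper bound (le label) smaller than one second would be useless and just consume compute and storage resources on your monitoring backend. On the other hand, if 99.9% of your API requests take less than 50 milliseconds, having an initial bucket with an upper bound of 100 milliseconds will not allow you to measure the performance of the API accurately.
Second, histograms provide approximate percentiles, not exact ones. This is usually fine as long as the buckets are designed to give reasonable accuracy.
Third, percentiles are calculated on the server at query time, which can be expensive when there is a lot of data to process. In Prometheus, recording rules can precompute the required percentiles to mitigate this.
The following example shows how you can create a histogram metric with custom buckets using the Prometheus client library for Python: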
from prometheus_client import Histogram
# Histogram with custom bucket upper bounds, in seconds.
# The client library adds the +Inf bucket automatically.
api_request_duration = Histogram(
    name='http_request_duration_seconds',
    documentation='Api requests response time in seconds',
    labelnames=['api', 'instance'],
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 25)
)
# Record one request that took 0.3672 seconds.
api_request_duration.labels(
    api='add_product',
    instance='host1.domain.com'
).observe(0.3672)
Like histograms, summary metrics are useful to measure request duration and response sizes.
A summary metric includes these items: a counter with the total number of measurements, exposed using the metric name with a _count suffix; a counter with the sum of the values of all measurements, exposed using the metric name with a _sum suffix; and, optionally, a number of quantiles of measurements, exposed as gauges using the metric name with a quantile label. Since you don’t want those quantiles to be measured over the entire time an application has been running, Prometheus client libraries use streamed quantiles that are computed over a sliding time window (which is usually configurable).
For example, the summary metric to measure the response time of the instance of the add_product API endpoint running on host1.domain.com could be represented as:
# HELP http_request_duration_seconds Api requests response time in seconds
# TYPE http_request_duration_seconds summary
http_request_duration_seconds_sum{api="add_product", instance="host1.domain.com"} 8953.332
http_request_duration_seconds_count{api="add_product", instance="host1.domain.com"} 27892
http_request_duration_seconds{api="add_product", instance="host1.domain.com", quantile="0"}
http_request_duration_seconds{api="add_product", instance="host1.domain.com", quantile="0.5"} 0.232227334
http_request_duration_seconds{api="add_product", instance="host1.domain.com", quantile="0.90"} 0.821139321
http_request_duration_seconds{api="add_product", instance="host1.domain.com", quantile="0.95"} 1.528948804
http_request_duration_seconds{api="add_product", instance="host1.domain.com", quantile="0.99"} 2.829188272
http_request_duration_seconds{api="add_product", instance="host1.domain.com", quantile="1"} 34.283829292
The example above includes the sum and count as well as six quantiles. Quantile 0 is equivalent to the minimum value, and quantile 1 is equivalent to the maximum value. Quantile 0.5 is the median, and quantiles 0.90, 0.95, and 0.99 correspond to the 90th, 95th, and 99th percentile of the response time for the add_product API endpoint running on host1.domain.com.
Like histograms, summaries include sum and count that can be used to compute the average of a measurement over time and across time series.
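The following sketch shows the general shape of a client-side summary: a lifetime sum and count, plus quantiles computed only over the observations from a recent time window. It is deliberately naive, keeping every recent observation and sorting on demand, whereas real client libraries use streaming algorithms with bounded memory. The class name and the sample durations are hypothetical.

```python
import math
from collections import deque


class SlidingWindowSummary:
    """Naive summary: exact quantiles over the observations of a recent window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.observations = deque()  # (timestamp, value), oldest first
        self.sum = 0.0   # lifetime sum, never expires
        self.count = 0   # lifetime count, never expires

    def observe(self, value, now):
        self.sum += value
        self.count += 1
        self.observations.append((now, value))

    def quantile(self, q, now):
        # Drop observations that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.observations and self.observations[0][0] <= cutoff:
            self.observations.popleft()
        if not self.observations:
            return None
        values = sorted(value for _, value in self.observations)
        # Nearest rank: smallest value with at least a fraction q of the
        # data at or below it.
        index = max(0, math.ceil(q * len(values)) - 1)
        return values[index]


summary = SlidingWindowSummary(window_seconds=600)
# Hypothetical request durations, observed 200 seconds apart.
for step, duration in enumerate([0.2, 0.3, 0.25, 5.0, 0.22]):
    summary.observe(duration, now=step * 200)

print(summary.count)                    # all five observations
print(summary.quantile(0.5, now=800))   # median of the last ten minutes only
```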
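Summaries provide more accurate quantiles than histograms, but those quantiles have three main drawbacks:
First, computing quantiles is expensive on the client side, because the client library has to keep track of recent data points to calculate them. Client library implementations limit how much data they keep in memory, trading some accuracy for efficiency.
Second, the quantiles you want to query must be predefined by the client. Only quantiles that are already exposed as a metric can be returned by queries; there is no way to calculate other quantiles at query time.
Third, and most important, summary quantiles cannot be aggregated across multiple series. Imagine that in our example the add_product API endpoint was running on ten hosts sitting behind a load balancer. There is no aggregation function that we could use to compute the 99th percentile of the response time of the add_product API endpoint across all requests regardless of which host they hit. We could only see the 99th percentile for each individual host. The same applies if, instead of the 99th percentile of the response time for the add_product API endpoint, we wanted the 99th percentile of the response time across all API requests regardless of which endpoint they hit.
The code below creates a summary metric using the Prometheus client library for Python: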
from prometheus_client import Summary
# Summary with two labels; this exposes only the _sum and _count series.
api_request_duration = Summary(
    'http_request_duration_seconds',
    'Api requests response time in seconds',
    ['api', 'instance']
)
# Record one request that took 0.3672 seconds.
api_request_duration.labels(api='add_product', instance='host1.domain.com').observe(0.3672)
The code above does not define any quantile and would only produce sum and count metrics. The Prometheus client library for Python does not have support for quantiles in summary metrics.
In most cases, histograms are preferred since they are more flexible and allow for aggregated percentiles.
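A small numeric experiment illustrates why precomputed quantiles cannot be combined after the fact. The data below is hypothetical: one busy host where every request is fast, and one nearly idle host where half of the requests are slow. Averaging the two per-host 99th percentiles gives a number that has little to do with the 99th percentile of all requests.

```python
import math


def percentile(values, q):
    """Nearest-rank quantile of a list of raw observations."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(q * len(ordered)) - 1)]


# Hypothetical response times in seconds.
host_a = [0.1] * 1000           # busy host, every request is fast
host_b = [0.1] * 5 + [4.0] * 5  # nearly idle host, half its requests are slow

p99_a = percentile(host_a, 0.99)
p99_b = percentile(host_b, 0.99)
average_of_p99s = (p99_a + p99_b) / 2

# The real answer needs the raw observations (or mergeable bucket counts).
true_p99 = percentile(host_a + host_b, 0.99)

print(p99_a, p99_b, average_of_p99s, true_p99)
```

Histogram buckets avoid this problem because their counts can simply be added across hosts before the quantile is estimated.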
Summaries are useful in cases where percentiles are not needed and averages are enough, or when very accurate percentiles are required, for example, to meet contractual obligations for the performance of a critical system.
The table below summarizes the pros and cons of histograms and summaries.
In the first part of this blog post series on metrics, we’ve reviewed the four types of Prometheus metrics: counters, gauges, histograms, and summaries. In the next part of the series, we will dissect OpenTelemetry metrics.
If you're looking for a time-series database to store your metrics, check out Timescale. You will especially love it if you're using PostgreSQL. Timescale is a PostgreSQL extension that will give PostgreSQL the boost it needs to handle large volumes of metrics, keeping your writes and queries fast via automatic partitioning, query planner enhancements, improved materialized views, columnar compression, and much more.
Try Timescale today: create a free account on our platform.