Dec 13, 2024
If you are working with a database, especially with time-series data, then you have likely faced the challenge of handling high-cardinality data.
In particular, time-series high cardinality is a common problem in industrial IoT (e.g., manufacturing, oil & gas, utilities), as well as some monitoring and event data workloads.
High cardinality is also a topic that we (as developers of a time-series database) end up discussing a lot, and it continues to surprise us how much confusion there is around it. Given that, we wanted to share our experience with this problem.
Broadly defined, cardinality refers to the number of values in a set. Sometimes, the cardinality of your set is small (low cardinality), and other times, it can be large (high cardinality). For example, a bowl of M&Ms may contain quite a few (delicious) pieces, but the cardinality of that dataset is quite small: there are only six colors.
In the world of databases, cardinality refers to the number of unique values contained in a particular column or field.
However, with time-series data, things get a bit more complex.
Time-series data (or any other data that has a time element associated with it) tends to be paired with metadata (sometimes called “tags”) that describes that data. Often, that primary time-series data or the metadata is indexed for faster query performance so that you can quickly find the values that match all of the specified tags.
The cardinality of a time-series dataset is typically defined by the cross-product of the cardinality of each individual indexed column. So if there are six colors of M&Ms, but also five types of M&Ms (plain, peanut, almond, pretzel, and crispy), then our cardinality is now 6x5 = 30 total options for M&Ms. Having the right indexes would then allow us to efficiently find all blue, crispy M&Ms (which are objectively the best).
If you have multiple indexed columns, each with a large number of unique values, then the cardinality of that cross-product can get really large. That’s what software developers typically mean when they talk about a time-series dataset with “high cardinality.”
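To make the arithmetic concrete, here is a minimal Python sketch of that cross-product (the M&M sets and variable names are ours, purely for illustration):

```python
from itertools import product

# Distinct values of each "indexed" attribute in the M&M example.
colors = {"red", "orange", "yellow", "green", "blue", "brown"}
types = {"plain", "peanut", "almond", "pretzel", "crispy"}

print(len(colors), len(types))  # 6 5 -- the cardinality of each column on its own

# The cardinality of the combined, indexed dataset is the cross-product.
combinations = list(product(colors, types))
print(len(combinations))        # 30 -- i.e., 6 x 5
```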
Let’s look at an example.
Imagine an IoT scenario in which large, heavy pieces of equipment mine, crush, and sort rock in a quarry.
Say there are 10,000 pieces of equipment, each with 100 sensors, running 10 different firmware versions, spread across 100 sites:
timestamp           | temperature | mem_free | equipment_id | sensor_id | firmware_version | site_id | (lat, long)
--------------------+-------------+----------+--------------+-----------+------------------+---------+------------
2019-04-04 09:00:00 | 85.2        | 10.2     | 1            | 98        | 1.0              | 4       | (x, y)
2019-04-04 09:00:00 | 68.8        | 16.0     | 72           | 12        | 1.1              | 20      | (x1, y1)
2019-04-04 09:00:00 | 100.0       | 0.0      | 34           | 58        | 2.1              | 55      | (x2, y2)
2019-04-04 09:00:00 | 84.8        | 9.8      | 12           | 75        | 1.4              | 81      | (x3, y3)
2019-04-04 09:00:00 | 68.7        | 16.0     | 89           | 4         | 2.1              | 13      | (x4, y4)
...                 |             |          |              |           |                  |         |
The maximum cardinality of this dataset then becomes one billion [10,000 x 100 x 10 x 100].
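As a quick sanity check (a sketch with illustrative values only), you can compare that theoretical maximum against the cardinality you actually observe by counting distinct combinations of the indexed tag columns:

```python
# A few rows of the indexed tag columns from the table above
# (equipment_id, sensor_id, firmware_version, site_id) -- illustrative values.
rows = [
    (1, 98, "1.0", 4),
    (72, 12, "1.1", 20),
    (34, 58, "2.1", 55),
    (12, 75, "1.4", 81),
    (89, 4, "2.1", 13),
]

# Observed cardinality: distinct tag combinations actually present so far.
print(len(set(rows)))                       # 5

# Theoretical maximum: cross-product of each tag column's value space.
print(f"{10_000 * 100 * 10 * 100:,}")       # 1,000,000,000
```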
Now, imagine that the equipment can move as well, and we’d like to store the precise GPS location (lat, long) and use that as indexed metadata to query by. Because (lat, long) is a continuous field (as opposed to a discrete field like equipment_id), by indexing on location, the max cardinality of this dataset is now infinitely large (unbounded).
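The difference is easy to see with a quick simulation (a sketch with made-up coordinates, not real telemetry): when an indexed field is continuous, almost every new reading is a value the database has never seen before, so the number of distinct entries the index must track grows roughly as fast as the data itself.

```python
import random

random.seed(42)

# Simulated GPS fixes from equipment moving around a quarry (illustrative only).
readings = [
    (round(random.uniform(46.0, 46.1), 6), round(random.uniform(7.0, 7.1), 6))
    for _ in range(100_000)
]

# With six decimal places of precision, essentially every fix is unique,
# so an index on (lat, long) gains a new entry for almost every row.
print(len(readings), len(set(readings)))  # ~100,000 readings, ~100,000 distinct locations
```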
High cardinality in data sets is more than just a technical challenge—it's often a sign of rich, detailed information that can drive deeper insights and more precise analysis. As organizations collect more granular data, cardinality naturally increases because each unique data point helps paint a more complete picture of the system being monitored.
Think of our previous industrial IoT example: at its simplest level, we might only track which site a piece of equipment is operating in, resulting in relatively low cardinality with just 100 unique locations across all sites.
However, as we increase the precision and scope of our monitoring, the cardinality grows dramatically. Each of the 10,000 pieces of equipment has 100 distinct sensors, monitoring everything from vibration patterns to temperature readings while running one of 10 different firmware versions.
This means that at any given moment, we're tracking millions of unique combinations of equipment IDs, sensor readings, and firmware configurations across all sites. This higher cardinality translates directly into more precise equipment performance analysis, better maintenance scheduling, and more accurate prediction of potential failures before they occur.
This relationship between detail and cardinality extends across virtually every domain. As organizations invest in more sophisticated sensor networks, deploy more detailed tracking systems, or capture more granular user behavior data, they inevitably increase the cardinality of their data sets. While this presents certain technical challenges, it's a natural consequence of building more sophisticated and capable systems.
The difficulties of managing high-cardinality data become apparent when we look at common database operations. Full table scans, which require examining every row in a table, become particularly resource-intensive with high-cardinality data. Since each value is unique, the database can't take shortcuts–it must individually process each row, consuming significant memory and computational resources.
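You can see this behavior in any SQL database's query planner. Here is a small sketch using Python's built-in sqlite3 module (the schema mirrors our quarry example, and the exact plan wording varies by SQLite version), where filtering on an un-indexed tag column forces a scan of the whole table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        ts TEXT, temperature REAL, mem_free REAL,
        equipment_id INTEGER, sensor_id INTEGER,
        firmware_version TEXT, site_id INTEGER
    )
""")

# No index exists on equipment_id, so the planner has no choice but to
# examine every row: the plan's detail column reports a full scan of readings.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM readings WHERE equipment_id = 42"
).fetchall()
print(plan)
```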
Join operations present an even more daunting challenge. When joining tables with high cardinality, the number of potential combinations multiplies dramatically.
Let's return to our quarry IoT example: imagine joining a table of equipment sensor data (millions of unique sensor readings across one million possible combinations of equipment and sensor types) with a table of equipment maintenance records (hundreds of thousands of unique maintenance events across 10,000 pieces of equipment, each running different firmware versions).
A simple query to correlate sensor readings with maintenance events would need to evaluate hundreds of billions of potential combinations—analyzing every sensor reading against every maintenance event for each piece of equipment across different firmware versions. This scale of computation could overwhelm typical database hardware, potentially taking hours or even days to complete, making real-time analysis virtually impossible without proper optimization.
For instance, if you're trying to correlate equipment locations with maintenance events over a year, you might be attempting to join a table with 10 million unique location readings against a table with 100,000 unique maintenance events. The resulting operation would need to evaluate a trillion potential combinations, a task that could take hours or even days on standard hardware.
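A back-of-the-envelope sketch (with illustrative row counts, not measured data) shows why the join key and its index matter so much here:

```python
# Illustrative row counts, loosely matching the example above.
location_readings  = 10_000_000   # unique location readings over a year
maintenance_events = 100_000      # unique maintenance events
equipment_count    = 10_000       # pieces of equipment

# Worst case: a join with no usable key or index must consider every pairing.
print(f"{location_readings * maintenance_events:,}")   # 1,000,000,000,000 candidate pairs

# Joining on an indexed equipment_id only pairs rows from the *same* machine,
# which cuts the candidate pairs by orders of magnitude.
per_machine = (location_readings // equipment_count) * (maintenance_events // equipment_count)
print(f"{per_machine * equipment_count:,}")            # 100,000,000 candidate pairs
```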
Fortunately, modern database systems offer solutions to these challenges through sophisticated indexing strategies. Indexing helps manage high-cardinality data by organizing it into more manageable subsets, allowing queries to target specific portions of the data rather than scanning entire tables.
This approach is particularly effective when you need to query specific time ranges or value ranges, as it enables the database to quickly locate and process only the relevant data.
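Continuing the earlier sqlite3 sketch (again, illustrative only; a production time-series database layers further mechanisms such as time partitioning on top of this idea), a composite index on (equipment_id, ts) lets the planner jump straight to one machine's rows within a time range instead of scanning everything:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        ts TEXT, temperature REAL,
        equipment_id INTEGER, sensor_id INTEGER, site_id INTEGER
    )
""")

# A composite index on (equipment_id, ts): queries that filter on a machine
# and a time range can now touch only the relevant slice of the data.
conn.execute("CREATE INDEX idx_readings_equipment_ts ON readings (equipment_id, ts)")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM readings
    WHERE equipment_id = 42
      AND ts BETWEEN '2019-04-04 00:00:00' AND '2019-04-05 00:00:00'
""").fetchall()
print(plan)  # the detail column now reports a search using idx_readings_equipment_ts
```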
Check out our detailed comparison of database indexing strategies to learn more about how different database systems handle these challenges.
High-cardinality data, characterized by large numbers of unique values, represents both an opportunity and a challenge for modern data systems. While it enables more precise analysis and deeper insights, it also requires careful consideration of how to manage and query the data efficiently. The key lies in understanding that high cardinality isn't inherently problematic—it's a natural result of collecting more detailed, valuable data.
The challenges of high-cardinality data can be effectively managed with the right tools and approaches. Timescale's state-of-the-art indexing solutions are specifically designed to handle high-cardinality time-series data and real-time analytics workloads, enabling you to maintain both performance and precision in your data operations.
Ready to see how Timescale can help you manage your high-cardinality data? Start your free trial today.