Dec 13, 2024
Time-series data, IoT sensor readings, user behavior logs—these are just a few examples of the data streams that modern systems must process. What they have in common is a tendency toward high cardinality, which presents unique challenges for data storage and analysis. As organizations increasingly rely on data-driven decisions, understanding how different databases handle high-cardinality data is vital for building efficient and scalable systems.
In this article, we'll explore the challenges presented by high-cardinality data, examine various database tools designed to handle it, and compare approaches to help you make informed decisions about your data architecture.
High cardinality refers to the number of unique elements in a dataset, and it's a concept that becomes particularly tangible when we look at real-world examples. Imagine a system tracking user interactions on a popular website—each user might have a unique identifier, each session generates a unique ID, and every interaction creates a unique event ID. In large-scale applications, these unique values can quickly number in the millions or even billions.
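To make this concrete, here is a minimal SQL sketch of how you might measure the cardinality of such columns. The events table and its columns are hypothetical, purely for illustration:

-- Hypothetical table of user interactions on a website.
CREATE TABLE events (
    event_id   bigint,
    session_id bigint,
    user_id    bigint,
    event_time timestamptz
);

-- The cardinality of a column is simply its number of distinct values.
SELECT
    COUNT(DISTINCT user_id)    AS user_cardinality,
    COUNT(DISTINCT session_id) AS session_cardinality,
    COUNT(DISTINCT event_id)   AS event_cardinality
FROM events;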
This abundance of unique values creates significant challenges for database systems. When performing joins between tables with high-cardinality columns, the number of potential combinations the database must consider multiplies rapidly. For instance, joining user interaction data with session data might require matching millions of unique user IDs with millions of unique session IDs. The resulting operation can quickly overwhelm system resources, as the database must maintain and process these massive sets of unique combinations.
Performance degradation becomes particularly acute during operations that require full table scans. When a database needs to analyze or aggregate data across high-cardinality columns, it must maintain distinct counters or aggregates for each unique value in memory. This can rapidly exhaust available memory resources, leading to slower query execution times or, in extreme cases, system failures.
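As a rough illustration (reusing the hypothetical events table from the sketch above and adding a hypothetical sessions table), a query like the following forces the database to match millions of unique IDs and hold one aggregate per distinct user while it runs:

-- Hypothetical sessions table to join against.
CREATE TABLE sessions (
    session_id bigint,
    user_id    bigint,
    started_at timestamptz
);

-- Join on a high-cardinality key and aggregate per unique value.
-- The database must keep one group (and one distinct-count state) per
-- unique user_id in memory, which is what strains resources at scale.
SELECT e.user_id, COUNT(DISTINCT s.session_id) AS sessions
FROM events e
JOIN sessions s ON s.user_id = e.user_id
GROUP BY e.user_id;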
Read this article to learn more about high cardinality.
Given how common high-cardinality datasets are within time series, let’s take a look at how two time-series databases, InfluxDB and TimescaleDB, handle this issue.
InfluxDB is a NoSQL database whose creators chose to rebuild everything from scratch. In contrast, TimescaleDB is a SQL database whose creators (namely, your authors) chose to embrace and build on top of PostgreSQL and its proven data structures, then further extend it for time series, events, and real-time analytics. (By the way, with the right extensions, it can also propel your AI application development.)
First, here is a comparison of how the two databases perform on inserts as the dataset's cardinality increases.
For the following comparison, we used this setup:
(1) TimescaleDB schema: Table cpu(time timestamp, tags_id integer, usage_user double, usage_system double, usage_idle double, usage_nice double, usage_iowait double, usage_irq double, usage_softirq double, usage_steal double, usage_guest double, usage_guest_nice double, additional_tags jsonb); Index on (tags_id, time) and (time, tags_id); Table tags(id integer, hostname text, region text, datacenter text, rack text, os text, arch text, team text, service text, service_version text, service_environment text); Unique index on all the columns together
(2) InfluxDB schema: Field Keys(usage_guest integer, usage_guest_nice integer, usage_idle integer, usage_iowait integer, usage_irq integer, usage_nice integer, usage_softirq integer, usage_steal integer, usage_system integer, usage_user integer), Tag Keys(arch, datacenter, hostname, os, rack, region, service, service_environment, service_version, team)
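For reference, the TimescaleDB setup above roughly corresponds to DDL like the following sketch. It is approximate (the exact benchmark definitions may differ), and create_hypertable is what turns the cpu table into a time-partitioned hypertable:

-- Approximate DDL for the benchmark schema described above.
CREATE TABLE tags (
    id integer,
    hostname text, region text, datacenter text, rack text, os text,
    arch text, team text, service text, service_version text,
    service_environment text
);
CREATE UNIQUE INDEX ON tags (id, hostname, region, datacenter, rack, os,
    arch, team, service, service_version, service_environment);

CREATE TABLE cpu (
    time timestamp NOT NULL,
    tags_id integer,
    usage_user double precision, usage_system double precision,
    usage_idle double precision, usage_nice double precision,
    usage_iowait double precision, usage_irq double precision,
    usage_softirq double precision, usage_steal double precision,
    usage_guest double precision, usage_guest_nice double precision,
    additional_tags jsonb
);

-- Partition cpu by time and add the two secondary indexes.
SELECT create_hypertable('cpu', 'time');
CREATE INDEX ON cpu (tags_id, time DESC);
CREATE INDEX ON cpu (time DESC, tags_id);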
Note: a more detailed overall comparison of these two databases can be found here.
As you can see, at low cardinality both databases are comparable (although TimescaleDB still outperforms InfluxDB by about 30%). But as cardinality increases, the difference is remarkable: TimescaleDB's insert performance degrades far more slowly than InfluxDB's, which falls off precipitously. At high cardinalities, TimescaleDB outperforms InfluxDB by over 11x.
These results may not be surprising to some, as high cardinality is a well-known weakness for InfluxDB (source: GitHub, Forums).
But why does this happen? Let's take a closer look at how these two databases are designed.
We can trace the difference in high-cardinality performance to fundamentally different engineering decisions in InfluxDB vs. TimescaleDB.
Since high cardinality has been a well-known challenge for InfluxDB, their team has been working on something they call the “Time Series Index” (TSI) to address this problem.
Consistent with their approach in other areas, the InfluxDB TSI is a home-grown, log-structured merge tree-based system composed of various data structures, including hashmaps and bitsets. These include an in-memory log (“LogFile”) that is periodically flushed to disk when it exceeds a threshold (5 MB) and compacted into an on-disk memory-mapped index (“IndexFile”), as well as a file (“SeriesFile”) that contains a set of all series keys across the entire database. (Described here in their documentation.)
The performance of the TSI relies on the interactions of all of these data structures. However, because the TSI is custom-built, it is difficult to reason about how it performs under various high-cardinality workloads.
The design decisions behind the TSI also lead to a few limitations with performance implications:
(1) The index covers tags but not fields. If you want to filter on a field value (e.g., SELECT * FROM sensor_data WHERE mem_free = 0), one could not do better than a full linear scan (i.e., O(n) time) to identify the relevant data points.

(2) The index supports only equality lookups. For a query involving a range of values (e.g., SELECT * FROM sensor_data WHERE temperature > 90), one would again have to fully scan the entire dataset.

In contrast, TimescaleDB is a relational database that relies on a proven data structure for indexing data: the B-tree. This decision allows it to scale to high cardinalities.
First, TimescaleDB partitions your data by time, with one B-tree mapping time segments to the appropriate partition (“chunk”). All of this partitioning happens behind the scenes and is hidden from the user, who is able to access a virtual table (“hypertable”) that spans all of their data across all partitions.
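As a minimal sketch, here is what that looks like for a hypothetical sensor_data table (which we'll reuse in the examples below):

-- Hypothetical sensor_data table, reused in the examples that follow.
CREATE TABLE sensor_data (
    time             timestamptz NOT NULL,
    equipment_id     text,
    sensor_id        text,
    firmware_version text,
    site_id          text,
    temperature      double precision,
    mem_free         bigint,
    additional_tags  jsonb
);

-- Convert the plain table into a hypertable partitioned by time.
-- Chunks are created and managed behind the scenes; queries simply
-- target the sensor_data hypertable.
SELECT create_hypertable('sensor_data', 'time');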
Next, TimescaleDB allows for the creation of multiple indexes across your dataset (e.g., for equipment_id, sensor_id, firmware_version, site_id). These indexes are then created on every chunk, by default, in the form of a B-tree.
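Sticking with the hypothetical sensor_data table from the sketch above, these secondary indexes are created with standard PostgreSQL DDL, and TimescaleDB propagates them to each chunk:

-- B-tree indexes on high-cardinality columns. TimescaleDB automatically
-- creates the corresponding index on every chunk of the hypertable.
CREATE INDEX ON sensor_data (equipment_id, time DESC);
CREATE INDEX ON sensor_data (sensor_id, time DESC);
CREATE INDEX ON sensor_data (firmware_version, time DESC);
CREATE INDEX ON sensor_data (site_id, time DESC);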
One can also create indexes using any of the built-in PostgreSQL index types: Hash, GiST, SP-GiST, GIN, and BRIN. You can read this article to learn more about indexes and how to use them to optimize your PostgreSQL database performance.
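For instance (again on the hypothetical sensor_data table), a GIN index fits containment queries on a jsonb column, while a BRIN index is a compact option for large, append-mostly, time-ordered data:

-- GIN index for containment queries on the jsonb column of extra labels.
CREATE INDEX ON sensor_data USING GIN (additional_tags);

-- BRIN index: a very small index over block ranges, a good fit for
-- append-mostly, time-ordered data.
CREATE INDEX ON sensor_data USING BRIN (time);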
This approach has a few benefits for high-cardinality datasets. Notably, B-tree indexes support a rich set of predicates, including <, <=, =, >=, >, BETWEEN, IN, IS NULL, and IS NOT NULL. Our example queries from above (SELECT * FROM sensor_data WHERE mem_free = 0 and SELECT * FROM sensor_data WHERE temperature > 90) will run in logarithmic, or O(log n), time.
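You can sanity-check this with EXPLAIN. Assuming you also add B-tree indexes on the filtered columns of the hypothetical sensor_data table, the plan should typically show index scans on each chunk rather than sequential scans:

-- Hypothetical indexes on the filtered columns.
CREATE INDEX ON sensor_data (mem_free);
CREATE INDEX ON sensor_data (temperature);

-- Inspect the query plans; with selective predicates these should use
-- the indexes instead of scanning every row.
EXPLAIN SELECT * FROM sensor_data WHERE mem_free = 0;
EXPLAIN SELECT * FROM sensor_data WHERE temperature > 90;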
The challenges posed by high-cardinality data in modern database systems demand sophisticated indexing solutions to overcome the inherent obstacles of join operations and full table scans. Both InfluxDB and Timescale have unique strategies for managing high-cardinality data effectively.
Timescale's approach leverages the power of B-tree data structures, providing a robust foundation for handling high-cardinality datasets. This structure not only enables superior query performance but also offers the flexibility needed for diverse indexing requirements. The B-tree architecture allows for efficient range queries and point lookups, making it particularly well-suited for time-series applications where both historical analysis and real-time data access are crucial.
Start your free trial today and see how Timescale can transform your high-cardinality data management.