Nov 22, 2024
Posted by Sarah Conway
As a PostgreSQL enthusiast, it’s been very interesting to join Timescale and learn about TimescaleDB, a PostgreSQL extension with numerous solutions for handling large volumes of data. While Timescale has excellent documentation, having a quick and easy getting started guide is always nice.
Scaling PostgreSQL via partitioning is a challenge for many PostgreSQL users, but TimescaleDB has a core feature to help solve it. Here’s a straightforward breakdown of hypertables, a great feature designed to significantly improve query performance on massive volumes of data. A benchmark vs. vanilla PostgreSQL is also included.
(Looking for more details? Check out Timescale’s official docs on hypertables.)
Generally speaking, hypertables are designed to make working with time-series data easy by automatically partitioning regular PostgreSQL tables into smaller data partitions, or chunks. There are a ton of different examples of time-series data, including IoT applications, weather data, financial data analysis, and system monitoring.
Simply put: time-series data is any data collected at regular or irregular intervals and associated with timestamps.
In this blog post, we’ll set up a basic IoT sensor dataset and compare the differences in query run time between regular PostgreSQL tables and hypertables.
For real-time dashboards and alerts, this kind of data would need to be collected frequently; efficient querying is key for optimal performance and timely rendering of the results. This is a perfect example use case for hypertables. Why? They enable you to do the following:
1. Handle high-frequency inserts efficiently
2. Optimize queries for time-based data, and
3. Ensure fast query performance, especially at scale
So, how do you actually use them?
First, let's create a hypertable to store our sensor data:
CREATE TABLE iot_sensor_data (
sensor_id INTEGER NOT NULL,
ts TIMESTAMPTZ NOT NULL,
temperature DOUBLE PRECISION,
humidity DOUBLE PRECISION,
battery_level DOUBLE PRECISION,
PRIMARY KEY (sensor_id, ts)
);
Now, let’s convert that to a hypertable.
SELECT create_hypertable('iot_sensor_data', 'ts');
This creates a hypertable optimized for time-series data that is partitioned on the ts column. The TimescaleDB extension automatically manages this for you.
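A quick aside: if the default seven-day chunk interval doesn't suit your ingest rate, create_hypertable also accepts a chunk_time_interval option. A minimal sketch:
-- Optional: use one-day chunks instead of the default seven days
SELECT create_hypertable('iot_sensor_data', 'ts',
    chunk_time_interval => INTERVAL '1 day');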
Here's how you might insert data into your hypertable:
INSERT INTO iot_sensor_data (sensor_id, ts, temperature, humidity, battery_level)
VALUES
(1, '2023-01-01 00:00:00+00', 22.5, 60.2, 85.0),
(1, '2023-01-01 00:01:00+00', 23.1, 59.8, 84.7),
(1, '2023-01-01 00:02:00+00', 21.9, 61.0, 84.3),
(1, '2023-01-01 00:03:00+00', 22.8, 60.5, 84.1),
(1, '2023-01-01 00:04:00+00', 22.0, 61.3, 83.9);
In practice, you'd have a script or application inserting data at regular intervals (such as every minute or as data arrives).
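As a rough sketch of what that ingest path might look like in plain SQL (the sensor ID and readings here are illustrative), each incoming reading would simply be stamped with the current time:
-- What an ingesting application might run as each reading arrives
INSERT INTO iot_sensor_data (sensor_id, ts, temperature, humidity, battery_level)
VALUES (1, now(), 22.4, 60.1, 83.5);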
One of my favorite features of hypertables is that they automatically partition time-series data for you without a lot of extra work on your part. This automatic partitioning means you can work with hypertables as if they were just normal PostgreSQL tables. Behind the scenes, they do the heavy lifting of partitioning your PostgreSQL tables and speeding up your queries and data ingestion rate while keeping up with your data as it scales.
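If you'd like to see that partitioning in action, the extension ships with show_chunks() and the timescaledb_information.chunks view for inspecting the chunks behind a hypertable:
-- List the chunks backing the hypertable
SELECT show_chunks('iot_sensor_data');

-- Or inspect each chunk's time range
SELECT chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'iot_sensor_data';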
As part of this functionality, when you create a hypertable, indexes are automatically created along with it. By default, you’ll notice a descending index on time and an index on the space parameter and time (if there’s a space partition created). Let’s check it out.
List the associated indexes:
SELECT
indexname AS index_name,
indexdef AS index_definition
FROM
pg_indexes
WHERE
tablename = 'iot_sensor_data';
And you should see the following results:
[
{
"index_name": "iot_sensor_data_pkey",
"index_definition": "CREATE UNIQUE INDEX iot_sensor_data_pkey ON public.iot_sensor_data USING btree (sensor_id, ts)"
},
{
"index_name": "iot_sensor_data_ts_idx",
"index_definition": "CREATE INDEX iot_sensor_data_ts_idx ON public.iot_sensor_data USING btree (ts DESC)"
}
]
It is possible, of course, to disable the default behavior of creating indexes automatically; just set create_default_indexes to false when creating the hypertable, like so:
SELECT create_hypertable('iot_sensor_data', 'ts', create_default_indexes => false);
Additionally, take note that if you'd like to configure unique indexes, you must reference all of the table's partitioning columns. This is required because each chunk can only enforce uniqueness locally, so the partitioning column(s) must be part of the index for the constraint to hold across the whole hypertable.
For example, to create a unique index on both sensor_id and ts, you would run:
CREATE UNIQUE INDEX ON iot_sensor_data (sensor_id, ts);
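Conversely, a unique index that omits the time partitioning column will be rejected. Trying it out should produce an error along these lines (the exact wording may vary by TimescaleDB version):
CREATE UNIQUE INDEX ON iot_sensor_data (sensor_id);
-- ERROR:  cannot create a unique index without the column "ts" (used in partitioning)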
Here's a simple benchmark you can run to compare query performance:
Generate a time series with one reading per minute for each of 100 devices across the month of September. This will insert 4,320,100 rows (100 sensors * 60 minutes/hour * 24 hours/day * 30 days, plus 100 rows because the series endpoint, midnight on October 1, is inclusive).
INSERT INTO iot_sensor_data (sensor_id, ts, temperature, humidity, battery_level)
SELECT
sensor_id,
ts,
20 + random() * 10, -- Temperature between 20 and 30°C
50 + random() * 10, -- Humidity between 50% and 60%
20 + random() * 80 -- Battery level between 20% and 100%
FROM generate_series(1, 100) AS sensor_id, -- Generate 100 distinct sensor IDs
generate_series(
'2024-09-01 00:00:00+00'::timestamptz,
'2024-10-01 00:00:00+00',
'1 minute'::interval
) AS ts;
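As a sanity check, you can confirm the row count matches the arithmetic above:
SELECT count(*) FROM iot_sensor_data;
-- Expected: 4320100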
Now let’s create a regular table with the same structure as our hypertable:
CREATE TABLE regular_iot_sensor_data (
sensor_id INTEGER NOT NULL,
ts TIMESTAMPTZ NOT NULL,
temperature DOUBLE PRECISION,
humidity DOUBLE PRECISION,
battery_level DOUBLE PRECISION,
PRIMARY KEY (sensor_id, ts)
);
Let’s reproduce the same additional index that exists on our hypertable on this regular table:
CREATE INDEX regular_iot_sensor_data_ts_idx
ON public.regular_iot_sensor_data USING btree (ts DESC);
Copy the records we just inserted into the hypertable over to regular_iot_sensor_data:
INSERT INTO regular_iot_sensor_data (sensor_id, ts, temperature, humidity, battery_level)
SELECT * FROM iot_sensor_data;
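Before comparing, it's worth running ANALYZE on both tables so the planner has fresh statistics (this is ordinary PostgreSQL housekeeping, not a Timescale feature):
ANALYZE iot_sensor_data;
ANALYZE regular_iot_sensor_data;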
Then, check the results of EXPLAIN ANALYZE against each table, starting with the hypertable:
EXPLAIN ANALYZE
SELECT AVG(temperature)
FROM iot_sensor_data
WHERE ts BETWEEN '2024-09-01' AND '2024-09-05';
Which returns:
Finalize Aggregate (cost=10112.93..10112.94 rows=1 width=8) (actual time=50.737..58.663 rows=1 loops=1)
-> Gather (cost=1112.49..10112.92 rows=4 width=32) (actual time=50.650..58.650 rows=4 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Append (cost=112.49..9112.52 rows=2 width=32) (actual time=31.895..45.882 rows=1 loops=3)
-> Partial Aggregate (cost=112.49..112.50 rows=1 width=32) (actual time=1.865..1.866 rows=1 loops=1)
-> Parallel Bitmap Heap Scan on _hyper_52_268_chunk (cost=2.55..112.34 rows=59 width=8) (actual time=0.444..1.848 rows=100 loops=1)
Recheck Cond: ((ts >= '2024-09-01 00:00:00+00'::timestamp with time zone) AND (ts <= '2024-09-05 00:00:00+00'::timestamp with time zone))
-> Bitmap Index Scan on _hyper_52_268_chunk_iot_sensor_data_ts_idx (cost=0.00..2.52 rows=100 width=0) (actual time=0.413..0.413 rows=100 loops=1)
Index Cond: ((ts >= '2024-09-01 00:00:00+00'::timestamp with time zone) AND (ts <= '2024-09-05 00:00:00+00'::timestamp with time zone))
-> Partial Aggregate (cost=9000.00..9000.01 rows=1 width=32) (actual time=45.256..45.256 rows=1 loops=3)
-> Parallel Seq Scan on _hyper_52_267_chunk (cost=0.00..8400.00 rows=240000 width=8) (actual time=0.018..32.989 rows=192000 loops=3)
Filter: ((ts >= '2024-09-01 00:00:00+00'::timestamp with time zone) AND (ts <= '2024-09-05 00:00:00+00'::timestamp with time zone))
Planning Time: 0.409 ms
Execution Time: 58.710 ms
Notice how the planner touches only the chunks whose time ranges overlap the WHERE clause rather than scanning the entire table. Then do the same against the regular table:
EXPLAIN ANALYZE
SELECT AVG(temperature)
FROM regular_iot_sensor_data
WHERE ts BETWEEN '2024-09-01' AND '2024-09-05';
That returns:
Finalize Aggregate (cost=17131.34..17131.35 rows=1 width=8) (actual time=135.917..145.755 rows=1 loops=1)
-> Gather (cost=17131.13..17131.34 rows=2 width=32) (actual time=135.724..145.744 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=16131.13..16131.14 rows=1 width=32) (actual time=130.502..130.503 rows=1 loops=3)
-> Parallel Index Scan using regular_iot_sensor_data_ts_idx on regular_iot_sensor_data (cost=0.43..15581.22 rows=219961 width=8) (actual time=0.039..107.362 rows=192033 loops=3)
Index Cond: ((ts >= '2024-09-01 00:00:00+00'::timestamp with time zone) AND (ts <= '2024-09-05 00:00:00+00'::timestamp with time zone))
Planning Time: 0.348 ms
Execution Time: 145.799 ms
As your datasets grow, you'll notice even more benefits to query performance. Already, we can see quite a difference: about a 148.3% increase in execution time for the regular table, from 58.710 ms on the hypertable to 145.799 ms on the regular table.
For fun, I also ran a simple SELECT * FROM query (no limit) against both the iot_sensor_data and regular_iot_sensor_data tables to compare query execution times. The iot_sensor_data table returned results in 1762.9 s vs. 1940.6 s for the regular table. That's about 10% slower for the regular table, just retrieving results, without even performing any calculations.
Hypertables are great for storing and querying large volumes of time-stamped data, whether you’re just looking for a performance boost or hoping to enable real-time analytics and monitoring at scale.
We covered a number of useful features that are unlocked when using hypertables in TimescaleDB, and there's even more to explore in the official docs to take full advantage of everything hypertables provide.
Look out for more articles like this to see other real-world examples of TimescaleDB features like continuous aggregates and compression.
Interested in trying out hypertables yourself? If you're self-hosting, install the open-source TimescaleDB extension; otherwise, you can sign up for the fully managed Timescale Cloud solution, free for 30 days (no credit card required).