Dec 05, 2024
Posted by
Jônatas Davi Paganini
If you've been working with PostgreSQL, you've probably seen memes advocating for denormalized counters instead of counting related records on demand. The debate usually looks like this:
-- The "don't do this" approach: counting related records on demand
SELECT COUNT(*) FROM post_likes WHERE post_id = $1;
-- The "do this instead" approach: maintaining a denormalized counter
SELECT likes_count FROM posts WHERE post_id = $1;
Let's break down these approaches. In the first approach, we calculate the like count by scanning the post_likes table each time we need the number. In the second approach, we maintain a pre-calculated counter in the posts table, which we update whenever someone likes or unlikes a post.
The denormalized counter approach is often recommended for OLTP (online transaction processing) workloads because it trades write overhead for read performance. Instead of executing a potentially expensive COUNT query that needs to scan the entire post_likes table, we can quickly fetch a pre-calculated number.
This is particularly valuable in social media applications, where like counts are frequently displayed but rarely updated—you're showing like counts on posts much more frequently than users are actually liking posts.
However, when we enter the world of time-series data and high-frequency updates, this conventional wisdom needs a second look. Let me share an example that made me reconsider this approach while working with a PostgreSQL database optimized for time series via the TimescaleDB extension. While denormalized counters might make sense for traditional OLTP workloads, time-series data in TimescaleDB calls for a different approach to data modeling.
Let's start with a common scenario: tracking post likes in a social media application. The traditional data denormalization approach might look like this:
-- Traditional table structure
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
content TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
likes_count INTEGER DEFAULT 0
);
CREATE TABLE post_likes (
post_id INTEGER REFERENCES posts(id),
user_id INTEGER,
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (post_id, user_id)
);
With this structure, every like operation requires two updates:
-- When a user likes a post
BEGIN;
INSERT INTO post_likes (post_id, user_id) VALUES (1, 123);
UPDATE posts SET likes_count = likes_count + 1 WHERE id = 1;
COMMIT;
While this might seem efficient at first glance, it introduces several problems:
1. VACUUM overhead: Every update to likes_count creates a new version of the row in the posts table. PostgreSQL's MVCC (multiversion concurrency control) means old versions aren't immediately removed, leading to table bloat that you can check for:
-- Check bloat in posts table
SELECT schemaname, relname, n_dead_tup, n_live_tup, last_vacuum
FROM pg_stat_user_tables
WHERE relname = 'posts';
2. Transaction contention: Multiple concurrent likes on the same post create lock contention on the posts row.
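To make the contention concrete, here's a minimal sketch (the session labels are illustrative) of two clients liking the same post at the same time; the second UPDATE has to wait for the first session's row lock:
-- Session A
BEGIN;
UPDATE posts SET likes_count = likes_count + 1 WHERE id = 1; -- takes a row lock on post 1
-- Session B, started before session A commits
BEGIN;
UPDATE posts SET likes_count = likes_count + 1 WHERE id = 1; -- blocks until session A commits or rolls back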
This is one of those cases where TimescaleDB can give PostgreSQL a helping hand. Instead of maintaining a running counter, let's leverage TimescaleDB's strengths. We’ll start by using a hypertable to partition the data automatically by the time column.
-- Create a hypertable for post_likes
CREATE TABLE post_likes (
post_id INTEGER,
user_id INTEGER,
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (post_id, user_id, created_at)
);
SELECT create_hypertable('post_likes', by_range('created_at', INTERVAL '1 month'));
The TimescaleDB extension will automatically create new child tables (chunks) and split the data into several partitions, in this case one per month. The key performance advantage of adopting hypertables comes from this architecture: it's a paradigm shift in database partitioning because each partition stores its own table statistics and indexes, so retention policies can drop entire partitions quickly, with no extra work for VACUUM or updates.
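As a quick illustration (the three-month interval is just an example), you can list the chunks behind the hypertable and drop old ones as whole partitions:
-- List the chunks (partitions) backing the post_likes hypertable
SELECT show_chunks('post_likes');
-- Drop whole chunks older than three months in a single metadata operation
SELECT drop_chunks('post_likes', older_than => INTERVAL '3 months');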
Partitioning alone, however, doesn't solve the counting problem: computing the statistics we need at query time still means rescanning the full dataset, and parallelizing the scan doesn't avoid that work. To increase efficiency, we can group the data hourly and process it hour by hour. Vanilla PostgreSQL does not allow partial refreshes of materialized views, which is why Timescale developed the continuous aggregates feature.
The continuous aggregate will maintain pre-computed counts. Instead of computing counts at query time or updating a counter on every new like, we can create a materialized view with superpowers.
-- Create a view for hourly like counts
CREATE MATERIALIZED VIEW post_likes_hourly
WITH (timescaledb.continuous) AS
SELECT
post_id,
time_bucket('1 hour', created_at) AS bucket,
count(*) as likes_count
FROM post_likes
GROUP BY post_id, time_bucket('1 hour', created_at);
-- Set refresh policy
SELECT add_continuous_aggregate_policy('post_likes_hourly',
start_offset => INTERVAL '3 hours',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour');
The refresh policy runs on a schedule and only refreshes the part that has not been computed yet. Through a “watermark” mechanism, TimescaleDB stores how far the materialization has progressed and updates the data from the latest watermark point onward. You can read more about it in our dev’s intro to continuous aggregates.
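You can also refresh a specific window manually; the bounds below are just an example:
-- Manually refresh the last day of hourly buckets
CALL refresh_continuous_aggregate('post_likes_hourly', NOW() - INTERVAL '1 day', NOW());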
You may be thinking, “Wait, but what if I change the raw data?” TimescaleDB tracks those changes too and refreshes only the affected parts.
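For example, if a user removes a like, the bucket that like fell into is marked invalid and recomputed on the next refresh; a sketch:
-- Removing a like invalidates its bucket; the next refresh recomputes only that region
DELETE FROM post_likes WHERE post_id = 1 AND user_id = 123;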
If you like this idea, you'll probably also love the ability to use continuous aggregates hierarchically.
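As a sketch (the daily view name is illustrative, and this relies on hierarchical continuous aggregates available in recent TimescaleDB versions), a second continuous aggregate can be defined on top of the hourly one to roll likes up by day:
-- A daily rollup built on top of the hourly continuous aggregate
CREATE MATERIALIZED VIEW post_likes_daily
WITH (timescaledb.continuous) AS
SELECT
post_id,
time_bucket('1 day', bucket) AS bucket,
sum(likes_count) AS likes_count
FROM post_likes_hourly
GROUP BY post_id, time_bucket('1 day', bucket);
Querying either view works like querying any other relation: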
-- Get likes trend over time
SELECT
post_id,
bucket,
likes_count,
sum(likes_count) OVER (PARTITION BY post_id ORDER BY bucket) as cumulative_likes
FROM post_likes_hourly
WHERE post_id = 1
ORDER BY bucket DESC;
Let's benchmark both approaches:
-- Traditional approach
EXPLAIN ANALYZE
UPDATE posts SET likes_count = likes_count + 1 WHERE id = 1;
-- TimescaleDB approach
EXPLAIN ANALYZE
INSERT INTO post_likes (post_id, user_id, created_at)
VALUES (1, 123, NOW());
The TimescaleDB approach shows better performance characteristics under high concurrency: each like is an append-only insert into its own row, so concurrent sessions never queue up on a single hot posts row. It also provides richer analytical capabilities.
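Since EXPLAIN ANALYZE measures a single statement, a rough way to compare the two write paths under concurrency is to run each of these one-line scripts with pgbench (for example, pgbench -n -c 32 -T 60 -f counter_update.sql); the file names and client count are illustrative:
-- counter_update.sql: the denormalized counter path (every client hits the same hot row)
UPDATE posts SET likes_count = likes_count + 1 WHERE id = 1;
-- like_insert.sql: the append-only hypertable path
INSERT INTO post_likes (post_id, user_id, created_at)
VALUES (1, (random() * 1000000)::int, NOW());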
For applications requiring real-time counts, we can set the materialized view parameter timescaledb.materialized_only=false, so queries against the view combine the materialized buckets with raw data that hasn't been materialized yet.
CREATE MATERIALIZED VIEW post_likes_hourly
WITH (timescaledb.continuous, timescaledb.materialized_only=false) AS
SELECT
post_id,
time_bucket('1 hour', created_at) AS bucket,
count(*) as likes_count
FROM post_likes
GROUP BY post_id, time_bucket('1 hour', created_at);
Behind the scenes, TimescaleDB will create a hypertable for the materialized view and refresh the view according to the refresh policy. When the refresh starts, it saves a watermark to track the latest refreshed bucket.
When you query post_likes_hourly, it combines the materialized data with the latest data from the underlying hypertable, filtering only the buckets greater than the watermark. This means that instead of scanning the raw dataset, it only processes the part that has not been materialized yet.
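As a sketch, a real-time total for a single post then becomes a simple aggregation over the view, transparently merging materialized buckets with likes that haven't been materialized yet:
-- Total likes for post 1, combining materialized and fresh data
SELECT post_id, sum(likes_count) AS total_likes
FROM post_likes_hourly
WHERE post_id = 1
GROUP BY post_id;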
Now that we have a continuous aggregate, we can establish a retention policy so the raw post_likes hypertable doesn't grow indefinitely. As the data is stored in chunks, the policy simply deletes the chunks that are older than a certain period:
SELECT add_retention_policy('post_likes', INTERVAL '1 month');
This command runs a background job that deletes chunks older than one month. The raw data will be deleted in the background, while the continuous aggregate keeps its pre-computed counts and stays up to date.
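To confirm the policies are scheduled, you can peek at TimescaleDB's background jobs (the column selection is just for readability):
-- List background jobs, including the refresh and retention policies
SELECT job_id, proc_name, schedule_interval, config
FROM timescaledb_information.jobs;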
Also, the data is removed by dropping entire partitions rather than deleting individual rows. Every partition carries its own metadata, so there is no need to update statistics and no extra work for the VACUUM process.
While denormalized counters might seem appealing for simple OLTP workloads, TimescaleDB's time-series capabilities offer a more scalable and maintainable solution. By leveraging continuous aggregates and proper time-series modeling, we can achieve better performance, richer analytics, and more reliable data management.
Remember: this approach might require a mindset shift, but the benefits in terms of scalability and maintenance make it worthwhile for time-series workloads. To give TimescaleDB a try, install it on your machine. If you prefer a mature, managed PostgreSQL platform that delivers even more scalability, you can try Timescale Cloud for free.