Written by Dylan Paulus
Managing a database is more than just storing information: it's about storing data efficiently and effectively while balancing trade-offs. We can employ techniques like data normalization to optimize our data for maintainability and readability, or data denormalization to optimize for raw query speed. Data normalization is the process of breaking data down into structured tables to reduce duplication and, by enforcing standardization, make the data easier to query and store.
In this article, we'll take a look at what data normalization is, how to apply normal forms to achieve data normalization, challenges and tools for data normalization in PostgreSQL, tips on when to denormalize, and how TimescaleDB makes normalization easier.
The primary benefit of normalizing data is to reduce duplication, but this is just the tip of the iceberg. By reducing data duplication, we can achieve the following benefits:
Reduce cost: Storing less data means we don't need to pay for holding on to that extra data.
Simplify maintenance: Having non-duplicated data means updating that data only needs to occur in a single place, helping maintain data consistency and data integrity.
Enhance security: By segmenting data into smaller, related tables, you can apply more granular access controls.
Improve scalability: Keeping data organized and structured into smaller tables makes sharding and data access easier as your data scales.
Beyond these benefits, you'll find that going through the process of applying the normal forms leaves your data easier to query, maintain, and scale over time.
It is worth taking a minute to talk about data anomalies. When we say "normalization makes databases easier to maintain," what we're talking about is preventing data anomalies. What is a data anomaly? Anomalies occur when the same data value is stored in multiple places, and a mutation (insert, update, delete) modifies the data in one place but not everywhere. This leads to confusion about what a value should be and can cause data loss.
For example, say we had a table to track customer orders which looks like this:
customer_name | address | order_amount | order_items |
John | USA | 10.50 | hair gel, comb |
Gary | UK | 25.66 | backpack |
John | USA | 2.30 | pencil |
If we wanted to delete John as a customer, we would have to delete every row that mentions him, losing previous order information that, as a company, we might want to keep. This is a data loss or delete anomaly: deleting information about John also deletes his previous order information.
customer_name | address | order_amount | order_items |
Gary | UK | 25.66 | backpack |
Going back to the original table, suppose John moved from the USA to the UK before their orders were shipped. To reflect John's move, we need to update every row where customer_name = John. Since the update spans multiple rows, there is a chance we don't update all of them, especially if updates happen through application code. When there are multiple sources of truth and an update changes only some of the rows, we end up with an update anomaly.
customer_name | address | order_amount | order_items |
John | UK | 10.50 | hair gel, comb |
Gary | UK | 25.66 | backpack |
John | USA | 2.30 | pencil |
Normalizing your databases helps avoid anomalies before they can ever occur.
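As a rough sketch of how normalization sidesteps the update anomaly, the orders table above could be split so that each customer's address is stored once. The table and column names here (customers, orders, customer_id, order_id) are assumptions made for illustration, not a schema from the example itself, and order items are left out because they would need their own table.

```sql
-- Hypothetical normalized version of the customer orders table.
CREATE TABLE customers (
    customer_id   SERIAL PRIMARY KEY,
    customer_name TEXT NOT NULL,
    address       TEXT NOT NULL
);

CREATE TABLE orders (
    order_id     SERIAL PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customers(customer_id),
    order_amount DECIMAL(10,2) NOT NULL
);

-- John's address now lives in exactly one row, so a move is a
-- single-row update and cannot be applied only partially.
UPDATE customers
SET address = 'UK'
WHERE customer_name = 'John';
```

Every order picks up the new address through its customer_id, so there is no second copy that could be left stale.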
Data normalization follows a set of rules called normal forms. The normal forms start with the first normal form, then the second normal form, and so on. Each normal form builds off the previous normal form. The first three normal forms are the most important, and the ones most databases follow, but many more exist. In this section, we'll look at the first three normal forms, plus the Boyce-Codd normal form—which expands on the third normal form.
First normal form is the first step in reducing data duplication. A database is in first normal form if the following criteria apply:
A column only contains one value (or, in other words, the column is atomic)
No row repeats in the table
To walk through the various normal forms, we'll start with a completely denormalized table and move through the different normal forms until we have a clean, normalized database.
Let's take the following denormalized logs table for a web application monitoring service:
This table violates first normal form in two ways. First, the tags column contains multiple values (e.g., frontend, backend), so it is not atomic. Second, the first two rows (CRASH | frontend, backend) have exactly identical data. Nothing is distinct about them. We can solve the second problem by introducing a primary key to make each row unique.
With a primary key, each row is distinct from the others, which satisfies the "no repeating rows" rule. Next, we can remove multiple values from the tags column by creating two new tables. First, a table to hold all available tags. Second, a joining table to facilitate a many-to-many relationship between logs and tags.
With these three tables, our data is now in first normal form: no two rows repeat and each column contains exactly one value.
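One way the three tables might be declared is sketched below. The column names are assumptions for illustration (only id, slug, and log_id appear elsewhere in this article), so treat this as one possible shape rather than the exact schema.

```sql
-- Hypothetical 1NF schema for the logs example.
CREATE TABLE logs (
    id   SERIAL PRIMARY KEY,     -- makes otherwise identical rows distinct
    slug TEXT NOT NULL           -- e.g. 'CRASH'
);

CREATE TABLE tags (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL UNIQUE    -- e.g. 'frontend', 'backend'
);

-- Joining table: one row per (log, tag) pair, so every column is atomic.
CREATE TABLE log_tags (
    log_id INTEGER NOT NULL REFERENCES logs(id),
    tag_id INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (log_id, tag_id)
);

-- Reassemble the comma-separated tag list for display when needed.
SELECT l.id, l.slug, string_agg(t.name, ', ' ORDER BY t.name) AS tags
FROM logs l
JOIN log_tags lt ON lt.log_id = l.id
JOIN tags t ON t.id = lt.tag_id
GROUP BY l.id, l.slug;
```

The final query shows that the original multi-valued presentation can still be produced on read, without storing it that way.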
To be in second normal form, a table must follow these criteria:
The table must be in first normal form.
All non-key columns are fully dependent on the candidate key.
The wording you'll find around second normal form can be confusing. Put simply, second normal form is generally a rule about tables with composite keys (when two or more columns are combined to form the primary key). Every other column in the table must depend on all of the composite key's columns, not just one or a few of them.
You might be thinking, what does it mean for a column to depend on another column? Known as a dependency or functional dependency, this means that the value of one column determines the value of another. For example, knowing the course_id in a learning platform database tells you the course name and description: name is dependent on course_id, and description is also dependent on course_id.
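A functional dependency can be spot-checked against existing data with a grouping query. The sketch below assumes a hypothetical courses table with course_id, name, and description columns. An empty result only shows that the current rows are consistent with the dependency, not that the dependency is guaranteed.

```sql
-- Hypothetical table, deliberately without a key so violations are possible.
CREATE TABLE courses (
    course_id   INTEGER NOT NULL,
    name        TEXT,
    description TEXT
);

-- Lists every course_id that maps to more than one name or description.
-- No rows returned means name and description behave as if they are
-- functionally dependent on course_id in the current data.
SELECT course_id
FROM courses
GROUP BY course_id
HAVING COUNT(DISTINCT name) > 1
    OR COUNT(DISTINCT description) > 1;
```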
Another team has implemented an archiving feature to mark logs as archived in our monitoring web app service. The new tables look like this:
We get tasked with checking whether the database is still normalized after the change. The tables are still in first normal form, but we notice that second normal form is now violated. Why? In the log_tags table, the newly added is_archived column marks a log as either archived or not. This flag has nothing to do with tags; tags don't get archived. Since is_archived depends on the log and not on the tag, it depends on only part of the table's composite key, so this table is not in second normal form. Additionally, we can see that is_archived's value is repeated for log_id = 1, log_id = 2, and log_id = 3.
Luckily, this is a simple fix. If we move the column to the table it depends on (in this case, move is_archived to the logs table), the table will be in second normal form.
With is_archived moved to the logs table, our database is not only in first normal form but now also in second normal form.
To be in third normal form, the following criteria must apply:
The data must be in first and second normal form.
There are no transitive dependencies in the table.
A transitive dependency is when a column depends on another column that is not the primary key. For example, a package delivery service may create a table for deliveries containing an id, person, and address. In this example, the address is dependent on the person, which in turn is dependent on the id. This is a transitive dependency: address is transitively dependent on the primary key id through person (id -> person -> address, where each arrow reads as "determines").
Logs in our metrics system don't give enough information about what is happening. We are tasked with adding a description to each log. Easy enough: we'll add a new column to the logs table to store descriptions:
The update follows the first normal form: there are no repeating rows, and all columns contain a single value. Additionally, all tables follow second normal form, as all columns in their respective tables are dependent on the primary key. But we fail to adhere to the third normal form because description is dependent on slug instead of the primary key id. We can fix this by moving slug and description into their own table.
Creating a new levels table puts our database into 3NF. Every column in each table now depends solely on that table's primary key, with no transitive dependencies.
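A minimal sketch of that split is shown below. It assumes logs currently holds slug and description columns, and it uses slug as the primary key of levels. That key choice and the constraint name are assumptions for illustration, and a surrogate id would work just as well.

```sql
BEGIN;

-- One row per level; description now depends directly on the key.
CREATE TABLE levels (
    slug        TEXT PRIMARY KEY,   -- e.g. 'CRASH'
    description TEXT
);

-- Copy each distinct (slug, description) pair out of logs.
-- This fails on the primary key if a slug has two different
-- descriptions, which would mean the dependency did not actually hold.
INSERT INTO levels (slug, description)
SELECT DISTINCT slug, description
FROM logs
WHERE slug IS NOT NULL;

-- logs keeps only the reference to its level.
ALTER TABLE logs
    DROP COLUMN description;

ALTER TABLE logs
    ADD CONSTRAINT logs_slug_fkey
    FOREIGN KEY (slug) REFERENCES levels (slug);

COMMIT;
```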
Most database systems can apply the first three normal forms and end up with an extremely clean, normalized data model. However, in rare cases a table that satisfies third normal form can still fall short of being fully normalized. This loophole is closed by the Boyce-Codd normal form (BCNF). Since it is seen as an extension, or stricter version, of third normal form, you will commonly see it referred to as the 3.5 normal form. To be in Boyce-Codd normal form, your table must comply with the following:
It must be in first, second, and third normal form.
Every determinant (a column or set of columns that another column depends on) must be a candidate key.
Back in our metrics system, we decide to associate each level in the levels table with an owner and add a severity. The owner is determined by the severity of the level.
This current iteration satisfies all the normal forms we've implemented up to this point.
1NF: All values are atomic.
2NF: There are no partial dependencies (all attributes depend on the entire primary key).
3NF: There are no transitive dependencies through non-prime attributes.
However, since owner is dependent on severity and severity is not a candidate key, this table design violates Boyce-Codd normal form. To fix this, we can split severity and owner into a separate table.
With a new category_reviewers table created, using severity as the primary key, the database is fully in first normal form, second normal form, third normal form, and Boyce-Codd normal form.
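The split might look like the following sketch. It assumes levels currently has severity and owner columns stored as text, and the constraint name is made up for illustration.

```sql
BEGIN;

-- severity is the determinant of owner, so it becomes the key.
CREATE TABLE category_reviewers (
    severity TEXT PRIMARY KEY,
    owner    TEXT NOT NULL
);

-- One row per severity. This fails on the primary key if a severity
-- maps to two different owners in the existing data.
INSERT INTO category_reviewers (severity, owner)
SELECT DISTINCT severity, owner
FROM levels
WHERE severity IS NOT NULL;

-- levels keeps severity and looks the owner up through it.
ALTER TABLE levels
    DROP COLUMN owner;

ALTER TABLE levels
    ADD CONSTRAINT levels_severity_fkey
    FOREIGN KEY (severity) REFERENCES category_reviewers (severity);

COMMIT;
```

After this change, reassigning a severity to a new owner is a single-row update in category_reviewers.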
Normalization reduces data duplication by organizing data into separate tables with proper relationships. For example, instead of repeatedly storing a customer's address in every order record, you store it once in a customer table and reference it through foreign keys. This not only saves storage space but also prevents update anomalies, where the same data might be updated in some places but not others.
When data is normalized, updates only need to be made in one place. If a customer changes their address, you only update it in the customers table rather than hunting down every order record where that address appears.
While some believe normalization always hurts performance, it can actually improve it in write-heavy applications. Smaller, focused tables with proper indexing often perform better than large denormalized tables, especially for updates and inserts. Join operations between properly normalized and indexed tables can be very efficient.
Adding new types of data or changing existing structures is simpler in a normalized database. For instance, if you need to add support for multiple shipping addresses per customer, this is much easier when addresses are already in their own table rather than embedded in order records.
Normalization enforces referential integrity through foreign key constraints. This prevents orphaned records and ensures data relationships remain valid. For example, you can't delete a customer record if there are still orders referencing it, preventing inconsistent data states.
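To make the referential integrity point concrete, here is a small sketch in a throwaway schema. The fk_demo schema and its table and column names are assumptions for illustration only. ON DELETE RESTRICT is spelled out for clarity, although PostgreSQL also rejects the delete under the default NO ACTION behavior.

```sql
CREATE SCHEMA IF NOT EXISTS fk_demo;

CREATE TABLE fk_demo.customers (
    customer_id SERIAL PRIMARY KEY,
    name        TEXT NOT NULL
);

CREATE TABLE fk_demo.orders (
    order_id     SERIAL PRIMARY KEY,
    customer_id  INTEGER NOT NULL
                 REFERENCES fk_demo.customers (customer_id)
                 ON DELETE RESTRICT,
    order_amount DECIMAL(10,2) NOT NULL
);

INSERT INTO fk_demo.customers (name) VALUES ('John');

INSERT INTO fk_demo.orders (customer_id, order_amount)
SELECT customer_id, 10.50
FROM fk_demo.customers
WHERE name = 'John';

-- Expected to fail with a foreign key violation, because an order
-- still references this customer.
DELETE FROM fk_demo.customers
WHERE name = 'John';
```

The intentional error on the last statement is the database refusing to create an orphaned order.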
When it comes to data normalization in PostgreSQL, even experienced database administrators face several common challenges. Let's dive into these challenges and explore practical solutions that can help you build more robust and efficient databases.
One of the most significant challenges in data normalization is managing complex relationships between different entities. Imagine your use case is an e-commerce platform where products can belong to multiple categories, have various attributes and maintain price histories. This complexity can quickly become overwhelming.
A practical solution is to implement bridge tables effectively. Instead of creating a tangled web of direct relationships, use intermediate tables to maintain clean many-to-many relationships. For example:
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    base_price DECIMAL(10,2)
);

CREATE TABLE categories (
    category_id SERIAL PRIMARY KEY,
    name VARCHAR(255)
);

CREATE TABLE product_categories (
    product_id INTEGER REFERENCES products(product_id),
    category_id INTEGER REFERENCES categories(category_id),
    PRIMARY KEY (product_id, category_id)
);

CREATE TABLE price_history (
    product_id INTEGER REFERENCES products(product_id),
    price DECIMAL(10,2),
    effective_date TIMESTAMPTZ,
    PRIMARY KEY (product_id, effective_date)
);
This structure maintains data integrity while keeping relationships clean and manageable. The bridge table product_categories allows products to belong to multiple categories without violating normalization principles.
A common misconception is that higher normalization levels always lead to slower query performance. While it's true that joins can impact performance, denormalization isn't always the answer. The key is finding the right balance for your specific use case.
Here are some strategies you can use for maintaining performance with normalized data:
Use appropriate indexing strategies. Create indexes on frequently joined columns and columns used in WHERE clauses:
CREATE INDEX idx_product_categories_product_id ON product_categories(product_id);
CREATE INDEX idx_product_categories_category_id ON product_categories(category_id);
Implement materialized views for complex queries that are run frequently but don't need real-time data:
CREATE MATERIALIZED VIEW product_category_summary AS
SELECT p.name AS product_name,
       string_agg(c.name, ', ') AS categories,
       p.base_price
FROM products p
JOIN product_categories pc ON p.product_id = pc.product_id
JOIN categories c ON pc.category_id = c.category_id
GROUP BY p.product_id, p.name, p.base_price;
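A materialized view stores a snapshot of its query result, so it does not pick up later changes to products or categories on its own. A minimal sketch of refreshing and then reading the view defined above:

```sql
-- Re-run the view's query and replace the stored snapshot.
-- A plain refresh blocks reads of the view while it runs.
REFRESH MATERIALIZED VIEW product_category_summary;

-- Reads hit the stored result and perform no joins at query time.
SELECT product_name, categories, base_price
FROM product_category_summary
ORDER BY product_name;
```

How often to refresh is a trade-off between how fresh the data needs to be and the cost of recomputing the query.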
Successful data modeling in PostgreSQL requires a thoughtful approach that considers both present needs and future scalability. Start with a basic model and progressively refine it based on actual usage patterns. For instance, if you're tracking user interactions with products, you might start with a simple events table:
-- Assumes a users table with a user_id primary key already exists.
CREATE TABLE user_events (
    event_id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id),
    product_id INTEGER REFERENCES products(product_id),
    event_type VARCHAR(50),
    event_timestamp TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);
As your needs evolve, you can split this into more specialized tables without breaking existing applications:
CREATE TABLE product_views (
    view_id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id),
    product_id INTEGER REFERENCES products(product_id),
    view_timestamp TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    session_duration INTEGER
);

CREATE TABLE product_purchases (
    purchase_id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id),
    product_id INTEGER REFERENCES products(product_id),
    purchase_timestamp TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    quantity INTEGER,
    purchase_price DECIMAL(10,2)
);
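One possible way to keep existing read queries working after such a split is a compatibility view that presents the specialized tables in the old shape. The view name user_events_compat and the 'view' and 'purchase' labels are assumptions for illustration; the article does not prescribe this approach.

```sql
-- Hypothetical read-only view shaped like the original user_events table
-- (without event_id, which has no single equivalent after the split).
CREATE VIEW user_events_compat AS
SELECT user_id,
       product_id,
       'view'::VARCHAR(50) AS event_type,
       view_timestamp      AS event_timestamp
FROM product_views
UNION ALL
SELECT user_id,
       product_id,
       'purchase'::VARCHAR(50),
       purchase_timestamp
FROM product_purchases;

-- Existing reporting queries can keep filtering by event_type.
SELECT event_type, COUNT(*) AS events
FROM user_events_compat
GROUP BY event_type;
```

Writes would still need to target the new tables directly, since a UNION ALL view is not automatically updatable.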
Use PostgreSQL's robust constraint system to maintain data integrity. This includes not just primary and foreign keys, but also CHECK constraints and unique constraints:
ALTER TABLE products
    ADD CONSTRAINT price_check
    CHECK (base_price >= 0);

ALTER TABLE product_purchases
    ADD CONSTRAINT quantity_check
    CHECK (quantity > 0);
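The examples above cover CHECK constraints. A unique constraint follows the same pattern, sketched here on the categories table on the assumption that category names should not repeat. That business rule and the constraint name are illustrative.

```sql
-- Reject a second category with the same name.
-- Fails if duplicate names already exist in the table.
ALTER TABLE categories
    ADD CONSTRAINT categories_name_unique
    UNIQUE (name);
```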
Having your data normalized is the goal for most PostgreSQL databases for the reasons we previously looked at, but getting there has its challenges.
First, it is difficult and takes time to fully model complex data relationships. Practicing applying normal forms to denormalized or partially normalized tables can help reduce the complexity of data normalization. Additionally, drawing Entity Relationship Diagrams (ERDs) to get a visual understanding of your data can greatly help. Most diagramming tools can create ERD visualizations; a popular free one is diagrams.net (draw.io). An ERD visualization of the logs table from the previous normal forms section would look like:
A second challenge of normalization is balancing performance with maintainability. We will cover this more in the section about denormalization, but normalizing data leaves us with smaller, self-contained tables. To get meaning out of the data, we need to write queries with multiple joins to piece it back together, and each additional join adds work for the database that can slow queries down.
Throughout this article, we've talked about the benefits of data normalization and how to go about it, but there are scenarios when we would not want to normalize our database. By going through data normalization we break down data into smaller, purpose-built tables to reduce duplication. As a result, when we write queries we'll need to perform joins to stitch data back together. Joins aren't free and come at a performance cost. PostgreSQL has done a lot to optimize join operations, but at the end of the day, the only thing faster than a join is not using joins.
In cases where we need the utmost speed, we can sacrifice maintainability for query speed. This is done by denormalizing data. Just like how we apply the normal forms to normalize data, undoing the normal forms (or doing the opposite) will denormalize data. This looks like the following:
Storing multiple data values in a single column
Duplicating data between rows
Storing unrelated data in a single row
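As an illustration of the first two techniques, the logs example could be collapsed back into a single table. The table name logs_denormalized and its columns are assumptions for this sketch only.

```sql
-- Hypothetical denormalized logs table: no joins needed to read a log.
CREATE TABLE logs_denormalized (
    id                SERIAL PRIMARY KEY,
    slug              TEXT NOT NULL,
    level_description TEXT,                       -- duplicated on every row sharing a slug
    tags              TEXT[] NOT NULL DEFAULT '{}' -- multiple values in one column
);

-- A single-table read replaces the joins through log_tags, tags, and levels.
SELECT id, slug, level_description
FROM logs_denormalized
WHERE 'frontend' = ANY (tags);
```

The cost is the one described earlier: changing a level's description now means updating every row that carries it.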
TimescaleDB uses some of these data denormalization techniques to speed up query performance and reduce disk size through compression.
When it comes to picking the best tools for the job, when working with time-series data (or other challenging workloads, like real-time analytics, events, or even vector data) there is no better database than TimescaleDB. Using TimescaleDB simplifies the trade-offs between normalizing and denormalizing data (though you should still evaluate your data models!) through automated compression, continuous aggregates, and partitioning.
In TimescaleDB, compression combines multiple rows in a chunk into a single row, denormalizing the table by reversing first normal form. By compressing multiple rows into a single row, we reduce disk space because a single row takes up less space, but we also speed up queries because no joins are needed to get all the information in a time range.
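A rough sketch of enabling compression on a hypertable is shown below. The log_events table, its columns, the choice of slug as the segment-by column, and the seven-day threshold are all assumptions for illustration. The function and option names follow TimescaleDB's long-standing compression API, but the exact syntax varies between versions, so check the documentation for the release you run.

```sql
-- Requires the TimescaleDB extension to be installed on the server.
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Hypothetical time-series table of log events.
CREATE TABLE log_events (
    time    TIMESTAMPTZ NOT NULL,
    slug    TEXT NOT NULL,
    message TEXT
);

-- Turn the table into a hypertable partitioned on the time column.
SELECT create_hypertable('log_events', 'time');

-- Enable compression, grouping compressed rows by slug.
ALTER TABLE log_events SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'slug'
);

-- Compress chunks automatically once they are older than seven days.
SELECT add_compression_policy('log_events', INTERVAL '7 days');
```

Queries continue to use ordinary SQL against log_events, whether the chunks they touch are compressed or not.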
Features like compression allow time-series workloads to scale while maintaining the simplicity of normalized data. In sum, you should choose TimescaleDB over plain PostgreSQL to normalize your data if your use case ticks one of these boxes:
You are working with time-series data or any schema where time is a significant dimension.
You require long-term storage and efficient compression for normalized time-series datasets.
You frequently perform time-based aggregations or joins across normalized tables.
You want built-in automation for partitioning, retention policies, and continuous aggregates.
You need to scale time-series workloads while maintaining the logical simplicity of normalized schemas.
Data normalization is a necessary step in making databases maintainable, cost-effective, and performant. In this article, we explored the process of normalizing a database through the normal forms, the pros and cons of normalization, and when you might want to denormalize a table.
Knowing when to normalize and when to denormalize is an important skill when designing databases, one that comes down to balancing performance with maintainability. TimescaleDB strikes a balance by providing tools to easily maintain the simplicity and maintainability of normalization but with the performance benefits of denormalization.
You can self-host TimescaleDB or leave the worries of managing data infrastructure behind with Timescale Cloud. Get it for free (no credit card required) for 30 days on AWS, Azure, or GCP.