How to Use PostgreSQL for Data Normalization

Written by Dylan Paulus

Managing a database is more than just storing information; it's about storing data efficiently and effectively while balancing trade-offs. We can employ techniques like data normalization to optimize our data for maintainability and readability, or data denormalization to optimize for raw query speed. Data normalization is the process of breaking data down into structured tables to reduce duplication and, by enforcing a standard structure, make it easier to query and store.

In this article, we'll take a look at what data normalization is, how to apply normal forms to achieve data normalization, challenges and tools for data normalization in PostgreSQL, tips on when to denormalize, and how TimescaleDB makes normalization easier.

Why Data Normalization?

The primary benefit of normalizing data is to reduce duplication, but this is just the tip of the iceberg. By reducing data duplication, we can achieve the following benefits:

  • Reduce cost: Storing less data means we don't need to pay for holding on to that extra data.

  • Simplify maintenance: Having non-duplicated data means updating that data only needs to occur in a single place, helping maintain data consistency and data integrity.

  • Enhance security: By segmenting data into smaller, related tables, you can apply more granular access controls.

  • Improve scalability: Keeping data organized and structured into smaller tables makes sharding and data access easier as your data scales.

These benefits compound: as you work through the normal forms, you'll find that your data becomes easier to query, maintain, and scale over time.

Anomalies

It is worth taking a minute to talk about data anomalies. When we say "normalization makes databases easier to maintain," what we're talking about is preventing data anomalies. What is a data anomaly? Anomalies occur when the same data value is stored in multiple places, and a mutation (insert, update, delete) modifies the data in one place but not everywhere. This leads to confusion about what a value should be and can cause data loss.

For example, say we had a table to track customer orders which looks like this:

customer_name | address | order_amount | order_items
--------------+---------+--------------+----------------
John          | USA     | 10.50        | hair gel, comb
Gary          | UK      | 25.66        | backpack
John          | USA     | 2.30         | pencil

If we wanted to remove John as a customer, we would have to delete his rows, and with them all of his previous order information, which as a company we might want to keep. This is a data loss, or delete, anomaly: deleting information about John also deletes his order history.

customer_name | address | order_amount | order_items
--------------+---------+--------------+-------------
Gary          | UK      | 25.66        | backpack

Now suppose John moves from the USA to the UK before their orders are shipped. To reflect the move, we need to update every row where customer_name = John. Since the update spans multiple rows, there is a chance we don't update all of them, especially if updates happen through application code. When there are multiple sources of truth and an update changes only some of the rows, we end up with an update anomaly.

customer_name | address | order_amount | order_items
--------------+---------+--------------+----------------
John          | UK      | 10.50        | hair gel, comb
Gary          | UK      | 25.66        | backpack
John          | USA     | 2.30         | pencil

Normalizing your databases helps avoid anomalies before they can ever occur.
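As a rough sketch of how this plays out, the orders table above could be split so that each customer's address lives in exactly one row. The table and column names here (customers, orders, customer_id, order_id) are illustrative assumptions, not part of the original example.

```sql
-- Hypothetical schema: the address is stored once per customer.
CREATE TABLE customers (
    customer_id   SERIAL PRIMARY KEY,
    customer_name TEXT NOT NULL,
    address       TEXT NOT NULL
);

-- Orders point at a customer instead of repeating the name and address.
-- order_items stays as free text to mirror the example table; a fully
-- normalized design would move the items into a table of their own.
CREATE TABLE orders (
    order_id     SERIAL PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customers(customer_id),
    order_amount NUMERIC(10,2) NOT NULL,
    order_items  TEXT NOT NULL
);

-- John's move now touches a single row, so a partial update cannot occur.
UPDATE customers
SET address = 'UK'
WHERE customer_name = 'John';
```

With this layout, the foreign key also stops a customer row from being deleted while orders still reference it, so order history cannot disappear as a side effect.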

The Normal Forms

Data normalization follows a set of rules called normal forms. The normal forms start with the first normal form, then the second normal form, and so on, and each one builds on the previous. The first three normal forms are the most important, and the ones most databases follow, but many more exist. In this section, we'll look at the first three normal forms, plus the Boyce-Codd normal form, which expands on the third normal form.

First normal form (1NF)

First normal form is the first step in reducing data duplication. A database is in first normal form if the following criteria apply:

  • A column only contains one value (or, in other words, the column is atomic)

  • No row repeats in the table

Example

To walk through the various normal forms, we'll start with a completely denormalized table and move through the different normal forms until we have a clean, normalized database.

Let's take the following denormalized logs table for a web application monitoring service:

This table violates first normal form. First, the tags column contains multiple values (e.g., frontend, backend), so it is not atomic. Second, the first two rows (CRASH | frontend, backend) are exactly identical; nothing distinguishes them. We can solve the second problem by introducing a primary key to make each row unique.

With a primary key, each row is distinct from the others, which satisfies "no repeating rows." Next, we can remove multiple values from the tags column by creating two new tables. First, a table to hold all available tags. Second, a joining table to facilitate a many-to-many relationship between the logs and the tags.

With these three tables, our data is now in first normal form: no two rows repeat and each column contains exactly one value.
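The three tables might be declared roughly as follows. This is a minimal sketch: the article names the logs and log_tags tables and the slug, tags, and log_id columns, while every other name (such as tag_id and name) is an assumption made for illustration.

```sql
-- A surrogate primary key makes otherwise identical log rows distinct.
CREATE TABLE logs (
    id   SERIAL PRIMARY KEY,
    slug TEXT NOT NULL            -- e.g. 'CRASH'
);

-- One row per available tag.
CREATE TABLE tags (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL UNIQUE     -- e.g. 'frontend', 'backend'
);

-- Joining table: one row per (log, tag) pair instead of a
-- comma-separated list inside a single column.
CREATE TABLE log_tags (
    log_id INTEGER NOT NULL REFERENCES logs(id),
    tag_id INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (log_id, tag_id)
);

-- Reassemble the original shape when a combined view of the data is needed.
SELECT l.id, l.slug, string_agg(t.name, ', ' ORDER BY t.name) AS tags
FROM logs l
LEFT JOIN log_tags lt ON lt.log_id = l.id
LEFT JOIN tags t ON t.id = lt.tag_id
GROUP BY l.id, l.slug;
```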

Second normal form (2NF)

To be in second normal form, a table must follow these criteria:

  • The table must be in first normal form.

  • All non-key columns are fully dependent on the entire candidate key.

The wording you'll find around second normal form can be confusing. Put simply, second normal form is generally a rule about tables with composite keys (when two or more columns are combined to form the primary key). Every other column in the table must depend on all of the composite key's columns, not just one or a few of them.

You might be thinking, what does it mean for a column to depend on another column? Also called functional dependency, this means that knowing the value of one column determines the value of another. For example, knowing the course_id in a learning platform database gives you the course name and description: name is dependent on course_id, and description is also dependent on course_id.
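One way to sanity-check a suspected dependency against existing data is to look for counterexamples. The sketch below assumes a hypothetical course_catalog table; only the column names course_id, name, and description come from the example above.

```sql
-- Hypothetical table used only to illustrate the check.
CREATE TABLE course_catalog (
    course_id   INTEGER,
    name        TEXT,
    description TEXT
);

-- Lists every course_id that maps to more than one name.
-- Any row returned is a counterexample to "name depends on course_id".
SELECT course_id
FROM course_catalog
GROUP BY course_id
HAVING COUNT(DISTINCT name) > 1;
```

An empty result is consistent with the dependency but does not prove it. Whether a dependency truly holds is a property of the domain, not of the rows that happen to be stored today.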

Example

Another team has implemented an archiving feature to mark logs as archived in our monitoring web app service. The new tables look like this:

We get tasked with checking if the changes to the database are normalized. The tables are still in first normal form, but we notice that the change violates second normal form. Why? In the log_tags table, the newly added is_archived column marks a log as either archived or not. This flag has nothing to do with tags; tags don't get archived. Since is_archived depends only on the log and not on the tag, it depends on just part of the log_tags composite key, so this table is not in second normal form. Additionally, we can see that is_archived's value is repeated for log_id = 1, log_id = 2, and log_id = 3.

Luckily, this is a simple fix. If we move the column to the table it depends on (in this case, moving is_archived to the logs table), the database will be in second normal form.

With is_archived moved to the logs table, our database is not only in first normal form but also in second normal form.
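On an existing database, the move could be carried out with a short migration along these lines. This is a sketch under assumptions: it presumes is_archived is a BOOLEAN, that logs has an id primary key, and that a log counts as archived if any of its log_tags rows were flagged.

```sql
-- 1. Add the flag where it belongs.
ALTER TABLE logs
    ADD COLUMN is_archived BOOLEAN NOT NULL DEFAULT FALSE;

-- 2. Carry over the existing values from the joining table.
UPDATE logs l
SET is_archived = TRUE
WHERE EXISTS (
    SELECT 1
    FROM log_tags lt
    WHERE lt.log_id = l.id
      AND lt.is_archived
);

-- 3. Remove the partially dependent column.
ALTER TABLE log_tags
    DROP COLUMN is_archived;
```

Running the three statements inside a single transaction keeps readers from ever seeing a state where the flag exists in neither table.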

Third normal form (3NF)

To be in third normal form, the following criteria must apply:

  • The data must be in first and second normal form.

  • There are no transitive dependencies in the table.

A transitive dependency is when a non-key column depends on another non-key column rather than directly on the primary key. For example, a package delivery service may create a table for deliveries containing an id, person, and address. In this example, the address is dependent on the person, which in turn is dependent on the id. This is a transitive dependency: address depends on the primary key id only through person (id -> person -> address, where each arrow reads as "determines").

Example

Logs in our metrics system don't give enough information about what is happening. We are tasked with adding a description to each log. Easy enough: we'll add a new column to the logs table to store descriptions:

The update follows the first normal form: there are no repeating rows, and all columns contain a single value. Additionally, all tables follow second normal form, as all columns in their respective tables are dependent on the primary key. But we fail to adhere to the third normal form because description is dependent on slug instead of the primary key id. We can fix this by moving slug and description into their own table.

Creating a new levels table puts our database into 3NF. Each column in its respective table depends solely on the primary key, without any transitive dependencies.
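Written out as a fresh schema, the 3NF layout could look something like this. The levels table and the slug, description, and is_archived columns come from the walkthrough; the level_id foreign key column is an assumed name.

```sql
-- Each level's slug and description are stored once.
CREATE TABLE levels (
    id          SERIAL PRIMARY KEY,
    slug        TEXT NOT NULL UNIQUE,   -- e.g. 'CRASH'
    description TEXT
);

-- Logs reference a level instead of carrying slug and description themselves.
CREATE TABLE logs (
    id          SERIAL PRIMARY KEY,
    level_id    INTEGER NOT NULL REFERENCES levels(id),
    is_archived BOOLEAN NOT NULL DEFAULT FALSE
);

-- The description is one join away when a query needs it.
SELECT l.id, lv.slug, lv.description
FROM logs l
JOIN levels lv ON lv.id = l.level_id;
```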

Boyce-Codd normal form (3.5NF)

Most database systems can apply the first three normal forms and have an extremely clean, normalized data model. However, a loophole exists: in rare cases, a table in third normal form can still not be fully normalized. The loophole is patched with a normal form called the Boyce-Codd normal form. Since the Boyce-Codd normal form is seen as an extension of the third normal form, or a stricter version of 3NF, you will commonly see it referred to as the 3.5 normal form. To be in Boyce-Codd normal form, your table must comply with the following:

  • It must be in first, second, and third normal form.

  • For every dependency in the table, the column (or set of columns) being depended on must be a candidate key.

Example

Back in our metrics system, we decide we want to associate levels with an owner and add a severity. Each owner is determined by the severity of the level.

This current iteration satisfies all the normal forms we've implemented up to this point.

  • 1NF: All values are atomic.

  • 2NF: There are no partial dependencies (all attributes depend on the entire primary key).

  • 3NF: There are no transitive dependencies through non-prime attributes.

However, since owner is dependent on severity and severity is not a candidate key, this table design violates Boyce-Codd normal form. To fix this, we can move severity and owner into a separate table.

With a new category_reviewers table created, and using severity as the primary key, the database is fully in first normal form, second normal form, third normal form, and Boyce-Codd normal form.
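A possible shape for the result is sketched below as a fresh schema. The category_reviewers table with severity as its primary key follows the description above; the column types and the foreign key from levels are assumptions.

```sql
-- severity determines owner, so severity is the key of its own table.
CREATE TABLE category_reviewers (
    severity TEXT PRIMARY KEY,
    owner    TEXT NOT NULL
);

-- levels keeps only the severity; the owner is looked up through it.
CREATE TABLE levels (
    id          SERIAL PRIMARY KEY,
    slug        TEXT NOT NULL UNIQUE,
    description TEXT,
    severity    TEXT NOT NULL REFERENCES category_reviewers(severity)
);

-- Find the owner responsible for each level.
SELECT lv.slug, lv.severity, cr.owner
FROM levels lv
JOIN category_reviewers cr ON cr.severity = lv.severity;
```

Changing the owner for a severity is now a single-row update in category_reviewers, regardless of how many levels share that severity.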

Benefits of Data Normalization in PostgreSQL

Database integrity and reduced redundancy

Normalization reduces data duplication by organizing data into separate tables with proper relationships. For example, instead of repeatedly storing a customer's address in every order record, you store it once in a customer table and reference it through foreign keys. This not only saves storage space but also prevents update anomalies, where the same data might be updated in some places but not others.

Simplified data maintenance

When data is normalized, updates only need to be made in one place. If a customer changes their address, you only update it in the customers table rather than hunting down every order record where that address appears.

Better query performance

While some believe normalization always hurts performance, it can actually improve it in write-heavy applications. Smaller, focused tables with proper indexing often perform better than large denormalized tables, especially for updates and inserts. Join operations between properly normalized and indexed tables can be very efficient.

Easier data modifications

Adding new types of data or changing existing structures is simpler in a normalized database. For instance, if you need to add support for multiple shipping addresses per customer, this is much easier when addresses are already in their own table rather than embedded in order records.

Improved data consistency

Normalization enforces referential integrity through foreign key constraints. This prevents orphaned records and ensures data relationships remain valid. For example, you can't delete a customer record if there are still orders referencing it, preventing inconsistent data states.

Challenges and Solutions for Data Normalization in PostgreSQL

When it comes to data normalization in PostgreSQL, even experienced database administrators face several common challenges. Let's dive into these challenges and explore practical solutions that can help you build more robust and efficient databases.

Handling complex data relationships

One of the most significant challenges in data normalization is managing complex relationships between different entities. Imagine your use case is an e-commerce platform where products can belong to multiple categories, have various attributes and maintain price histories. This complexity can quickly become overwhelming.

A practical solution is to implement bridge tables effectively. Instead of creating a tangled web of direct relationships, use intermediate tables to maintain clean many-to-many relationships. For example:

CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    base_price DECIMAL(10,2)
);

CREATE TABLE categories (
    category_id SERIAL PRIMARY KEY,
    name VARCHAR(255)
);

CREATE TABLE product_categories (
    product_id INTEGER REFERENCES products(product_id),
    category_id INTEGER REFERENCES categories(category_id),
    PRIMARY KEY (product_id, category_id)
);

CREATE TABLE price_history (
    product_id INTEGER REFERENCES products(product_id),
    price DECIMAL(10,2),
    effective_date TIMESTAMPTZ,
    PRIMARY KEY (product_id, effective_date)
);

This structure maintains data integrity while keeping relationships clean and manageable. The bridge table product_categories allows products to belong to multiple categories without violating normalization principles.

The performance trade-off

A common misconception is that higher normalization levels always lead to slower query performance. While it's true that joins can impact performance, denormalization isn't always the answer. The key is finding the right balance for your specific use case.

Here are some strategies you can use for maintaining performance with normalized data:

Use appropriate indexing strategies. Create indexes on frequently joined columns and columns used in WHERE clauses:

-- The composite primary key (product_id, category_id) already indexes lookups
-- by product_id, so only the reverse direction needs an index of its own.
CREATE INDEX idx_product_categories_category_id ON product_categories(category_id);

Implement materialized views for complex queries that are run frequently but don't need real-time data:

CREATE MATERIALIZED VIEW product_category_summary AS
SELECT p.name AS product_name,
       string_agg(c.name, ', ') AS categories,
       p.base_price
FROM products p
JOIN product_categories pc ON p.product_id = pc.product_id
JOIN categories c ON pc.category_id = c.category_id
GROUP BY p.product_id, p.name, p.base_price;
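A materialized view stores its result at creation time and does not update itself, so it has to be refreshed whenever fresher data is needed. A minimal sketch of that cycle, using the view defined above (how often to refresh is left as a decision for your workload):

```sql
-- Recompute the stored result. Until this runs, the view keeps
-- returning the data from its previous refresh.
REFRESH MATERIALIZED VIEW product_category_summary;

-- Reads hit the precomputed rows instead of re-running the joins.
SELECT product_name, categories, base_price
FROM product_category_summary
ORDER BY product_name;
```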

Effective data modeling strategies

Successful data modeling in PostgreSQL requires a thoughtful approach that considers both present needs and future scalability. Start with a basic model and progressively refine it based on actual usage patterns. For instance, if you're tracking user interactions with products, you might start with a simple events table:

-- Assumes a users table with a user_id primary key already exists.
CREATE TABLE user_events (
    event_id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id),
    product_id INTEGER REFERENCES products(product_id),
    event_type VARCHAR(50),
    event_timestamp TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);

As your needs evolve, you can split this into more specialized tables without breaking existing applications:

CREATE TABLE product_views (
    view_id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id),
    product_id INTEGER REFERENCES products(product_id),
    view_timestamp TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    session_duration INTEGER
);

CREATE TABLE product_purchases (
    purchase_id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id),
    product_id INTEGER REFERENCES products(product_id),
    purchase_timestamp TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    quantity INTEGER,
    purchase_price DECIMAL(10,2)
);
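One way to keep existing readers working after such a split is to expose the old shape as a view over the new tables. The sketch below is an assumption about how that could be done, not a step from the original walkthrough: it presumes the original user_events table has already been migrated and renamed or dropped so the name is free, and the 'view' and 'purchase' labels are hypothetical event_type values.

```sql
-- Present the specialized tables in the shape of the old events table.
-- event_id is omitted because the two tables have independent id sequences.
CREATE VIEW user_events AS
SELECT user_id,
       product_id,
       'view'::VARCHAR(50)     AS event_type,
       view_timestamp          AS event_timestamp
FROM product_views
UNION ALL
SELECT user_id,
       product_id,
       'purchase'::VARCHAR(50) AS event_type,
       purchase_timestamp      AS event_timestamp
FROM product_purchases;
```

Queries that only read from user_events can continue unchanged, while code that writes events would need to target the specialized tables directly.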

Constraint management

Use PostgreSQL's robust constraint system to maintain data integrity. This includes not just primary and foreign keys, but also CHECK constraints and unique constraints:

ALTER TABLE products
    ADD CONSTRAINT price_check
    CHECK (base_price >= 0);

ALTER TABLE product_purchases
    ADD CONSTRAINT quantity_check
    CHECK (quantity > 0);
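A unique constraint follows the same pattern. As an illustration, and assuming category names are meant to be distinct (the schema above does not say so), the categories table could be tightened like this; the constraint name is hypothetical.

```sql
-- Reject a second category with the same name.
ALTER TABLE categories
    ADD CONSTRAINT categories_name_unique
    UNIQUE (name);
```

Adding the constraint fails if duplicate names already exist, so any duplicates need to be merged first.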

Tools for Data Normalization in PostgreSQL

Having your data normalized is the goal for most PostgreSQL databases for the reasons we previously looked at, but getting there has its challenges. 

First, it is difficult and takes time to fully model complex data relationships. Practicing applying normal forms to denormalized or partially normalized tables can help reduce the complexity of data normalization. Additionally, drawing Entity Relationship Diagrams (ERDs) to get a visual understanding of your data can greatly help in data normalization. Most diagramming tools can create ERD visualizations; a popular free one is diagrams.net (draw.io). An ERD of the logs tables from the normal forms section, for example, would show logs connected to tags through the log_tags joining table.

A second challenge of normalization is balancing performance with maintainability. We will cover this more in the section about denormalization, but normalizing data leaves us with smaller, self-contained tables. To get meaning out of the data, we need to write queries with multiple joins to piece it back together, and each additional join adds work that can slow a query down.

Denormalization: When and why 

Throughout this article, we've talked about the benefits of data normalization and how to go about it, but there are scenarios when we would not want to normalize our database. By going through data normalization, we break data down into smaller, purpose-built tables to reduce duplication. As a result, when we write queries we need to perform joins to stitch the data back together. Joins aren't free and come at a performance cost. PostgreSQL has done a lot to optimize join operations, but at the end of the day, the only thing faster than a join is not using joins.

In cases where we need the utmost speed, we can sacrifice maintainability for query speed. This is done by denormalizing data. Just like how we apply the normal forms to normalize data, undoing the normal forms (or doing the opposite) will denormalize data. This looks like the following:

  • Storing multiple data values in a single column

  • Duplicating data between rows

  • Storing unrelated data in a single row
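To make the first two techniques concrete, here is a hedged sketch of a deliberately denormalized variant of the logs example. The table name logs_denormalized and its layout are assumptions for illustration. It uses a PostgreSQL array column to hold several tags in one column and repeats the description on every row.

```sql
-- Denormalized on purpose: no tags, log_tags, or levels tables.
CREATE TABLE logs_denormalized (
    id          SERIAL PRIMARY KEY,
    slug        TEXT NOT NULL,
    description TEXT,                        -- duplicated for every row with the same slug
    tags        TEXT[] NOT NULL DEFAULT '{}' -- multiple values in a single column
);

-- A GIN index lets PostgreSQL search inside the array efficiently.
CREATE INDEX idx_logs_denormalized_tags
    ON logs_denormalized USING GIN (tags);

-- Everything needed is in one row, so no joins are required.
SELECT id, slug, description
FROM logs_denormalized
WHERE tags @> ARRAY['frontend'];
```

The trade-off is the one described earlier: reads get simpler, but renaming a tag or editing a description now means touching many rows.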

TimescaleDB uses some of these data denormalization techniques to speed up query performance and reduce disk size through compression.

TimescaleDB over PostgreSQL for data normalization

When working with time-series data (or other challenging workloads, like real-time analytics, events, or even vector data), there is no better database than TimescaleDB. Using TimescaleDB simplifies the trade-offs between normalizing and denormalizing data (though you should still evaluate your data models!) through automated compression, continuous aggregates, and partitioning.

In TimescaleDB, compression combines multiple rows in a chunk into a single row, denormalizing the table by reversing first normal form. By compressing multiple rows into a single row, we reduce disk space because a single row takes up less space, but we also speed up queries because no joins are needed to get all the information in a time range.
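Enabling compression on a hypertable might look roughly like the following. This is a sketch, not a recommended configuration: the log_events table, its columns, the level_id segment-by choice, and the seven-day threshold are all hypothetical, and it assumes the timescaledb extension is installed. Option and function names have changed across TimescaleDB releases, so check the documentation for the version you run.

```sql
-- Hypothetical time-series table for the monitoring example.
CREATE TABLE log_events (
    time     TIMESTAMPTZ NOT NULL,
    level_id INTEGER NOT NULL,
    message  TEXT
);

-- Turn the table into a hypertable partitioned on the time column.
SELECT create_hypertable('log_events', 'time');

-- Enable compression, grouping compressed rows by level_id.
ALTER TABLE log_events SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'level_id'
);

-- Compress chunks automatically once they are older than seven days.
SELECT add_compression_policy('log_events', INTERVAL '7 days');
```

The table is still queried with ordinary SQL after compression, which is how the normalized schema stays simple while the storage layout changes underneath.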

Features such as compression allow time-series workloads to scale while maintaining the simplicity of normalized data. In sum, you should choose TimescaleDB over PostgreSQL to normalize your data if your use case ticks one of these boxes:

  • You are working with time-series data or any schema where time is a significant dimension.

  • You require long-term storage and efficient compression for normalized time-series datasets.

  • You frequently perform time-based aggregations or joins across normalized tables.

  • You want built-in automation for partitioning, retention policies, and continuous aggregates.

  • You need to scale time-series workloads while maintaining the logical simplicity of normalized schemas.

Conclusion

Data normalization is a necessary step in making databases maintainable, cost-effective, and performant. In this article, we explored the process of normalizing a database through the normal forms, the pros and cons of normalization, and when you might want to denormalize a table. 

Knowing when to normalize and when to denormalize is an important skill when designing databases, because it means balancing performance with maintainability. TimescaleDB strikes a balance by providing tools to keep the simplicity and maintainability of normalization while gaining the performance benefits of denormalization.

You can self-host TimescaleDB or leave the worries of managing data infrastructure behind with Timescale Cloud. Get it for free (no credit card required) for 30 days on AWS, Azure, or GCP.
