Written by Dylan Paulus
How we design and lay out our tables is one of the most important decisions in developing effective and efficient PostgreSQL databases. A database schema is the architecture or structure of the data in our database. It defines what tables we'll have, columns, data types, naming, and relationships between tables.
Database schemas are not to be confused with PostgreSQL schemas, which are a namespace mechanism for organizing objects such as tables within a database.
Database schemas are a broad concept, not PostgreSQL-specific. The choice of data types, the columns each table will have, and the relationships between tables significantly affect how fast we can query data, write data, and maintain our databases. In this article, we'll walk through two scenarios with different database schema designs and cover the approaches and trade-offs behind each.
Before we even start to design a database schema, we first need to know the requirements of our system. Do we need to optimize for a large amount of write operations? Does data rarely change, and can we optimize heavy read operations? How data gets read, maintained, and accessed will dramatically change our approach to data modeling and what trade-offs we're willing to take.
In previous articles, we explored common database schema design patterns using wide vs. medium vs. narrow tables and single vs. multiple tables. Both articles outline different advantages and disadvantages when designing a database schema.
To recap:
Narrow tables: They have few "data value columns" but can contain many metadata or categorizing columns. Narrow tables lend themselves to designs with many small tables.
Wide tables: They have many data value columns. These tables can easily have 200+ columns containing different data points. Wide tables will generally hold all data points, so we will have a single table or only a few tables.
Medium tables: These are a happy medium between narrow and wide tables.
Similar techniques apply to database normalization. Normalization is the process of organizing data and relationships to reduce redundancy and data inconsistencies. It is important to follow normalization to reduce complexity and increase maintainability, but there are cases when we would want to denormalize our tables.
When joins are costly and read operations need to be fast, a single wide table holding all the data we need can be faster to query than many normalized narrow tables. Narrow vs. wide tables, normalization vs. denormalization, and single vs. multiple tables are all tools in our toolbox when designing database schemas.
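To make the recap concrete, here is a minimal sketch of the same kind of data modeled both ways. The table and column names (sensors, humidity_readings, sensor_readings_wide) are hypothetical and exist only for illustration.

```sql
-- Normalized, narrow layout: one small table per entity or metric.
CREATE TABLE sensors (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE humidity_readings (
    created   TIMESTAMP WITH TIME ZONE NOT NULL,
    sensor_id INTEGER NOT NULL REFERENCES sensors (id),
    humidity  DOUBLE PRECISION
);

-- Denormalized, wide layout: every data point lives in one table,
-- and the sensor name is repeated on every row.
CREATE TABLE sensor_readings_wide (
    created     TIMESTAMP WITH TIME ZONE NOT NULL,
    sensor_name TEXT NOT NULL,
    humidity    DOUBLE PRECISION,
    pressure    DOUBLE PRECISION
);

-- Reading from the narrow layout needs a join to resolve the sensor name.
SELECT s.name, h.created, h.humidity
FROM humidity_readings AS h
JOIN sensors AS s ON s.id = h.sensor_id;

-- Reading from the wide layout needs no join.
SELECT sensor_name, created, humidity
FROM sensor_readings_wide;
```

The narrow layout stores each sensor name once, while the wide layout trades that duplication for simpler reads.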
We will walk through two scenarios to solidify our approach to designing database schemas. Each will have a different set of requirements, solutions, and pitfalls:
1. We'll start with background on the type of data we'll model.
2. We'll create a database schema based on the background given.
3. The scenarios will add new information or "feature creep" that changes the schema.
4. We'll talk about the pros and cons of the approach taken.
We have an application that reads temperature data from weather stations worldwide every minute. Customers use the application to get the hourly average temperature for a given city. Because this is time-series data, it will be mostly append-only: rows will rarely be updated or deleted. In this case, the weather stations require a significant investment to measure different data points, so the schema will rarely change.
Based on the background information, we know customers will query our system by location when they look up the average temperature. Because of this, location will be the primary search parameter and may require an index. Our schema will also rarely change.
We know up-front what data each weather station is collecting (only temperature), so we will use a wide, single table to store all the data. A single table will keep queries faster than multiple tables, which would require joins. With these details in mind, we'll create a temperatures table containing four columns: created records the timestamp of the measurement, degrees holds the temperature in degrees, and city and country identify where the temperature was measured.
Our SQL would look like this:
CREATE TABLE temperatures (
created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
degrees DOUBLE PRECISION,
city TEXT NOT NULL,
country TEXT NOT NULL
);
CREATE INDEX idx_location ON temperatures (country, city);
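As a rough sketch of the read path described in the background, the hourly average for one city could be queried as follows. The city and country values and the one-day window are placeholder assumptions, not part of the original requirements.

```sql
-- Hourly average temperature for one city over the last day.
-- The equality filters on country and city match the column order
-- of the idx_location index.
SELECT date_trunc('hour', created) AS reading_hour,
       avg(degrees)                AS avg_degrees
FROM temperatures
WHERE country = 'Canada'            -- placeholder value
  AND city = 'Toronto'              -- placeholder value
  AND created >= now() - INTERVAL '1 day'
GROUP BY reading_hour
ORDER BY reading_hour;
```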
While inspecting one of the weather stations, an engineer found that each weather station can also measure wind speed and direction. We want to add this to our application. We'll continue the wide, single-table approach to include wind speed and direction in our temperature readings for ease of use and because our sensor schema stays relatively consistent.
Let's evolve the temperatures schema to include wind_speed and wind_direction columns:
CREATE TABLE temperatures (
created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
degrees DOUBLE PRECISION,
city TEXT NOT NULL,
country TEXT NOT NULL,
wind_speed DOUBLE PRECISION,
wind_direction TEXT -- N, NE, E, SE, S, SW, W, NW
);
CREATE INDEX idx_location ON temperatures (country, city);
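If the temperatures table from the first version already exists and holds data, recreating it is not an option. One possible way to evolve it in place is sketched below; the CHECK constraint and its name chk_wind_direction are assumptions added for illustration, based on the compass values listed in the comment above.

```sql
-- Add the new measurements to the existing table. Both columns are
-- nullable, so rows written before the change remain valid.
ALTER TABLE temperatures
    ADD COLUMN wind_speed DOUBLE PRECISION,
    ADD COLUMN wind_direction TEXT;

-- Optional (assumption): restrict wind_direction to the eight compass
-- points. NULL values still pass the check.
ALTER TABLE temperatures
    ADD CONSTRAINT chk_wind_direction
    CHECK (wind_direction IN ('N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW'));
```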
As with every schema design, our implementation has pros and cons. In this example, we prioritized quick reads and inserts by having a single table containing all the data. A single table is also easier to maintain.
Pros
We have very few tables to maintain, lowering our system complexity and ensuring our queries are fast.
It is very easy to correlate wind measurements and temperatures for a given city.
For example, with separate tables, we would need to watch for the created timestamps differing between tables.
Cons
As we add data to the temperatures table, the context or the "model" of the table changes; this is something we would need to address as our schema changes.
For example, the table name temperatures doesn't match the data in the table anymore.
If the database schema were to become volatile, meaning we need to add or remove columns frequently, a single table can quickly become hard to maintain.
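One possible way to address the naming mismatch is to rename the table and leave a view behind under the old name so existing readers keep working. This is only a sketch: the new name weather_readings is hypothetical, and whether a compatibility view is appropriate depends on how the application accesses the table.

```sql
BEGIN;

-- Rename the table to reflect that it now holds more than temperatures.
-- Indexes such as idx_location stay attached to the renamed table.
ALTER TABLE temperatures RENAME TO weather_readings;

-- Keep the old name available for existing queries.
CREATE VIEW temperatures AS
SELECT created, degrees, city, country
FROM weather_readings;

COMMIT;
```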
As the lead engineer, you get tasked with creating a system for handling the company's website analytics. The data gets read into reports, which get updated once per day.
Though there are few read queries to the database, the website is constantly writing analytics to monitor and ensure customers have a smooth experience. Requirements are continually changing—stakeholders want new insights into how the website is performing and how customers are using the website.
The biggest phrase to pick up on in this scenario is "requirements are continually changing." We know the database schema will rapidly evolve, so we want complete flexibility in the database schema. To do this, we will use the narrow, many-table approach. Normalizing our tables will help keep duplication out of the reports, keep inserts fast, and let us easily add and drop metrics as needed. Until there is more information on how reports are generated, we won't add an index to the tables. Prematurely adding an index takes up extra storage and adds overhead to write operations.
Our SQL would look like this:
CREATE TABLE pages (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT
);
CREATE TABLE page_load_speed (
created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
time_ms DOUBLE PRECISION,
page_id UUID NOT NULL,
CONSTRAINT fk_pages FOREIGN KEY(page_id) REFERENCES pages(id)
);
CREATE TABLE time_to_first_paint (
created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
time_ms DOUBLE PRECISION,
page_id UUID NOT NULL,
CONSTRAINT fk_pages FOREIGN KEY(page_id) REFERENCES pages(id)
);
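To illustrate the flexibility this layout is meant to provide, here is a sketch of adding a new metric as its own table, recording one sample, and retiring the metric later. The metric name time_to_interactive and the sample values are hypothetical, and the foreign key assumes pages.id is declared as a primary key (or otherwise unique).

```sql
-- A new metric only needs a new narrow table; existing tables are untouched.
CREATE TABLE time_to_interactive (
    created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    time_ms DOUBLE PRECISION,
    page_id UUID NOT NULL,
    CONSTRAINT fk_pages FOREIGN KEY (page_id) REFERENCES pages (id)
);

-- Register a page and record one sample for it in a single statement.
WITH new_page AS (
    INSERT INTO pages (name)
    VALUES ('/pricing')              -- placeholder page name
    RETURNING id
)
INSERT INTO time_to_interactive (time_ms, page_id)
SELECT 1250.0, id
FROM new_page;

-- Retiring the metric later is equally local.
DROP TABLE time_to_interactive;
```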
With the success of our analytics application, the company has decided to start offering it as a service to external users. We need to add multi-tenancy to the tables to support multiple customers. To do this, we will update each analytics table to include a customer_id column so that we can query for customer-specific analytics.
Updating the original CREATE TABLE statements, the new schema would look like this:
CREATE TABLE customers (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT
);
CREATE TABLE pages (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT,
customer_id UUID NOT NULL,
CONSTRAINT fk_customer FOREIGN KEY(customer_id) REFERENCES customers(id)
);
CREATE TABLE page_load_speed (
created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
time_ms DOUBLE PRECISION,
page_id UUID NOT NULL,
customer_id UUID NOT NULL,
CONSTRAINT fk_pages FOREIGN KEY(page_id) REFERENCES pages(id),
CONSTRAINT fk_customer FOREIGN KEY(customer_id) REFERENCES customers(id)
);
CREATE TABLE time_to_first_paint (
created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
time_ms DOUBLE PRECISION,
page_id UUID NOT NULL,
customer_id UUID NOT NULL,
CONSTRAINT fk_pages FOREIGN KEY(page_id) REFERENCES pages(id),
CONSTRAINT fk_customer FOREIGN KEY(customer_id) REFERENCES customers(id)
);
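With customer_id on every analytics table, a customer-specific report can filter on it directly. The following is a sketch only; the UUID is a placeholder, and the one-day window is an assumption based on reports being refreshed daily.

```sql
-- Average page load time per page for one customer over the last day.
SELECT p.name         AS page_name,
       avg(s.time_ms) AS avg_load_ms,
       count(*)       AS samples
FROM page_load_speed AS s
JOIN pages AS p ON p.id = s.page_id
WHERE s.customer_id = '00000000-0000-0000-0000-000000000001'  -- placeholder
  AND s.created >= now() - INTERVAL '1 day'
GROUP BY p.name
ORDER BY avg_load_ms DESC;
```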
So that our schema can be flexible enough to accommodate a wide array of analytic data, we chose to use many narrow tables. As we saw when we added multi-tenancy, one drawback of many narrow tables shows up with schema-wide changes: every table needs to be updated. If we have a large database, this can create a maintenance nightmare.
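For tables that were created with the original schema and already contain data, the change would be a migration rather than a new CREATE TABLE. One possible shape for a single metric table is sketched below; it assumes pages.customer_id has already been added and populated, and the same steps would need to be repeated for every other metric table.

```sql
BEGIN;

-- 1. Add the column as nullable so existing rows remain valid.
ALTER TABLE page_load_speed ADD COLUMN customer_id UUID;

-- 2. Backfill from the owning page (assumes pages.customer_id is populated).
UPDATE page_load_speed AS m
SET customer_id = p.customer_id
FROM pages AS p
WHERE p.id = m.page_id;

-- 3. Enforce the same rules as the new schema.
ALTER TABLE page_load_speed
    ALTER COLUMN customer_id SET NOT NULL;

ALTER TABLE page_load_speed
    ADD CONSTRAINT fk_customer
    FOREIGN KEY (customer_id) REFERENCES customers (id);

COMMIT;
```

The backfill in step 2 rewrites every row in the table, which is where most of the cost comes from on large tables.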
Pros
The schema is flexible, and we can easily add new metrics by creating new tables.
The data is normalized, reducing duplication in our data set.
Cons
Queries may grow out of control if we need to join against a lot of tables to gather metrics (e.g., the Cartesian explosion problem).
Having many small tables can add operational complexity to a system.
In our example, if the page_load_speed, time_to_first_paint, and pages tables each had one million rows, it would take quite a while to update each table individually to include customer_id.
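The join concern above can be sketched with the two metric tables from this scenario. Joining both directly to pages pairs every load-speed row with every first-paint row for the same page, so a page with N and M samples produces N x M intermediate rows, and any sums or counts computed over them are inflated. One common way to avoid this, shown here as an illustration rather than the only option, is to aggregate each metric first and join the small results.

```sql
-- Aggregate each narrow table down to one row per page before joining.
WITH load_stats AS (
    SELECT page_id,
           avg(time_ms) AS avg_load_ms,
           count(*)     AS load_samples
    FROM page_load_speed
    GROUP BY page_id
),
paint_stats AS (
    SELECT page_id,
           avg(time_ms) AS avg_paint_ms,
           count(*)     AS paint_samples
    FROM time_to_first_paint
    GROUP BY page_id
)
SELECT p.name,
       l.avg_load_ms,
       l.load_samples,
       f.avg_paint_ms,
       f.paint_samples
FROM pages AS p
LEFT JOIN load_stats  AS l ON l.page_id = p.id
LEFT JOIN paint_stats AS f ON f.page_id = p.id;
```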
In this article, we looked at two different scenarios and examples while designing database schemas. We also saw how to apply wide vs. narrow schema design and single vs. multiple tables. Not every database schema is perfect. Systems constantly evolve, and our approach to database schemas needs to adapt.
Knowing what your system needs and balancing the trade-offs is crucial. By keeping a handle on system requirements, we can craft database schemas that fit the bill today and handle new demands as they arise.
If you want to start designing your hypertable database schema as soon as possible, ensuring you get the best performance and user experience for your time-series data while achieving incredible compression ratios, create a free Timescale account.