Nov 13, 2023
Posted by Ryan Booz
Before we dive into Postgres data management, let's start with the very basics on why we love Postgres. PostgreSQL is an advanced, open-source relational database management system (RDBMS) that was designed to handle a wide range of workloads, from single-machine applications to web services with many concurrent users. It's distinguished by its robustness, scalability, and technical standards compliance—no wonder it has become the database of choice for developers.
Developed by the PostgreSQL Global Development Group, a diverse group of volunteer developers from around the world, PostgreSQL comprises a sophisticated feature set that includes complex queries, multi-version concurrency control for high performance, and robust support for SQL standards.
It also supports a variety of data types, such as JSON, XML, and arrays, as well as user-defined types, making it incredibly flexible for developers.
But, one of PostgreSQL's most appreciated features has to be its extensibility: it allows for custom functions written in different programming languages like Python, Java, or C/C++. This means you can extend the core functionality of PostgreSQL to suit your specific use cases.
Plus, PostgreSQL has a strong reputation for reliability, data integrity, and correctness. It employs a comprehensive set of replication and recovery mechanisms, ensuring your data remains safe and available. And, in terms of practical benefits for developers, its open-source nature means it's free to use and modify, boasting a vibrant community that offers a wealth of resources and support.
For all the above reasons (and more, just think about its great user experience!), we at Timescale believe that PostgreSQL is a great tool for data management. However, it needs some tweaks to accommodate the overwhelming amounts of data currently produced by applications, namely time-series data, i.e., any data that has a timestamp associated with it (order history, energy metrics, financial transactions, sensor data, and the list goes on and on).
In this post, we'll look at the challenges that time-series data poses to today's data management process, and we'll see how to fix them using trusty Postgres.
Time-series data is relentless. It grows at an ever-increasing rate and quickly becomes expensive and unwieldy to store and query. Managing your time-series data is essential in building high-performance and cost-effective applications. Built on PostgreSQL, Timescale provides the tools you need to maintain storage and query performance—without requiring you to delete all of your data—and keep tight control over your costs.
So, what does it mean to manage your time-series data? In many ways, it's similar to what application developers call Application Lifecycle Management, the continuous process of planning, developing, testing, and improving software applications. Having a continuous feedback loop in place allows a team to adjust requirements and deliver exceptional application experiences month after month.
The same is true of data management. Too often, companies choose the easiest path to launch a product, acquire and store data, and start to produce charts and actionable information... only to find that performance crumbles and costs skyrocket after the beta phase is over.
If your database solution doesn't provide the tools to help you proactively manage the entire lifecycle, from the moment data is ingested to when data must be archived or deleted for good, it's going to be hard to play catch-up when the deluge of time-series data pours in.
Managing the time-series data lifecycle will enable you to reduce costs and improve performance.
Time-series data adds another layer of complexity. Time-series data comes at you fast, sometimes generating millions of data points per second. In order to measure everything that matters, you need to capture all of the data you possibly can. But, storing and maintaining that data at scale can be difficult, whether it’s for cost, query performance, or a myriad of other reasons.
Managing data at scale is a complex task, particularly when you don't know what to expect with regard to long-term cost and performance. Finding a database that can help you build a complete solution to manage the data that matters most can certainly be challenging. Any solution, whether SQL or NoSQL, needs to effectively provide tooling to manage each phase of the lifecycle.
There are five steps to data lifecycle management:
Ingest and store data efficiently. When it comes to time-series data, storing data as quickly as you can and as efficiently as possible is key to success. Not all databases can effectively give you both.
Query recent data regularly. Once your data is stored, you need to query it efficiently to satisfy the business requirements of your application or monitoring solution. All database technologies have some form of a query planner that translates your query into “instructions”: fetch the data, aggregate it, and return it in the form requested. And regardless of technology, queries always work best when the data you’re working with fits into fast, physical memory.
Create aggregated historical rollups. As you ingest more and more time-series data, it’s common to want to analyze your data over longer and longer ranges of time (e.g., to assess trends, plan future resource needs, etc.). In most cases, this becomes untenable as raw data increases in cardinality and size because a query may need to process tens or hundreds of millions of rows.
Archive/data tier older raw data. One of the hardest parts of managing massive amounts of data is knowing how to save costs as time goes by without sacrificing the performance of near-time data or access to historical data. Typically, you can accomplish this with data storage tiering and possibly physically archiving raw data for some period of time (even if it's not immediately accessible at query time).
Drop raw data after some predefined interval passes. Sooner or later, data in its raw form becomes less and less valuable to look at in the context of what's happening today. It also slows down queries and increases storage costs. Eventually, you need an effective way to drop the raw data from your database. Once again, not all databases are created equal for such an essential task.
Data lifecycle management is even more important and more difficult when it comes to time-series data.
Time-series data is a sequence of data points collected over time intervals, giving us the ability to track changes over time. Time-series data can track changes over milliseconds, days, or even years. As we’ve discussed before, time-series data has widespread applicability in many problem domains and industries. In all these applications, time is the primary axis on which data is stored.
Time-series data has a way of overwhelming you quickly. When it comes to the first two aspects of managing your time-series data (ingesting and querying raw data), we've often shown that TimescaleDB excels at ingesting data quickly and providing the features to query raw data efficiently.
(See our benchmarks for reference: TimescaleDB vs. Amazon Timestream: 6,000x higher inserts, 5-175x faster queries, 150x-220x cheaper; TimescaleDB vs. InfluxDB: Purpose-built differently for time-series data; TimescaleDB vs. MongoDB: 260% higher insert performance, up to 54x faster queries, and simpler implementation).
Thus, in order to maintain efficient query performance and keep tight control over costs, you need a time-series database with complete data management capabilities.
When we built TimescaleDB, we knew that ingesting and storing data at petabyte scale was only one critical part of an overall time-series solution. For all the reasons described above, we knew that a category-defining database for time-series data needed category-defining features for managing time-series data. From the architecture of TimescaleDB itself to numerous features purpose-built for managing time-series data, we’ve made sure that everything you need is in one (easy-to-use) product.
Hypertables are the core foundation of the TimescaleDB architecture and, unsurprisingly, enable much of the functionality for time-series data management. What are hypertables? From a user's perspective, TimescaleDB exposes what looks like an ordinary, singular table through a feature called hypertables. A hypertable is the primary point of interaction with your data, providing the standard table abstraction that you can query via standard SQL.
Virtually all user interactions with TimescaleDB are with hypertables. Inserting, updating, or deleting data, querying data via SELECTs, altering tables, adding new columns or indexes, JOINs with other tables or hypertables, and so forth can (and should) all be executed on the hypertable.
However, a hypertable is really an abstraction, a virtual view over many individual tables, called chunks, that actually store the data.
Hypertables are particularly well suited for time-series data management because they already efficiently partition all incoming time-series data into smaller, more manageable sets of data. The smaller chunks of data that are created have special time-based properties that allow TimescaleDB to understand the range of data that exists in each chunk and effectively maintain each individual chunk rather than one ever-growing table. As you'll see, this small detail provides some huge benefits when working with time-series data.
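To make this concrete, here's a minimal sketch of creating a hypertable. The conditions table, its columns, and the seven-day chunk interval are purely illustrative:

-- A regular PostgreSQL table with a time column (hypothetical schema)
CREATE TABLE conditions (
  time        TIMESTAMPTZ      NOT NULL,
  device_id   TEXT             NOT NULL,
  temperature DOUBLE PRECISION
);

-- Convert it into a hypertable partitioned on the time column into 7-day chunks
SELECT create_hypertable('conditions', 'time', chunk_time_interval => INTERVAL '7 days');

-- From here on, it behaves like any other table
INSERT INTO conditions VALUES (now(), 'device-1', 72.5);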
This all translates into fast data ingestion that can scale to millions of rows of data per second. (Relatedly, optimizing ingest rate is critical to many common PostgreSQL use cases—but it’s even more essential for time-series scenarios, where you retain all data vs. overwrite past values. See a few of our favorite ways to speed up PostgreSQL insert performance.)
Once an application has created a suitable schema and is ingesting large quantities of time-series data, the next hurdle most users face is how to store as much raw data as possible while still executing fast queries on data spanning long time intervals.
Many databases, both relational and NoSQL, provide some form of compression at the disk or document level. This saves physical space, which allows you to store more data, but querying the data still requires retrieving and scanning all rows (or properties). That is, the data itself hasn't been stored in a more query-efficient layout, so as time passes and you query longer periods of data, those queries will typically get slower.
In contrast, TimescaleDB provides two benefits when native columnar compression is applied to a hypertable.
First, TimescaleDB modifies compressed chunks so that data is now stored by columns and not rows. Doing this can often save you 94-97% or more on disk space by deploying best-in-class algorithms for compressing various types of data, including Gorilla compression for floats, delta-of-delta, and more. TimescaleDB automatically chooses the algorithm that is the best fit for your data.
Second, because the data is now stored as individual columns, it can increase the query performance for historical aggregate data. This is possible because of how we store the compressed data. For each segmentby column in your hypertable, TimescaleDB will (effectively) group up to 1,000 rows of data at a time, for each column, as an array. This allows TimescaleDB to efficiently retrieve individual columns of data for long time periods.
As an example, let's assume that the cpu hypertable is partitioned into chunks every seven (7) days (what TimescaleDB calls the chunk_time_interval). If we store one reading every minute for 1,000 devices, a chunk would store approximately 10,000 rows per device over those seven days (~10 million rows in total per chunk).
Without compression, the example query below would have to scan ~30 million rows of data (likely retrieving all columns) to compute the average CPU usage.
SELECT time_bucket('1 hour', time) AS bucket,
       avg(cpu_usage) AS cpu_avg
FROM cpu
WHERE time <= now() - INTERVAL '7 days'
  AND time > now() - INTERVAL '1 month'
GROUP BY bucket;
After compression, however, the same query would only have to scan ~30 thousand rows (about 0.1% of the rows the uncompressed query touches), and only for the columns it needs. This is significantly less work and can dramatically improve response times for some queries.
With native compression, you can store more data for longer periods at a lower cost, providing you greater flexibility on how you maintain your data. And, for some of your queries, the columnar format could increase the performance of your queries on older data.
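To give a rough idea of what enabling this looks like (a sketch only: the cpu hypertable matches the example above, but the device_id segmentby column is an assumption), you turn on compression for the hypertable and add a policy that compresses chunks once they reach a certain age:

-- Store compressed batches segmented by device_id and ordered by time
ALTER TABLE cpu SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id',
  timescaledb.compress_orderby = 'time DESC'
);

-- Automatically compress chunks whose data is older than seven days
SELECT add_compression_policy('cpu', INTERVAL '7 days');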
Performing analysis on time-series data often involves using aggregate functions to observe trends over time—functions like SUM(), AVG(), MIN(), MAX(), etc. These functions become slower over time as more and more data is aggregated simply because more data exists. Regardless of how big and fast your server may be, reading tens of millions of records to find the average hourly CPU usage over a month will still take time and resources.
One potential solution is the materialized view feature of PostgreSQL, which persists the results of the query to disk as if it were a regular table. This is better than a regular view in PostgreSQL that executes the underlying query every time the view is referenced (which doesn't reduce any server load over time). The main limitation of materialized views is that they require additional development work to automatically stay up-to-date (e.g., utilizing triggers and stored procedures).
In contrast, TimescaleDB provides a feature called continuous aggregates that enhances materialized views so that the aggregated data is intelligently kept up-to-date, refreshing, on a schedule of your choice, only the portions of the view where the underlying data has changed. If values were updated or deleted, only the time range that was modified is recomputed.
This is a powerful feature that can improve the performance of historical queries. Additionally, because the data is persisted to disk, it can be used as a long-term store of the aggregated data. Even when older raw data is eventually dropped from the hypertable, the continuous aggregate retains the historical aggregation of that data.
Therefore, continuous aggregates provide multiple benefits in the context of your data lifecycle. They provide an efficient way to query aggregated data (reducing query load on the raw data), and they provide long-term storage of aggregated data decoupled from any modifications to the underlying raw data.
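As a sketch of what this looks like (reusing the hypothetical cpu table and device_id column from the compression example), you define the aggregate query once and attach a refresh policy:

-- Hourly rollup of raw CPU readings, maintained incrementally by TimescaleDB
CREATE MATERIALIZED VIEW cpu_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       device_id,
       avg(cpu_usage) AS cpu_avg
FROM cpu
GROUP BY bucket, device_id;

-- Refresh only the recent window every hour; older buckets remain as materialized
SELECT add_continuous_aggregate_policy('cpu_hourly',
  start_offset => INTERVAL '3 days',
  end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');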
Even with features like native compression and continuous aggregates, there are still times when your application needs access to raw data for very long periods of time, even if that means older queries take longer. In most databases or data lakes, this is a feature known as data tiering; in Timescale, we call our unified storage backend Tiered Storage.
Essentially, with Tiered Storage, you can move data to cheaper (and often slower) storage once it has reached some age threshold. The data isn't specifically archived at this point, but in many cases, this is seen as the first step toward archiving raw data.
In practice, this means you can keep scaling PostgreSQL by moving your older data to low-cost object storage backed by Amazon S3. This feature is only available in our fully managed cloud solution and trial services: start a Timescale trial and get full access to the platform for 30 days, no credit card required.
If you're self-hosting, TimescaleDB works in tandem with the PostgreSQL feature known as tablespaces. Postgres tablespaces allow you to specify different locations on disk where PostgreSQL can store the data files that make up your database. When used with different classes of storage, administrators can implement policies that move older data onto cheaper (but slower) disks to save money, although queries on older data will likely be slower.
For self-hosted TimescaleDB, you can use tablespaces in conjunction with the move_chunk function to move data and indexes over time.
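As an example (a sketch only: the tablespace path, chunk name, and index name below are hypothetical), you might create a tablespace on a cheaper volume and move an older chunk, along with its indexes, onto it:

-- A tablespace on cheaper, slower storage (path is hypothetical)
CREATE TABLESPACE history_space LOCATION '/mnt/slow_disk/pgdata';

-- Move one older chunk and its indexes, reordering rows by the given index
SELECT move_chunk(
  chunk => '_timescaledb_internal._hyper_1_4_chunk',
  destination_tablespace => 'history_space',
  index_destination_tablespace => 'history_space',
  reorder_index => '_timescaledb_internal._hyper_1_4_chunk_conditions_time_idx'
);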
Once you've used compression to store more data for longer periods of time, created continuous aggregates to improve query performance, and maybe even tiered some data from your growing time-series dataset, a well-planned strategy for managing time-series data requires that you consider when it's appropriate to physically drop raw data from your database. At some point, querying older raw data on a regular basis will take additional resources and provide decreasing value with time-series data.
TimescaleDB provides built-in functionality to set up data retention policies. With a straightforward API, users can set up a policy that will drop chunks from a hypertable once the data inside of that chunk reaches a certain age. Because TimescaleDB knows the minimum and maximum timestamp for every chunk, it can efficiently find chunks that can be safely dropped from the database.
Dropping chunks as the primary function of data retention also helps overcome common problems in PostgreSQL (and other relational databases) when the alternative is issuing a DELETE for some range of data.
Deleting rows of data is slow, prone to locking issues, negatively impacts index maintenance, and requires more work through the use of VACUUM. With hypertable chunks, however, all indexes are local to that child table, so no index maintenance is needed, and the operation is as fast as dropping a regular PostgreSQL table—nearly instantaneous.
Setting up a data retention policy is done using the APIs that come with TimescaleDB. If you have a hypertable called conditions and want to drop chunks of data once they are six months old, this SQL query will create the automated policy.
SELECT add_retention_policy('conditions', INTERVAL '6 months', true);
Once you execute this command, TimescaleDB will begin checking once a day for chunks that can be dropped.
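If you'd like to see which chunks a policy would affect (or drop them manually instead of waiting for the scheduler), show_chunks and drop_chunks accept the same older_than argument:

-- Preview the chunks holding data older than six months
SELECT show_chunks('conditions', older_than => INTERVAL '6 months');

-- Or drop them immediately
SELECT drop_chunks('conditions', older_than => INTERVAL '6 months');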
The final piece of a solid strategy for managing your time-series data is customization—having the ability to fine-tune existing schedules or craft new actions to deal with your data in a specific way that the built-in TimescaleDB policies don't currently support. If you use vanilla PostgreSQL or another time-series database, job scheduling is not typically available without installing additional extensions.
TimescaleDB, on the other hand, provides direct access to its internal job scheduling framework through a feature called user-defined actions, enabling you to write and implement fully customizable procedures to work with your data on your terms. This unlocks a world of possibilities for using all of the principles of managing time-series data together to craft just the right level of access, compression, downsampling, and archiving of data.
With user-defined actions, you create a stored procedure that accepts a job ID and a generic config parameter of type JSONB. Within that config object, you can pass any parameters your script needs to work with.
Let's look at a few examples that we provide in our documentation.
With this example, we create a short stored procedure to drop chunks from all hypertables after a configurable interval. This has the advantage of not requiring a data retention policy for each hypertable as shown earlier in this article. In this example, the script sets one parameter from the config object, drop_after, which is the interval to use for dropping chunks.
CREATE OR REPLACE PROCEDURE generic_retention (job_id int, config jsonb)
LANGUAGE PLPGSQL
AS $$
DECLARE
  drop_after interval;
BEGIN
  -- Read the retention interval from the job's JSONB config
  SELECT jsonb_object_field_text(config, 'drop_after')::interval INTO STRICT drop_after;

  IF drop_after IS NULL THEN
    RAISE EXCEPTION 'Config must have drop_after';
  END IF;

  -- Drop chunks older than drop_after from every hypertable
  PERFORM drop_chunks(format('%I.%I', hypertable_schema, hypertable_name), older_than => drop_after)
    FROM timescaledb_information.hypertables;
END
$$;
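To run this procedure as a background job, register it with add_job; the daily schedule and 12-month drop_after value here are just illustrative:

-- Check all hypertables once a day and drop chunks older than 12 months
SELECT add_job('generic_retention', '1d', config => '{"drop_after":"12 month"}');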
This script could be modified for any number of data retention scenarios specific to your use case. For instance, if you have different schemas for different clients or different types of data, you could provide configuration parameters that specify unique drop_after properties for each schema.
Another example we provide in our documentation relates to one of the tenets of data lifecycle management we discussed above—archiving old data.
This sample stored procedure reads three parameters from the config JSONB input: hypertable, lag, and tablespace. These parameters are used to find chunks that are older than the specified interval (lag) in order to move them to the specified tablespace using standard ALTER TABLE syntax.
CREATE OR REPLACE PROCEDURE custom_move_chunks (job_id int, config jsonb)
LANGUAGE PLPGSQL
AS $$
DECLARE
  ht REGCLASS;
  lag interval;
  destination name;
  chunk REGCLASS;
BEGIN
  -- Read the hypertable, age threshold, and destination tablespace from the job config
  SELECT jsonb_object_field_text(config, 'hypertable')::regclass INTO STRICT ht;
  SELECT jsonb_object_field_text(config, 'lag')::interval INTO STRICT lag;
  SELECT jsonb_object_field_text(config, 'tablespace') INTO STRICT destination;

  IF ht IS NULL OR lag IS NULL OR destination IS NULL THEN
    RAISE EXCEPTION 'Config must have hypertable, lag and destination';
  END IF;

  -- Loop over chunks older than lag that are assigned to a tablespace other than the destination
  FOR chunk IN
    SELECT show.oid
    FROM show_chunks(ht, older_than => lag) SHOW (oid)
    INNER JOIN pg_class pgc ON pgc.oid = show.oid
    INNER JOIN pg_tablespace pgts ON pgts.oid = pgc.reltablespace
    WHERE pgts.spcname != destination
  LOOP
    RAISE NOTICE 'Moving chunk: %', chunk::text;
    EXECUTE format('ALTER TABLE %s SET TABLESPACE %I;', chunk, destination);
  END LOOP;
END
$$;
Once created, we can schedule it to be run once a day with the following SQL query.
SELECT add_job('custom_move_chunks', '1d', config => '{"hypertable":"metrics","lag":"12 month","tablespace":"old_chunks"}');
Again, as you look at this sample stored procedure, there are numerous ways that you could modify this procedure to fit your data archiving requirements.
For instance, you might create a table to log every time a chunk is moved to cheaper storage. This information could be used to bill a customer differently for different tiers of storage, or it might be valuable for customer support if a customer reports slower query results (i.e., is the slower query time because data lives in a “slow” data tier?).
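As a sketch of that idea (the table and column names are hypothetical), you could create a small audit table and add an INSERT inside the procedure's loop, right after the ALTER TABLE statement:

-- Hypothetical audit table for chunk moves
CREATE TABLE chunk_move_log (
  chunk_name TEXT        NOT NULL,
  tablespace NAME        NOT NULL,
  moved_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Inside the loop of custom_move_chunks, after the ALTER TABLE:
INSERT INTO chunk_move_log (chunk_name, tablespace)
VALUES (chunk::text, destination);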
With user-defined actions, you create the workflows you need to manage and track changes in your TimescaleDB database.
Creating and consuming data has never been easier, especially time-series data. Development teams should create a plan to proactively manage the amount of time-series data that is stored, downsampled, archived, and eventually dropped from the database—before the amount of data becomes overwhelming. This will enable you to maintain high performance and better control costs.
With TimescaleDB, the tools you need to create a robust plan are available immediately, out-of-the-box, for free. And if you don’t see the specific functionality or API that your situation requires, you can use user-defined actions to create policies and manage your data exactly as you need.
If you’re new to TimescaleDB, you can try TimescaleDB for free with our hosted service or download it and run it locally or in your own cloud infrastructure.
If you’re an existing user, you have access to all of the above functionality automatically, assuming you’re running the latest version. Need to update? See our documentation for upgrade instructions.
From there, join our Slack community to ask questions and meet other time-series data enthusiasts around the world (you’ll find other developer advocates, Timescale engineers, and our amazing community members active in all channels).