Sep 06, 2024
Posted by
Ryan Booz
As the makers of TimescaleDB, we often need to quickly create lots of sample time-series data to demonstrate a new database feature, run a benchmark, or talk about use cases internally. We also see users in our community Slack asking how they can test a feature or decide if their data is a good fit for TimescaleDB.
📝 If you want to know why people want to use Timescale, check this out.
Although using real data from your current application would be great (and ideal), knowing how to quickly create a representative time-series dataset using varying cardinalities and different lengths of time is a helpful - and advantageous - skill to have.
Fortunately PostgreSQL provides a built-in function to help us create this sample data using the SQL that we already know and love - no external tools required.
In this three-part blog series, we'll go through a few ways to use the generate_series()
function to create large datasets in PostgreSQL, including:
generate_series()
is and how to use it for basic data generation By the end of this series, you'll be ready to test almost any PostgreSQL or TimescaleDB feature, create quick datasets for general testing, meetup presentations, demos, and more!
generate_series()
is a built-in PostgreSQL function that makes it easy to create ordered tables of numbers or dates. The PostgreSQL documentation calls it a Set Returning Function because it can return more than one row.
The function is simple to start, taking at least two required arguments to specify the start and stop parameters for the generated data.
SELECT * FROM generate_series(1,5);
This will produce output that looks like this:
generate_series|
---------------|
1|
2|
3|
4|
5|
This first example shows that generate_series()
returns sequential numbers between a start parameter and a stop parameter. When used to generate numeric data, generate_series()
a will increment the values by 1. However, an optional third parameter can be used to specify the increment length, known as the step parameter.
For example, if we wanted to generate rows that counted by two from 0 to 10, we could use this SQL instead:
SELECT * from generate_series(0,10,2);
generate_series
-----------------
0
2
4
6
8
10
Using generate_series()
to produce a range of dates is equally straightforward. The only difference is that the third parameter, the step INTERVAL
used to increment the date, is required for date generation, as shown here:
SELECT * from generate_series(
'2021-01-01',
'2021-01-02', INTERVAL '1 hour'
);
generate_series
------------------------
2021-01-01 00:00:00+00
2021-01-01 01:00:00+00
2021-01-01 02:00:00+00
2021-01-01 03:00:00+00
2021-01-01 04:00:00+00
2021-01-01 05:00:00+00
2021-01-01 06:00:00+00
2021-01-01 07:00:00+00
2021-01-01 08:00:00+00
2021-01-01 09:00:00+00
2021-01-01 10:00:00+00
2021-01-01 11:00:00+00
2021-01-01 12:00:00+00
2021-01-01 13:00:00+00
2021-01-01 14:00:00+00
2021-01-01 15:00:00+00
2021-01-01 16:00:00+00
2021-01-01 17:00:00+00
2021-01-01 18:00:00+00
2021-01-01 19:00:00+00
2021-01-01 20:00:00+00
2021-01-01 21:00:00+00
2021-01-01 22:00:00+00
2021-01-01 23:00:00+00
2021-01-02 00:00:00+00
(25 rows)
Notice that the returned dates are inclusive of the start and stop values, just as we saw with the numeric example before. The reason we got 25 rows (representing 25 hours rather than 24 as you might expect) is that the stop value can be reached using the equal one-hour INTERVAL
(the step parameter). As long as the INTERVAL
can increment evenly up to the stop date, it will be included.
However, if the step interval resulted in the stop value being skipped over, it will not be included in your output. For example, if we modify the step INTERVAL
above to '1 hour 25 minutes', the result only returns 17 rows, the last of which is before the stop value.
SELECT * from generate_series(
'2021-01-01','2021-01-02',
INTERVAL '1 hour 25 minutes'
);
generate_series
------------------------
2021-01-01 00:00:00+00
2021-01-01 01:25:00+00
2021-01-01 02:50:00+00
2021-01-01 04:15:00+00
2021-01-01 05:40:00+00
2021-01-01 07:05:00+00
2021-01-01 08:30:00+00
2021-01-01 09:55:00+00
2021-01-01 11:20:00+00
2021-01-01 12:45:00+00
2021-01-01 14:10:00+00
2021-01-01 15:35:00+00
2021-01-01 17:00:00+00
2021-01-01 18:25:00+00
2021-01-01 19:50:00+00
2021-01-01 21:15:00+00
2021-01-01 22:40:00+00
(17 rows)
Now that we understand how to use generate_series()
, how do we create some time-series data to insert into TimescaleDB for testing and visualization?
To do this, we utilize a standard feature of relational databases and SQL. Recall that generate_series()
is a Set Returning Function that returns a "table" of data (a set) just as if we had selected it from a table. Therefore, just like when we select data from a regular table with SQL, we can add more columns of data using other functions or static values.
You may have done this at some point with a string of text that needed to be repeated for each returned row.
SELECT 'Hello Timescale!' as myStr, * FROM generate_series(1,5);
myStr | generate_series
------------------+-----------------
Hello Timescale! | 1
Hello Timescale! | 2
Hello Timescale! | 3
Hello Timescale! | 4
Hello Timescale! | 5
(5 rows)
In this example, we simply added data to the rows being returned from generate_series()
. For every row it returned, we added a column with some static text.
But, these added columns don't have to be static data! Using the built-in random()
function (for example), we can generate data that starts to look a little more like data we'd see when monitoring computer CPU values (i.e., realistic data to use for our demos and tests!).
SELECT random()*100 as CPU, * FROM generate_series(1,5);
cpu | generate_series
--------------------+-----------------
48.905450626783775 | 1
71.94031820213382 | 2
25.210553719011486 | 3
19.24163308357194 | 4
8.434915599133674 | 5
(5 rows)
In this example, generate_series()
produced five rows of data; for every row, PostgreSQL also executed the random()
function to produce a value. With a little creativity, this can become a pretty efficient method for generating lots of data quickly.
Before getting too deep into the many ways we can use various functions with generate_series()
to create more realistic data, let's tackle one potential downside with functions before your mind starts thinking about a myriad of ways to produce interesting, sample data.
Functions are a powerful database tool, allowing complex code to be hidden behind a standard interface. In the example above, I don't have to know how random()
produces a value. I just have to call it in my SQL statement (without arguments) and a random value is returned for every row.
Unfortunately, functions can slow down your query and use lots of resources if you're not careful. This is true any time you call functions in SQL, not just when creating sample data.
The reason for this inefficiency is that scalar functions are executed once for each column and row in which they are used. So, if you produce a set of 1,000 rows with generate_series()
and then add on 5 additional columns with data generated by functions, PostgreSQL has to effectively process 5,001 function calls, once to generate the initial series of data which is returned as a set, and 5 times that for each row because of the additional 5 columns.
The main issue is that PostgreSQL has no idea how to determine how much "work" it will take to generate the result from each function in the query. Something like random()
required very minimal effort, so calling it 5,000 or even 1 million times won't break the bank for most machines. However, if the function you're calling generates lots of temporary data internally before producing a result, be aware that it could perform more slowly and consume more resources on your database server.
The takeaway here is that nothing comes for free, especially when functions are called once for every column, for every row. Most of the time the data you generate will easily complete in a handful of seconds for tens of millions of rows. But, if you notice a query that generates data taking a long time, consider finding alternative ways to generate data for the columns that require more resources.
There is a second caveat to be aware of when using a tool like generate_series()
. All data is created and inserted as a single transaction. You can put in more time and effort to create functions and methods for breaking up the work, but understand if you create (SELECT) one large set of data with 100 million rows, it's likely to consume a lot of memory and CPU on the server while the process occurs.
This doesn't mean it's broken or any less valid of a method for generating data. However, as we demonstrate additional ways to create more realistic data in parts 2 and 3 of this series, recognize that the more data you create, the more resources you'll need from the server without managing batches and transactions with a more advanced setup.
Alright, let's get back to the fun!
So far we've looked at how generate_series()
works and how you can use functions to add additional dynamic content to the output. But how do we quickly get to the scale of tens of millions of rows (or more)?
This is where the concept of a Cartesian product comes into play with databases. A Cartesian product (otherwise known as a CROSS JOIN) takes two or more result sets (rows from multiple tables or functions) and produces a new result set that contains rows equal to the count of the first set multiplied by the count of the second set.
In doing so, the database outputs every row from the first table with the value of the first row from the second table. It does this over and over until all rows in both tables have been iterated. (and yes, if you join more than two tables this way, the process just keeps multiplying, tables processed left to right)
Let's look at a small example using two generate_series()
in the same select statement.
SELECT * from generate_series(1,10) a, generate_series(1,2) b;
a |b|
--+-+
1|1|
2|1|
3|1|
4|1|
5|1|
6|1|
7|1|
8|1|
9|1|
10|1|
1|2|
2|2|
3|2|
4|2|
5|2|
6|2|
7|2|
8|2|
9|2|
10|2|
As you can see, PostgreSQL generated 10 rows for the first series and 2 rows for the second series. After processing and iterating over each table consecutively, the query produced a total of 20 rows (10 x 2). For generating lots of data quickly, this is about as easy as it gets!
The same process applies when using generate_series()
to return a set of dates. If you (effectively) CROSS JOIN multiple sets (numbers or dates), the total number of rows in the final set will be a product of all sets. Again, for generating sample time-series data, you'd be hard-pressed to find an easier method!
In this example, we generate 12 timestamps an hour apart, a random value representing CPU usage, and then a second series of four values that represent IDs for fake devices. This should produce 48 rows (eg. 12 timestamps x 4 device IDs = 48 rows).
SELECT time, device_id, random()*100 as cpu_usage
FROM generate_series(
'2021-01-01 00:00:00',
'2021-01-01 11:00:00',
INTERVAL '1 hour'
) as time,
generate_series(1,4) device_id;
time |device_id|cpu_usage |
-------------------+---------+-------------------+
2021-01-01 00:00:00| 1|0.35415126479989567|
2021-01-01 01:00:00| 1| 14.013393572770028|
2021-01-01 02:00:00| 1| 88.5015939122006|
2021-01-01 03:00:00| 1| 97.49037810105996|
2021-01-01 04:00:00| 1| 50.22781125586846|
2021-01-01 05:00:00| 1| 77.93431470586931|
2021-01-01 06:00:00| 1| 45.73481750582076|
2021-01-01 07:00:00| 1| 70.7999843735724|
2021-01-01 08:00:00| 1| 4.72949831884506|
2021-01-01 09:00:00| 1| 85.29122113229981|
2021-01-01 10:00:00| 1| 14.539664281598874|
2021-01-01 11:00:00| 1| 45.95244258556228|
2021-01-01 00:00:00| 2| 46.41196423062297|
2021-01-01 01:00:00| 2| 74.39903569177027|
2021-01-01 02:00:00| 2| 85.44087332221935|
2021-01-01 03:00:00| 2| 4.329394730750735|
2021-01-01 04:00:00| 2| 54.645873866589056|
2021-01-01 05:00:00| 2| 6.544334492894777|
2021-01-01 06:00:00| 2| 39.05071228953645|
2021-01-01 07:00:00| 2| 71.07264365438404|
2021-01-01 08:00:00| 2| 72.4732704336219|
2021-01-01 09:00:00| 2| 34.533280927542975|
2021-01-01 10:00:00| 2| 26.764760864598003|
2021-01-01 11:00:00| 2| 62.32048879645227|
2021-01-01 00:00:00| 3| 63.01888063314749|
2021-01-01 01:00:00| 3| 21.70606884856987|
2021-01-01 02:00:00| 3| 32.47610779097485|
2021-01-01 03:00:00| 3| 47.565982341726354|
2021-01-01 04:00:00| 3| 64.34867263419619|
2021-01-01 05:00:00| 3| 57.74424991855476|
2021-01-01 06:00:00| 3| 55.593286571750156|
2021-01-01 07:00:00| 3| 36.92650110894995|
2021-01-01 08:00:00| 3| 53.166926049881624|
2021-01-01 09:00:00| 3| 10.009505806123897|
2021-01-01 10:00:00| 3| 58.067700285561585|
2021-01-01 11:00:00| 3| 81.58883725078034|
2021-01-01 00:00:00| 4| 78.1768041898232|
2021-01-01 01:00:00| 4| 84.51505102850199|
2021-01-01 02:00:00| 4| 24.029611792753514|
2021-01-01 03:00:00| 4| 17.08996115345549|
2021-01-01 04:00:00| 4| 29.642690955760997|
2021-01-01 05:00:00| 4| 90.83844806413275|
2021-01-01 06:00:00| 4| 6.5019080489854275|
2021-01-01 07:00:00| 4| 32.336484070672|
2021-01-01 08:00:00| 4| 55.595524107963|
2021-01-01 09:00:00| 4| 97.5442141375293|
2021-01-01 10:00:00| 4| 37.0741925805568|
2021-01-01 11:00:00| 4| 19.093927249791776|
Now it's time to put all of the features and concepts together. We've seen how to use generate_series()
to create a sample table of data (both numbers and dates), add static and dynamic content to each row, and finally how to join multiple sets of data together to create a deterministic number of rows to create test data.
The final piece of the puzzle is to figure out how to create date ranges that are more dynamic using date math. Once again, PostgreSQL makes this straightforward because date math is handled automatically.
When generating sample data, you usually have an idea of the duration of time you want to generate—one month, six months, or a year—for instance. But calculating your use case's exact start and end timestamps can get pretty tedious.
Instead, it's often easier to use now()
or a static ending timestamp (ie. '2021-06-01 00:00:00'), and then using date math to get the starting timestamp based on your chosen interval. This makes it very easy to generate data for different durations simply by changing the interval. And to be clear, you can go the other direction (picking a start timestamp and adding time), but that can often lead to data in the future.
Let's look at three examples to demonstrate ways of creating dynamic date ranges.
SELECT time, device_id, random()*100 as cpu_usage
FROM generate_series(
now() - INTERVAL '6 months',
now(),
INTERVAL '1 hour'
) as time,
generate_series(1,4) device_id;
Notice that we use the PostgreSQL function now()
to automatically choose the ending timestamp, and then use date math (- INTERVAL '6 months') to let PostgreSQL find the starting timestamp for us. With almost no effort, we can easily generate weeks, months, or years of time-series data by changing the INTERVAL
we subtract from now()
.
SELECT time, device_id, random()*100 as cpu_usage
FROM generate_series(
'2021-08-01 00:00:00' - INTERVAL '6 months',
'2021-08-01 00:00:00',
INTERVAL '1 hour'
) as time,
generate_series(1,4) device_id;
This example is the same as the first, but here we specify the timestamp that we want to end on. PostgreSQL can still do the date math for us (subtracting 6 months in this example), but we get control over the exact ending timestamp to use. This is particularly useful when you want the actual time portion of the timestamp to begin and end on even, rounded hours, minutes, or seconds. When we use now()
, the timestamp can produce (what feels like) random timestamp causing all of the time-series data to have timestamps that are increments of now()
.
For this last example, we're going to specify the start timestamp instead. This can be useful when you need to test a specific fiscal period or maybe some logic that deals with the turning of each year. In this case, maybe you just need a month of data but it needs to cross over from one year to the next. In that case, trying to figure out the math of how far back to start and how far to go forward might be more complicated than you want to worry about.
SELECT time, device_id, random()*100 as cpu_usage
FROM generate_series(
'2020-12-15 00:00:00',
'2020-12-15 00:00:00' + INTERVAL '2 months',
INTERVAL '1 hour'
) as time,
generate_series(1,4) device_id;
Each of these examples uses date math to help you find the appropriate start and end timestamps so that you can easily adjust to create more or less data while still fitting the range profile that you need.
Speaking of calculating how many rows you want to generate… 😉
We now have all the tools we need to create datasets of almost any size. For time-series data specifically, it's a straightforward calculation to figure out how many rows your query will generate.
Total Rows = Readings per hour * Total hours * Number of "things" being tracked
Using this formula, we can quickly determine how many rows will be created by changing any combination of the total range (start/end timestamps), how many readings per hour for each item, or the total number of items. Let's look at a couple of examples to get an idea of how quickly you could create many rows of time-series data for your testing scenario.
Range of readings | Length of interval | Number of "devices" | Total rows |
---|---|---|---|
1 year | 1hour | 4 | 35,040 |
1 year | 10 minutes | 100 | 5,256,000 |
6 months | 5 minutes | 1,000 | 52,560,000 |
As you can see, the numbers start to add up very quickly.
In this first post, we've demonstrated that it's pretty easy to generate lots of data using the PostgreSQL generate_series()
function. We also learned that when you select multiple sets (using generate_series()
or selecting from tables and functions), PostgreSQL will produce what's known as a Cartesian product, the selection of all rows from all tables - the product of all rows. With this knowledge, we can quickly create large and diverse datasets to test various features of PostgreSQL and TimescaleDB.
This was part 1 of the three-part series:
generate_series()
. Learn how to use custom user-defined functions to create more realistic-looking data to use for testing, including generated text, numbers constrained by a range, and even fake JSON data.generate_series()
and SQL. Learn how to add shape and trends into your sample time-series data (e.g., increasing web traffic over time and quarterly sales cycles) using the formatting functions in this post in conjunction with relational lookup tables and additional mathematical functions. Knowing how to manipulate the pattern of generated data is particularly useful for visualizing time-series data and learning analytical PostgreSQL or TimescaleDB functions.If you are not using TimescaleDB yet, take a look. It's a PostgreSQL extension that will make your queries faster via automatic partitioning, query planner enhancements, improved materialized views, columnar compression, and much more. It also comes with awesome functionality for time-series data analysis.
If you're running your PostgreSQL database in your own hardware, you can simply add the TimescaleDB extension. If you prefer to try Timescale in AWS, create a free account on our platform.
See what's possible when you join PostgreSQL with time-series superpowers!