Generating More Realistic Sample Time-Series Data With PostgreSQL generate_series()
In this three-part series on generating sample time-series data, we demonstrate how to use the built-in PostgreSQL function, generate_series()
, to more easily create large sets of data to help test various workloads, database features, or just to create fun samples.
In part 1 of the series, we reviewed how generate_series()
works, including the ability to join multiple series into a larger table of time-series data - through a feature known as a CROSS (or Cartesian) JOIN. We ended the first post by showing you how to quickly calculate the number of rows a query will produce and modify the parameters for generate_series()
to fine-tune the size and shape of the data.
However, there was one problem with the data we could produce at the end of the first post. The data that we were able to generate was very basic and not very realistic. Without more effort, using functions like random()
to generate values doesn't provide much control over precisely what numbers are produced, so the data still feels more fake than we might want.
This second post will demonstrate a few ways to create more realistic-looking data beyond a column or two of random decimal values. Read on for more.
In part 3 of this blog series adds one final tool to the mix - combining the data formatting techniques below with additional equations and relational data to shape your sample time-series output into something that more closely resembles real-life applications.
By the end of this series, you'll be ready to test almost any feature that TimescaleDB offers and create quick datasets for your testing and demos!
A brief review of generate_series()
In the first post, we demonstrated how generate_series()
(a Set Returning Function) could quickly create a data set based on a range of numeric values or dates. The generated data is essentially an in-memory table that can quickly create large sets of sample data.
-- create a series of values, 1 through 5, incrementing by 1
SELECT * FROM generate_series(1,5);
generate_series|
---------------|
1|
2|
3|
4|
5|
-- generate a series of timestamps, incrementing by 1 hour
SELECT * from generate_series('2021-01-01','2021-01-02', INTERVAL '1 hour');
generate_series
------------------------
2021-01-01 00:00:00+00
2021-01-01 01:00:00+00
2021-01-01 02:00:00+00
2021-01-01 03:00:00+00
2021-01-01 04:00:00+00
...
We then discussed how the data quickly becomes more complex as we join the various sets together (along with some value returning functions) to create a multiple of both sets together.
This example from the first post joined a timestamp set, a numeric set, and the random()
function to create fake CPU data for four fake devices over time.
-- there is an implicit CROSS JOIN between the two generate_series() sets
SELECT time, device_id, random()*100 as cpu_usage
FROM generate_series('2021-01-01 00:00:00','2021-01-01 04:00:00',INTERVAL '1 hour') as time,
generate_series(1,4) device_id;
time |device_id|cpu_usage |
-------------------+---------+-------------------+
2021-01-01 00:00:00| 1|0.35415126479989567|
2021-01-01 01:00:00| 1| 14.013393572770028|
2021-01-01 02:00:00| 1| 88.5015939122006|
2021-01-01 03:00:00| 1| 97.49037810105996|
2021-01-01 04:00:00| 1| 50.22781125586846|
2021-01-01 00:00:00| 2| 46.41196423062297|
2021-01-01 01:00:00| 2| 74.39903569177027|
2021-01-01 02:00:00| 2| 85.44087332221935|
2021-01-01 03:00:00| 2| 4.329394730750735|
2021-01-01 04:00:00| 2| 54.645873866589056|
2021-01-01 00:00:00| 3| 63.01888063314749|
2021-01-01 01:00:00| 3| 21.70606884856987|
2021-01-01 02:00:00| 3| 32.47610779097485|
2021-01-01 03:00:00| 3| 47.565982341726354|
2021-01-01 04:00:00| 3| 64.34867263419619|
2021-01-01 00:00:00| 4| 78.1768041898232|
2021-01-01 01:00:00| 4| 84.51505102850199|
2021-01-01 02:00:00| 4| 24.029611792753514|
2021-01-01 03:00:00| 4| 17.08996115345549|
2021-01-01 04:00:00| 4| 29.642690955760997|
And finally, we talked about how to calculate the total number of rows your query would generate based on the time range, the interval between timestamps, and the number of "things" for which you are creating fake data.
Range of readings | Length of interval | Number of "devices" | Total rows |
---|---|---|---|
1 year | 1 hour | 4 | 35,040 |
1 year | 10 minutes | 100 | 5,256,000 |
6 months | 5 minutes | 1,000 | 52,560,000 |
Still, the main problem remains. Even if we can generate 50 million rows of data with a few lines of SQL, the data we generate isn't very realistic. It's all random numbers, with lots of decimals and minimal variation.
As we saw in the query above (generating fake CPU data), any columns of data that we add to the SELECT query are added to each row of the resulting set. If we add static text (like 'Hello, Timescale!'), that text is repeated for every row. Likewise, adding a function as a column value will be called one time for each row of the final set.
That's what happened with the random()
function in the CPU data example. Every row has a different value because the function is called separately for each row of generated data. We can use this to our advantage to begin making the data look more realistic.
With a little more thought and custom PostgreSQL functions, we can start to bring our sample data "to life."
What is realistic data?
This feels like a good time to make sure we're on the same page. What do I mean by "realistic" data?
Using the basic techniques we've already discussed allows you to create a lot of data quickly. In most cases, however, you often know what the data you're trying to explore looks like. It's probably not a bunch of decimal or integer values. Even if the data you're trying to mimic are just numeric values, they likely have valid ranges and maybe a predictable frequency.
Take our simple example of CPU and temperature data from above. With just two fields, we have a few choices to make if we want the generated data to feel more realistic.
- Is CPU a percentage? Out of 100% or are we representing multi-core CPUs that can present as 200%, 400%, or 800%?
- Is temperature measured in Fahrenheit or Celsius? What are reasonable values for CPU temperature in each unit? Do we store temperature with decimals or as an integer in the schema?
- What if we added a "note" field to the schema for messages that our monitoring software might add to the readings from time to time? Would every reading have a note or just when a threshold was reached? Is there a special diagnostic message at the top of each hour that we need to replicate in some way?
Using random()
and static text by themselves allows us to generate lots of data with many columns, but it's not going to be very interesting or as useful in testing features in the database.
That's the goal of the second and third posts in this series, helping you to produce sample data that looks more like the real thing without much extra work. Yes, it will still be random, but it will be random within constraints that help you feel more connected to the data as you explore various aspects of time-series data.
And, by using functions, all of the work is easily reusable from table to table.
Walk before you run
In each of the examples below, we'll approach our solutions much as we learned in elementary math class: show your work! It's often difficult to create a function or procedure in PostgreSQL without playing with a plain SQL statement first. This abstracts away the need to think about function inputs and outputs at the outset so that we can focus on how the SQL works to produce the value we want.
Therefore, the examples below show you how to get a value (random numbers, text, JSON, etc.) in a SELECT statement first before converting the SQL into a function that can be reused. This kind of iterative process is a great way to learn features of PostgreSQL, particularly when it's combined with generate_series()
.
So, take one foot and put it in front of the other, and let's start creating better sample data.
Creating more realistic numbers
In time-series data, numeric values are often the most common data type. Using a function like random()
without any other formatting creates very… well... random (and precise) numbers with lots of decimal points. While it works, the values aren't realistic. Most users and devices aren't tracking CPU usage to 12+ decimals. We need a way to manipulate and constrain the final value that's returned in the query.
For numeric values, PostgreSQL provides many built-in functions to modify the output. In many cases, using round()
and floor()
with basic arithmetic can quickly start shaping the data in a way that better fits your schema and use case.
Let's modify the example query for getting device metrics, returning values for CPU and temperature. We want to update the query to ensure that the data values are "customized" for each column, returning values within a specific range and precision. Therefore, we need to apply a standard formula to each numeric value in our SELECT query.
Final value = random() * (max allowed value - min allowed value) + min allowed value
This equation will always generate a decimal value between (and inclusive of) the min and max value. If random()
returns a value of 1, the final output will equal the maximum value. If random()
returns a value of 0, then the result will equal the minimum value. Any other number that random()
returns will produce some output between the min and max values.
Depending on whether we want a decimal or integer value, we can further format the "final value" of our formula with round()
and floor()
.
This example produces a reading every minute for one hour for 10 devices. The cpu value will always fall between 3 and 100 (with four decimals of precision), and the temperature will always be an integer between 28 and 83.
SELECT
time,
device_id,
round((random()* (100-3) + 3)::NUMERIC, 4) AS cpu,
floor(random()* (83-28) + 28)::INTEGER AS tempc
FROM
generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time,
generate_series(1,10,1) AS device_id;
time |device_id|cpu |tempc |
-----------------------------+---------+-------+-------------+
2021-11-03 12:47:01.181 -0400| 1|53.7301| 61|
2021-11-03 12:48:01.181 -0400| 1|34.7655| 46|
2021-11-03 12:49:01.181 -0400| 1|78.6849| 44|
2021-11-03 12:50:01.181 -0400| 1|95.5484| 64|
2021-11-03 12:51:01.181 -0400| 1|86.3073| 82|
…|...|...|...
By using our simple formula and formatting the result correctly, the query produced the "curated" output (random as it is) we wanted.
The power of functions
But there's also a bit of a letdown here, isn't there? Typing that formula repeatedly for each value - trying to remember the order of parameters and when I need to cast a value - will become tedious quickly. After all, you only have so many keystrokes left.
The solution is to create and use PostgreSQL functions that can take the inputs we need, do the correct calculations, and return the formatted value that we want. There are many ways we could accomplish a calculation like this in a function. Use this example as a starting place for your learning and exploration.
Note: In this example, I chose to return the value from this function as a numeric
data type because it can return values that look like integers (no decimals) or floats (decimals). As long as the return values are inserted into a table with the intended schema, this is a "trick" to visually see what we expect - an integer or a float. In general, the numeric
data type will often perform worse in queries and features like compression because of how numeric
values are represented internally. We recommend avoiding numeric
types in schema design whenever possible, preferring the float or integer types instead.
/*
* Function to create a random numeric value between two numbers
*
* NOTICE: We are using the type of 'numeric' in this function in order
* to visually return values that look like integers (no decimals) and
* floats (with decimals). However, if inserted into a table, the assumption
* is that the appropriate column type is used. The `numeric` type is often
* not the correct or most efficient type for storing numbers in a table.
*/
CREATE OR REPLACE FUNCTION random_between(min_val numeric, max_val numeric, round_to int=0)
RETURNS numeric AS
$$
DECLARE
value NUMERIC = random()* (min_val - max_val) + max_val;
BEGIN
IF round_to = 0 THEN
RETURN floor(value);
ELSE
RETURN round(value,round_to);
END IF;
END
$$ language 'plpgsql';
This example function uses the minimum and maximum values provided, applies the "range" formula we discussed earlier, and finally returns a numeric
value that either has decimals (to the specified number of digits) or not. Using this function in our query, we can simplify creating formatted values for sample data, and it cleans up the SQL, making it easier to read and use.
SELECT
time,
device_id,
random_between(3,100, 4) AS cpu,
random_between(28,83) AS temperature_c
FROM
generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time,
generate_series(1,10,1) AS device_id;
This query provides the same formatted output, but now it's much easier to repeat the process.
Creating more realistic text
What about text? So far, in both articles, we've only discussed how to generate numeric data. We all know, however, that time-series data often contain more than just numeric values. Let's turn to another common data type: text.
Time-series data often contains text values. When your schema contains log messages, item names, or other identifying information stored as text, we want to generate sample text that feels more realistic, even if it's random.
Let's consider the query used earlier that creates CPU and temperature data for a set of devices. If the devices were real, the data they create might contain an intermittent status message of varying length.
To figure out how to generate this random text, we will follow the same process as before, working directly in a stand-alone SQL query before moving our solution into a reusable function. After some initial attempts (and ample Googling), I came up with this example for producing random text of variable length using a defined character set. As with the random_between()
function above, this can be modified to suit your needs. For instance, it would be fairly easy to get unique, random hexadecimal values by limiting the set of characters and lengths.
Let your creativity guide you.
WITH symbols(characters) as (VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 {}')),
w1 AS (
SELECT string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), '') r_text, 'g1' AS idx
FROM symbols,
generate_series(1,10) as word(chr_idx) -- word length
GROUP BY idx)
SELECT
time,
device_id,
random_between(3,100, 4) AS cpu,
random_between(28,83) AS temperature_c,
w1.r_text AS note
FROM w1, generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time,
generate_series(1,10,1) AS device_id
ORDER BY 1,2;
time |device_id|cpu |temperature_c|note |
-----------------------------+---------+--------+-------------+----------+
2021-11-03 16:49:24.218 -0400| 1| 88.3525| 50|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400| 2| 29.5313| 53|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400| 3| 97.6065| 70|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400| 4| 96.2170| 40|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400| 5| 53.2318| 82|I}3U}FIsX9|
2021-11-03 16:49:24.218 -0400| 6| 73.7244| 56|I}3U}FIsX9|
In this case, it was easier to generate a random value inside of a CTE that we could reference later in the query. However, this approach has one problem that's pretty easy to spot in the first few rows of returned data.
While the CTE does create random text of 10 characters (go ahead and run it a few times to verify), the value of the CTE is generated once each time and then cached, repeating the same result over and over for every row. Once we transfer the query into a function, we expect to see a different value for each row.
For this second example function to generate "words" of random lengths (or no text at all in some cases), the user will need to provide an integer for the minimum and maximum length of the generated text. After some testing, we also added a simple randomizing feature.
Notice the IF...THEN condition that we added. Any time the generated number is divided by five and has a remainder of zero or one, the function will not return a text value. There is nothing special about this approach to providing randomness to the frequency of the output, so feel free to adjust this part of the function to suit your needs.
/*
* Function to create random text, of varying length
*/
CREATE OR REPLACE FUNCTION random_text(min_val INT=0, max_val INT=50)
RETURNS text AS
$$
DECLARE
word_length NUMERIC = floor(random() * (max_val-min_val) + min_val)::INTEGER;
random_word TEXT = '';
BEGIN
-- only if the word length we get has a remainder after being divided by 5. This gives
-- some randomness to when words are produced or not. Adjust for your tastes.
IF(word_length % 5) > 1 THEN
SELECT * INTO random_word FROM (
WITH symbols(characters) AS (VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 '))
SELECT string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), ''), 'g1' AS idx
FROM symbols
JOIN generate_series(1,word_length) AS word(chr_idx) on 1 = 1 -- word length
group by idx) a;
END IF;
RETURN random_word;
END
$$ LANGUAGE 'plpgsql';
When we use this function to add random text to our sample time-series query, notice that the text is random in length (between 2 and 10 characters) and frequency.
SELECT
time,
device_id,
random_between(3,100, 4) AS cpu,
random_between(28,83) AS temperature_c,
random_text(2,10) AS note
FROM generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time,
generate_series(1,10,1) AS device_id
ORDER BY 1,2;
time |device_id|cpu |temperature_c|note |
-----------------------------+---------+--------+-------------+---------+
2021-11-04 14:17:03.410 -0400| 1| 86.5780| 67| |
2021-11-04 14:17:03.410 -0400| 2| 3.5370| 76|pCVBp AZ |
2021-11-04 14:17:03.410 -0400| 3| 59.7085| 28|kMrr |
2021-11-04 14:17:03.410 -0400| 4| 69.6153| 46|3UdA |
2021-11-04 14:17:03.410 -0400| 5| 33.0906| 56|d0sSUilx |
2021-11-04 14:17:03.410 -0400| 6| 44.2837| 74| |
2021-11-04 14:17:03.410 -0400| 7| 14.2550| 81|TOgbHOU |
Hopefully, you're starting to see a pattern. Using generate_series()
and some custom functions can help you create time-series data of many shapes and sizes.
We've demonstrated ways to create more realistic numbers and text data because they are the primary data types used in time-series data. Are there any other data types included with time-series data that you might need to generate with your sample data?
What about JSON values?
Creating sample JSON
Note: The sample queries below create JSON strings as the output with the intention that it would be inserted into a table for further testing and learning. In PostgreSQL, JSON string data can be stored in a JSON or JSONB column, each providing different features for querying and displaying the JSON data. In most circumstances, JSONB is the preferred column type because it provides more efficient storage and the ability to create indexes over the contents. The main downside is that the actual formatting of the JSON string, including the order of the keys and values, is not retained and may be difficult to reproduce exactly. To better understand the differences of when you would store JSON string data with one column type over the other, please refer to the PostgreSQL documentation.
PostgreSQL has supported JSON and JSONB data types for many years. With each major release, the feature set for working with JSON and overall query performance improves. In a growing number of data models, particularly when REST or Graph APIs are involved, storing extra meta information as a JSON document can be beneficial. The data is available if needed while facilitating efficient queries on serialized data stored in regular columns.
We used a design pattern similar to this in our NFT Starter Kit. The OpenSea JSON API used as the data source for the starter kit includes many properties and values for each asset and collection. A lot of the values weren't helpful for the specific analysis in that tutorial. However, we knew that some of the values in the JSON properties could be useful in future analysis, tutorials, or demonstrations. Therefore, we stored additional metadata about assets and collections in a JSONB field to query it if needed. Still, it didn't complicate the schema design for otherwise common data like name
and asset_id
.
Storing data in a JSON field is also a common practice in areas like IIoT device data. Engineers usually have an agreed-upon schema to store and query metrics produced by the device, followed by a "free form" JSON column that allows engineers to send error or diagnostic data that changes over time as hardware is modified or updated.
There are several approaches to add JSON data to our sample query. One added challenge is that JSON data includes both a key and a value, along with the possibility of numerous levels of child object nesting. The approach you take will depend on how complex you want the PostgreSQL function to be and the end goal of the sample data. In this example, we'll create a function that takes an array of keys for the JSON and generates random numerical values for each key without nesting. Generating the JSON string in SQL from our values is straightforward, thanks to built-in PostgreSQL functions for reading and writing JSON strings. 🎉
As with the other examples in this post, we'll start by using a CTE to generate a random JSON document in a stand-alone SELECT query to verify that the result is what we want. Remember, we'll observe the same issue we had earlier when generating random text in the stand-alone query because we are using a CTE. The JSON is random every time the query runs, but the string is reused for all rows in the result set. CTE's are materialized once for each reference in a query, whereas functions are called again for every row. Because of this, we won't observe random values in each row until we move the SQL into a function to reuse later.
WITH random_json AS (
SELECT json_object_agg(key, random_between(1,10)) as json_data
FROM unnest(array['a', 'b']) as u(key))
SELECT json_data, generate_series(1,5) FROM random_json;
json_data |generate_series|
----------------+---------------+
{"a": 6, "b": 2}| 1|
{"a": 6, "b": 2}| 2|
{"a": 6, "b": 2}| 3|
{"a": 6, "b": 2}| 4|
{"a": 6, "b": 2}| 5|
We can see that the JSON data is created using our keys (['a','b']) with numbers between 1 and 10. We just have to create a function that will create random JSON data each time it is called. This function will always return a JSON document with numeric integer values for each key we provide for demonstration purposes. Feel free to enhance this function to return more complex documents with various data types if that's a requirement for you.
CREATE OR REPLACE FUNCTION random_json(keys TEXT[]='{"a","b","c"}',min_val NUMERIC = 0, max_val NUMERIC = 10)
RETURNS JSON AS
$$
DECLARE
random_val NUMERIC = floor(random() * (max_val-min_val) + min_val)::INTEGER;
random_json JSON = NULL;
BEGIN
-- again, this adds some randomness into the results. Remove or modify if this
-- isn't useful for your situation
if(random_val % 5) > 1 then
SELECT * INTO random_json FROM (
SELECT json_object_agg(key, random_between(min_val,max_val)) as json_data
FROM unnest(keys) as u(key)
) json_val;
END IF;
RETURN random_json;
END
$$ LANGUAGE 'plpgsql';
With the random_json()
function in place, we can test it in a few ways. First, we'll simply call the function directly without any parameters, which will return a JSON document with the default keys provided in the function definition ("a", "b", "c") and values from 0 to 10 (the default minimum and maximum value).
SELECT random_json();
random_json |
------------------------+
{"a": 7, "b": 3, "c": 8}|
Next, we'll join this to a small numeric set from generate_series()
.
SELECT device_id, random_json() FROM generate_series(1,5) device_id;
device_id|random_json |
---------+-------------------------+
1|{"a": 2, "b": 2, "c": 2} |
2| |
3|{"a": 10, "b": 7, "c": 1}|
4| |
5|{"a": 7, "b": 1, "c": 0} |
Notice two things with this example.
First, the data is different for each row, showing that the function gets called for each row and produces different numeric values each time. Second, because we kept the same random output mechanism from the random_text()
example, not every row includes JSON.
Finally, let's add this into the sample query for generating device data that we've used throughout this article to see how to provide an array of keys ("building" and "rack") for the generated JSON data.
SELECT
time,
device_id,
random_between(3,100, 4) AS cpu,
random_between(28,83) AS temperature_c,
random_text(2,10) AS note,
random_json(ARRAY['building','rack'],1,20) device_location
FROM generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time,
generate_series(1,10,1) AS device_id
ORDER BY 1,2;
time |device_id|cpu |temperature_c|note |device_location |
-----------------------------+---------+--------+-------------+---------+----------------------------+
2021-11-04 16:19:22.991 -0400| 1| 14.7614| 70|CTcX8 2s4| |
2021-11-04 16:19:22.991 -0400| 2| 62.2618| 81|x1V |{"rack": 4, "building": 5} |
2021-11-04 16:19:22.991 -0400| 3| 10.1214| 50|1PNb | |
2021-11-04 16:19:22.991 -0400| 4| 96.3742| 29|aZpikXGe |{"rack": 12, "building": 4} |
2021-11-04 16:19:22.991 -0400| 5| 22.5327| 30|lM |{"rack": 2, "building": 3} |
2021-11-04 16:19:22.991 -0400| 6| 57.9773| 44| |{"rack": 16, "building": 5} |
...
There are just so many possibilities for creating sample data with generate_series()
, PostgreSQL functions, and some custom logic.
Putting it all together
Let's put what we've learned into practice, using these three functions to create and insert ~1 million rows of data and then query it with the hyperfunctions time_bucket()
, time_bucket_ng()
, approx_percentile()
and time_weight()
. To do this, we'll create two tables: one will be a list of computer hosts and the second will be a hypertable that stores fake time-series data about the computers.
Step 1: Create the schema and hypertable
CREATE TABLE host (
id int PRIMARY KEY,
host_name TEXT,
LOCATION jsonb
);
CREATE TABLE host_data (
date timestamptz NOT NULL,
host_id int NOT NULL,
cpu double PRECISION,
tempc int,
status TEXT
);
SELECT create_hypertable('host_data','date');
Step 2: Generate and insert data
-- Insert data to create fake hosts
INSERT INTO host
SELECT id, 'host_' || id::TEXT AS name,
random_json(ARRAY['building','rack'],1,20) AS LOCATION
FROM generate_series(1,100) AS id;
-- insert ~1.3 million records for the last 3 months
INSERT INTO host_data
SELECT date, host_id,
random_between(5,100,3) AS cpu,
random_between(28,90) AS tempc,
random_text(20,75) AS status
FROM generate_series(now() - INTERVAL '3 months',now(), INTERVAL '10 minutes') AS date,
generate_series(1,100) AS host_id;
Step 3: Query data using time_bucket()
and time_bucket_ng()
-- Using time_bucket(), query the average CPU and max tempc
SELECT time_bucket('7 days', date) AS bucket, host_name,
avg(cpu),
max(tempc)
FROM host_data
JOIN host ON host_data.host_id = host.id
WHERE date > now() - INTERVAL '1 month'
GROUP BY 1,2
ORDER BY 1 DESC, 2;
-- try the experimental time_bucket_ng() to query data in month buckets
SELECT timescaledb_experimental.time_bucket_ng('1 month', date) AS bucket, host_name,
avg(cpu) avg_cpu,
max(tempc) max_temp
FROM host_data
JOIN host ON host_data.host_id = host.id
WHERE date > now() - INTERVAL '3 month'
GROUP BY 1,2
ORDER BY 1 DESC, 2;
Step 4: Query data using toolkit hyperfunctions
-- query all host in building 10 for 7 day buckets
-- also try the new percentile approximation function to
-- get the p75 of data for each 7 day period
SELECT time_bucket('7 days', date) AS bucket, host_name,
avg(cpu),
approx_percentile(0.75,percentile_agg(cpu)) p75,
max(tempc)
FROM host_data
JOIN host ON host_data.host_id = host.id
WHERE date > now() - INTERVAL '1 month'
AND LOCATION -> 'building' = '10'
GROUP BY 1, 2
ORDER BY 1 DESC, 2;
-- To test time-weighted averages, we need to simulate missing
-- some data points in our host_data table. To do this, we'll
-- randomly select ~10% of the rows, and then delete them from the
-- host_data table.
WITH random_delete AS (SELECT date, host_id FROM host_data
JOIN host ON host_id = id WHERE
date > now() - INTERVAL '2 weeks'
ORDER BY random() LIMIT 20000
)
DELETE FROM host_data hd
USING random_delete rd
WHERE hd.date = rd.date
AND hd.host_id = rd.host_id;
-- Select the daily time-weighted average and regular average
-- of each host for building 10 for the last two weeks.
-- Notice the variation in the two numbers because of the missing data.
SELECT time_bucket('1 day',date) AS bucket,
host_name,
average(time_weight('LOCF',date,cpu)) weighted_avg,
avg(cpu)
FROM host_data
JOIN host ON host_data.host_id = host.id
WHERE LOCATION -> 'building' = '10'
AND date > now() - INTERVAL '2 weeks'
GROUP BY 1,2
ORDER BY 1 DESC, 2;
In a few lines of SQL, we created 1.3 million rows of data and were able to test four different functions in TimescaleDB, all without relying on any external source. 💪
Still, you may notice one last issue with the values in our host_data
table (even though the values are not more realistic in nature). By using random()
as the basis for our queries, the calculated numeric values all tend to have an equal distribution within the specified range which causes the average of the values to always be near the median. This makes sense statistically, but it highlights one other area of improvement to the data we generate. In the third post of this series, we'll demonstrate a few ways to influence the generated values to provide shape to the data (and even some outliers if we need them).
Reviewing our progress
When using a database like TimescaleDB or testing features in PostgreSQL, generating a representative dataset is a beneficial tool to have in your SQL toolbelt.
In the first post, we learned how to generate lots of data by combining the result sets of multiple generate_series()
functions. Using the implicit CROSS JOIN
, the total number of rows in the final output is a product of each set together. When one of the data sets contains timestamps, the output can be used to create time-series data for testing and querying.
The problem with our initial examples was that the actual values we generated were random and lacked control over their precision - and all of the data was numeric. So in this second post, we demonstrated how to format the numeric data for a given column and generate random data of other types, like text and JSON documents. We also added an example in the text and JSON functions that created randomness in how often the values were emitted for each of those columns.
Again, all of these are building block examples for you to use, creating functions that generate the kind of data you need to test.
To see some of these examples in action, watch my video on creating realistic sample data:
In part 3 of this series, we will demonstrate how to add shape and trends into your sample time-series data (e.g., increasing web traffic over time and quarterly sales cycles) using the formatting functions in this post in conjunction with relational lookup tables and additional mathematical functions. Knowing how to manipulate the pattern of generated data is particularly useful for visualizing time-series data and learning analytical PostgreSQL or TimescaleDB functions.
If you have questions about using generate_series()
or have any questions about TimescaleDB, please join our community Slack channel, where you'll find an active community and a handful of the Timescale team most days.
If you want to try creating larger sets of sample time-series data using generate_series()
and see how the exciting features of TimescaleDB work, sign up for a free 30-day trial or install and manage it on your instances. (You can also learn more by following one of our many tutorials.)