Jan 04, 2024
Posted by Chris Engelbert
Designing a good, future-proof data model is one of the most challenging problems when building applications. That is especially true for applications meant to store (and analyze) massive amounts of data, such as time series, logs, or event data.
Deciding which data types are best suited to store that kind of information comes down to a few factors, such as the required precision of floating-point values, the actual content of the values (such as text), compressibility, and query speed.
In this installment of the best practices series (see our posts on narrow, medium, and wide table layouts, single or partitioned hypertables, and metadata tables), we’ll have a look at the different options in PostgreSQL and TimescaleDB regarding most of these questions. While this post can’t answer the precision-requirement question for you, it does offer a few alternatives (such as integers instead of floating point); more on that later.
Event-like data, such as time series, logs, and similar use cases, is notorious for the ever-growing amount of collected information. Hence, it’ll grow continuously and require more and more disk storage.
But storage isn’t the only issue with big data. Querying, aggregating, and analyzing are some of the others. Reading that amount of data from disk requires a lot of I/O operations (IOPS; input/output operations per second), which are one of the most limiting factors in cloud environments, and even in on-premises systems (due to how storage works in general). While non-volatile memory express (NVMe) and similar technologies can help you optimize for high IOPS, they’re not limitless.
That’s where compression comes in. TimescaleDB’s compression algorithms (and, to some extent, PostgreSQL’s default ones) help decrease disk space requirements and IOPS, improving cost, manageability, and query speed.
But let’s get to the actual topic: best practices for data types in TimescaleDB.
PostgreSQL (and the SQL standard in general) offers a great set of basic data types that cover most general use cases. However, you should be discouraged from using some of them. The PostgreSQL Wiki provides a great list of best practices regarding data types to use or avoid. You don’t have to jump over immediately, though; we’ll cover most of them here 🥹.
When looking back at the best practices on table layout, medium and wide table layouts tend to have quite a few nullable columns serving as placeholders for potential values.

Due to how PostgreSQL stores nullable values that are NULL, those are almost free. Having hundreds of nullable columns, most of them NULL, is not an issue. The same is true for TimescaleDB’s custom compression: since data is stored in a columnar format, empty row values are almost free when compressed (null bitmap).
A boolean value is a logical data type with one of two possible values, TRUE or FALSE. It is normally used to record decisions or states.
There isn’t much specific to booleans in TimescaleDB. They are a very simple data type but also a great choice. Still, people often use an integer to represent the values as 1 or 0. This may come in handy with narrow or medium table layouts where you want to limit the number of columns. Nothing speaks against either solution!
In terms of compressibility, booleans aren’t heavily optimized but compress fairly well with the standard compression. If you have a series of states, consider storing only state changes, though, removing duplicates from the dataset (see the sketch below).
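As a rough illustration of that idea, the following query keeps only the rows where a boolean state actually changed compared to the previous reading; the table and column names are made up for this sketch.

```sql
-- keep only rows where the boolean state changed compared to the previous row
-- (device_states, time, and state are illustrative names)
SELECT time, state
FROM (
    SELECT time,
           state,
           lag(state) OVER (ORDER BY time) AS prev_state
    FROM device_states
) AS s
WHERE prev_state IS DISTINCT FROM state;
```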
Floating-point data types represent real numbers, most often decimal values. They are used to store all kinds of information, such as percentages, measurements like temperature or CPU usage, or statistical values.
There are two groups of floating-point numbers in PostgreSQL: the binary floating-point types float4 (a.k.a. real) and float8 (double precision), and the arbitrary-precision numeric.
Float4 and float8 columns are the recommended data types. TimescaleDB handles them specifically (during compression) and optimizes their use. Numeric, on the other hand, as an arbitrary-precision data type, isn’t optimized at all and isn’t recommended.
In general, though, due to the complexity of floating-point numbers, if you know the required precision upfront, you can use the multiply-division trick and store the values as integers, which are better optimized. For example, say we want to store a temperature value (in Kelvin) with only two decimal places, but the value comes in as a float4.
```sql
-- original float4 value: 298.151566; keep two decimal places by scaling up
SELECT round(298.151566::float4 * 100)::int AS stored_value;  -- 29815
-- divide by 100 at query time to get the two-decimal value back
SELECT 29815::float4 / 100 AS query_value;                    -- 298.15
```
It is a trick often used in data transmission for embedded devices with a low throughput uplink to limit the number of bytes sent.
Integer data types represent natural numbers of various sizes (meaning valid number ranges, depending on how many bytes are used to represent them). Integer values are often used for simple values, such as counts of events or similar.
All integer types (int2/smallint, int4/integer, int8/bigint) are recommended data types. TimescaleDB is heavily optimized to compress the values of those data types. No less than three compression algorithms are working in tandem to get the most out of these data types.
This is why it is advised (if you know the necessary precision of a floating-point value) to store the values as integers (see floating-point values for more information).
What is true for integers is also true for all serial data types (serial2/smallserial, serial4/serial, serial8/bigserial), as those are magical “aliases” for their integer counterparts that additionally create a sequence to fill in their values on insert.
That said, they use the corresponding integer data type as the column data type. The PostgreSQL best practices advise against using them, though, and recommend identity columns instead for PostgreSQL 10 and later.
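As a quick illustration (the table and column names here are made up), an identity column on a plain bigint gives you the same auto-incrementing behavior without the serial pseudo-type:

```sql
-- prefer an identity column over bigserial (PostgreSQL 10+)
CREATE TABLE readings (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    recorded_at timestamptz NOT NULL,
    value       float8
);
```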
Timestamp, time, and date data types represent a specific point in time, some with more and some with less explicit information. All of these data types have versions with and without time zone information attached (except date).
Before going into details, I’d generally advise against any of the data types without timezones (timestamp without time zone, timestamp, time without time zone, time). Most of them are discouraged by PostgreSQL’s best practices and shouldn’t be used.
Contrary to a common misconception, leaving out the time zone doesn’t save any storage space, and TimescaleDB doesn’t have any optimization for the data types without time zones. While it works, it will add a lot of casting overhead, which is implicit and, therefore, not immediately visible. That said, just don’t do it 🔥.
With that out of the way, you can use date and time, but you should consider the use case. While date is optimized and compressed using the same compression scheme as integers, time is not. In any case, you shouldn’t use either of them for the time-dimension column.
To store dates, you should consider using timestamptz with the time portion set to midnight in the necessary time zone (2023-01-01T00:00:00+00) to prevent any casting overhead when querying.
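A minimal sketch of that idea (the literal values are chosen purely for illustration):

```sql
-- a "date" stored as timestamptz with the time portion pinned to midnight UTC
SELECT '2023-01-01T00:00:00+00'::timestamptz AS stored_date;
-- truncating an existing timestamptz to midnight (in the session time zone)
SELECT date_trunc('day', now()) AS stored_date;
```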
Likewise, you can use timestamptz to store a time value only. In this case, you encode the time portion onto a fixed date (such as 1970-01-01T15:01:44+04) and cast the final value back into a time value. Alternatively, you can store the value as an integer by encoding the time as the (nano)seconds since midnight or any other encoding you can come up with.
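Here is a rough sketch of the first variant; the date and time values are arbitrary examples:

```sql
-- encode a time-of-day into a timestamptz anchored to a fixed date ...
SELECT '1970-01-01 15:01:44+04'::timestamptz AS stored_time;
-- ... and cast it back into a time value when querying
SELECT '1970-01-01 15:01:44+04'::timestamptz::time AS time_of_day;
```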
That leaves us with timestamptz (timestamp with time zone). You’ve guessed it: this is the recommended data type for any kind of point-in-time storage.
This data type is highly optimized, used by all internal functionality, and employs the same compression as integers. Said compression is especially effective with timestamps, as they tend to have little difference between two consecutive values and compress extremely well.
Still, be aware that some frameworks, object-relational mappings (ORMs), or tools love their timestamp without time zone and need to be forced to be good citizens.
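For completeness, a typical hypertable definition with a timestamptz time dimension might look like this (all names are illustrative):

```sql
-- a typical hypertable with a timestamptz time dimension
CREATE TABLE conditions (
    time        timestamptz NOT NULL,
    device_id   int,
    temperature float8
);
SELECT create_hypertable('conditions', 'time');
```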
Text values are used to store textual values of arbitrary size. Those values can include detailed descriptions, log messages, and metric names or tags. The available data types include text, char(n), and varchar(n).
PostgreSQL’s best practices advise against using char(n), as it will pad values shorter than n to that size and waste storage. It recommends using text instead.
The same is true with varchar(n) with a length limit. Consider using varchar (without length limit) or text.
From a TimescaleDB-specific perspective, there isn’t much to say except you may want to deduplicate long values using a separate table holding the actual value and a reference (such as a checksum on the content) and storing the reference in the hypertable.
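A minimal sketch of such a setup, with made-up table and column names; the exact checksum type and hashing function are up to you:

```sql
-- lookup table holding each distinct long text value exactly once
CREATE TABLE message_texts (
    checksum bytea PRIMARY KEY,   -- e.g. a sha256 digest of the content
    content  text NOT NULL
);

-- the hypertable only stores the (small) reference
CREATE TABLE log_events (
    time             timestamptz NOT NULL,
    message_checksum bytea REFERENCES message_texts (checksum)
);
SELECT create_hypertable('log_events', 'time');
```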
TimescaleDB doesn’t offer any specific optimization to handle this type of data. It will, however, apply dictionary compression (lz4-based) to those text fields.
Byte arrays (in PostgreSQL represented by the data type bytea) store arbitrary large sequences of bytes, which may represent anything, from encoded machine state to binary data packets.
When looking at customer/user use cases, it is a very uncommon data type, as most data is decoded before being stored in the database. Therefore, TimescaleDB doesn’t optimize anything about this data type. Compression, however, should use the lz4-based compression.
If you have recurring, large bytea values, you can store them outside the actual hypertable and apply deduplication, as explained for text columns.
Compared to basic data types, complex data types commonly encode multiple values into a single column. This may include custom composite types.
Structural data types encode complete objects or sets of information, often losing the exact data types of the actual values along the way. PostgreSQL supports three structural data types: JSON, JSONB (a binary representation of JSON), and XML.
Values often contain complex state information from machines or sensors. They are also common with narrow table layouts to compensate for the different values that need to be stored.
To get it out of the way, if you want to store JSON-like data, don’t use JSON—use JSONB! It’s better regarding storage space, query speed, and anything you can think of.
The only disadvantage of JSONB is that you lose the original order of properties since JSONB will decompose the object for better efficiency due to the way it is stored internally. Anyhow, not sure I can come up with a great reason for why the order should matter, and I hope you agree 😉.
That said, JSONB is a valid choice when storing complex data. The amount of stored data should be kept under close observation, though. If you have recurring, large amounts of data inside the JSONB objects, it may be advisable to extract that meta information into a separate, vanilla PostgreSQL table and join it in at query time. An approach is shown in the text values section.
For the XML, I don’t have enough data to give any recommendations. Actually, I cannot remember anyone ever asking anything about it. Due to my scarce experience, I wouldn't advise using it. I guess it may behave similarly to JSONB, but that’s reading clouds.
For all the above data types, TimescaleDB won’t apply any specific optimizations, and compression will likely use the dictionary algorithm, which still yields impressive results (been there, done that).
There are a few things to be wary of when using those data types. Remember that the whole object needs to be read, parsed, and the requested path’s value extracted. That happens for every single row of data being queried.
One thing I learned using JSONB is that indexes on expressions and GIN indexes are your friends if you need to make selection decisions based on data inside the object (such as tags). Ensure that anything you need to make that decision is available in an index, so you don’t have to read and (potentially) decompress the whole object.
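For example, index definitions along these lines (the table and key names are illustrative) cover both cases:

```sql
-- GIN index supporting containment queries such as payload @> '{"tag": "cpu"}'
CREATE INDEX ON sensor_data USING GIN (payload jsonb_path_ops);
-- expression index supporting equality lookups on a single key
CREATE INDEX ON sensor_data ((payload ->> 'device_id'));
```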
The second element to remember is that extracting values from those objects yields text values that need to be cast into the required data type. Even if the value is stored as a number inside the JSONB object, extracting it returns the text representation, which then needs to be cast back into an integer or floating-point value, adding a lot of casting overhead to the process.
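In practice that looks something like the following (again with made-up names):

```sql
-- ->> returns text, so the value has to be cast back into a number
SELECT (payload ->> 'temperature')::float8 AS temperature
FROM sensor_data;
```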
Using UUIDs to store unique identifiers is common in the PostgreSQL world. They are used for various reasons, including being able to generate values that are highly unlikely to collide on a system other than the database, for “security” reasons (meaning removing potentially guessable series of numbers), and others.
A UUID represents a (semi)random 128-bit number in the format specified in RFC 4122 (ISO/IEC 9834-8:2005), which looks similar to a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11.
TimescaleDB doesn’t optimize this data type at all. It will compress it using the dictionary compression algorithm. However, there are other elements to keep in mind when using UUID. While they offer many advantages and simplify distributed systems, they also create some locality issues, specifically with BTree indexes. You will find a great read on this topic in the Cybertec blog.
My recommendation would be to think twice about using UUIDs.
While ltree is an uncommon data type when storing event-like data, it can be a great fit for deduplicating recurring, larger values. An ltree represents a certain (sub)path in a tree, for example a specific device in an overarching topology. Its values resemble a canonical, dot-separated path (building1.floor3.room10.device12).
Values of data type ltree aren’t handled by TimescaleDB specifically. However, due to their size, this isn’t a big deal.
As mentioned above, ltree values are perfect for deduplicating recurring data, such as meta information that is unlikely to change or changes only slowly. Combined with a hash/checksum, they can be used as a reference to look the values up in an external table; a rough sketch follows below.
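A sketch of such a layout, with illustrative table and column names (here the path itself serves as the reference; a checksum column would work just as well):

```sql
CREATE EXTENSION IF NOT EXISTS ltree;

-- slowly changing topology metadata, stored once per device
CREATE TABLE devices (
    path     ltree PRIMARY KEY,   -- e.g. 'building1.floor3.room10.device12'
    metadata jsonb
);

-- the hypertable only references the path
CREATE TABLE measurements (
    time        timestamptz NOT NULL,
    device_path ltree REFERENCES devices (path),
    value       float8
);
SELECT create_hypertable('measurements', 'time');
```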
hstore provides the capability to store key-value pairs of data, similar to what a non-nested JSON(B) object would offer.
From my experience, only a few people use hstore, and most prefer JSONB. One reason may be that hstore only offers text values. That said, I don’t have data on its compression gains or speed implications. I’d guess that a lot of what was noted for JSONB objects also holds for hstore.
Last but not least, there are the PostGIS data types, such as geometry and geography. PostGIS data types are used to store spatial information of any kind. Typical use cases are GPS positions, street information, or positional changes over time.
TimescaleDB works perfectly fine with PostGIS data types but doesn’t optimize them. Compression ratios, while not perfect, are still decent.
Not a lot to mention here: they work, and they work great.
Recommending data types is hard since many decisions depend on requirements outside the scope of this blog post. However, I hope this read provides insight into which data types to avoid, which ones are fine, and how you may optimize okay data types.
To summarize, here’s a small table showing different aspects and categorizing them into four states: great 🟢, okay but be wary of something 🟠, avoid at all costs 🔴, or unknown (feel free to provide experience/feedback) ⚪️.
Data Type | Query Speed | Compressibility | Recommended | Alternative |
---|---|---|---|---|
boolean | 🟢 | 🟢 | 🟢 | |
real, float4 | 🟢 | 🟢 | 🟢 | |
double precision, float8 | 🟢 | 🟢 | 🟢 | |
numeric | 🔴 | 🔴 | 🔴 | floats, integers |
smallint, int2 | 🟢 | 🟢 | 🟢 | |
integer, int4 | 🟢 | 🟢 | 🟢 | |
bigint, int8 | 🟢 | 🟢 | 🟢 | |
smallserial, serial2 | 🟢 | 🟢 | 🟠 | int2 with identity |
serial, serial4 | 🟢 | 🟢 | 🟠 | int4 with identity |
bigserial, serial8 | 🟢 | 🟢 | 🟠 | int8 with identity |
timestamp without time zone, timestamp | 🔴 | 🟢 | 🔴 | timestamptz |
timestamp with time zone, timestamptz | 🟢 | 🟢 | 🟢 | |
time | 🔴 | ⚪️ | 🔴 | timestamptz |
timetz | 🔴 | ⚪️ | 🔴 | timestamptz |
date | 🔴 | 🟢 | 🔴 | timestamptz |
char(n) | 🟢 | 🟠 | 🔴 | text |
varchar(n) | 🟢 | 🟠 | 🔴 | text |
text | 🟢 | 🟠 | 🟠 | |
bytea | 🟢 | 🟠 | 🟠 | |
json | 🔴 | 🟠 | 🔴 | jsonb |
jsonb | 🟢 | 🟠 | 🟠 | |
xml | ⚪️ | ⚪️ | ⚪️ | jsonb |
uuid | 🟠 | 🟠 | 🟠 | |
ltree | 🟢 | 🟠 | 🟢 | |
hstore | ⚪️ | ⚪️ | ⚪️ | jsonb |
geometry, geography (PostGIS) | 🟢 | 🟠 | 🟢 | |
If you want to know more about the compression algorithms used, see the blog post one of my colleagues wrote, Time-Series Compression Algorithms, Explained.
Also, if you have any feedback or experience with any of the above data types or the non-mentioned ones, I’d be happy to hear from you on Twitter or our Community Slack (@Chris Engelbert)!
Finally, I’m sorry for the wall of text you’ve had to read to get here, but there’s a lot of information to share and many different data types (and those aren’t even all of them). I hope you enjoyed the read and learned something new along the way.
If you want to test Timescale right now, the easiest and fastest way to get started is to sign up for our 30-day Timescale free trial. To try self-managed TimescaleDB, see the documentation for further information.