Jan 17, 2024
Posted by
Attila Toth
The time dimension is an essential element of every blockchain's database. Without it, blockchains could not operate: transactions and blocks could not be ordered and added. Blockchain data is inherently time-series data.
At Timescale, we're always looking for ways to showcase the expanding reach of time-series data. Using our PostgreSQL and TimescaleDB superpowers, we can generate insight from data about almost anything. With time-series data, we're not only able to see how things are but also how they have changed and are changing. Measuring change is an essential part of better understanding the world.
In this post, we'll look at tips and strategies for building blockchain apps on PostgreSQL. We'll also look at the ways PostgreSQL and TimescaleDB complement each other to create more dynamic apps.
Blockchain data combines many different types of data—including time-series data. You can get a lot of value from combining these datasets.
In a past blog post, I did exactly that: I analyzed the Bitcoin (BTC) blockchain with PostgreSQL, looking at the transaction volume and BTC USD rate to look into how they correlate and why.
I think this kind of analysis is really interesting because it combines two different datasets. On one side, you have the transaction volume, which you can get directly from the blockchain by ingesting the on-chain data and then rolling the transactions up into daily or weekly aggregates. On the other side, you have cryptocurrency market data, which includes the BTC/USD rate. And all of this happens within the PostgreSQL database.
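As a minimal sketch of what that combination can look like in SQL, assuming a hypothetical btc_usd_rates(time, rate) table alongside the transactions table defined later in this post:

WITH daily_volume AS (
    -- Roll raw on-chain transactions up into daily aggregates.
    SELECT time_bucket('1 day', time) AS day,
           count(*) AS tx_count,
           sum(output_total) * 0.00000001 AS volume_btc
    FROM transactions
    GROUP BY day
), daily_rate AS (
    -- Average the BTC/USD rate per day from the (hypothetical) market-data table.
    SELECT time_bucket('1 day', time) AS day,
           avg(rate) AS avg_btc_usd
    FROM btc_usd_rates
    GROUP BY day
)
SELECT v.day, v.tx_count, v.volume_btc, r.avg_btc_usd
FROM daily_volume v
JOIN daily_rate r USING (day)
ORDER BY v.day;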
Some time-series databases won't allow you to store different types of datasets in the same database, so you end up with time-series data in one database and your relational data in another. To analyze all this data, you have to combine them at the application layer, and that, as we all know, can be a headache.
With TimescaleDB, because we're PostgreSQL, you can store all sorts of data in one database: we are a general-purpose database, but we also did a lot of optimization for time-series data, which is exactly what blockchain data is.
Another example of raw data that can be used as a dataset in your data application is on-chain data from the BTC blockchain. It includes a time column (which will allow you to join this dataset with others), a block ID, a transaction hash, the transaction size, the block size, and so on.
Looking at both these examples, you can notice a few things. If you're building a real-time data application, you must ensure that your database can handle this type of data, which isn't necessarily easy with analytical databases. This is one of the main reasons why users try TimescaleDB: it's a relational database that also performs amazingly well on analytics and historical data.
Finally, blockchain data is connected to other datasets. If you work for a business, you can pull internal data and combine it with public blockchain data.
Now let’s look at the features a database must have to seamlessly handle blockchain data.
To explore the world of Bitcoin transactions, you need to ensure your database infrastructure is reliable, scalable, and maintainable—especially if you’re building a data application on top of this data.
Using PostgreSQL to store blockchain data gives you a reliable foundation that can keep up with the relentless nature of blockchains—which are always up and running.
You need a schema to store anything in a relational database like TimescaleDB. Let's look at a simple yet effective schema you can use to start storing Bitcoin transactions.
The dataset collected for this blog post uses Blockchair’s data dumps. Those data files, just like the data on the actual blockchain, have more data fields than what this article covers. But still, this example schema provides several data fields (e.g., transaction weight, output, fee, etc.) that can get you started and help you generate insights.
This schema also includes a details JSONB column that stores all the additional information, in case you decide to analyze the data fields not covered in this post.
Data field descriptions:
| Field | Description |
|-------|-------------|
| time | Timestamp of the transaction |
| block_id | ID of the block that includes the transaction |
| hash | Hash ID of the transaction |
| size | Size of the transaction in KB |
| weight | Size of the transaction in weight units |
| is_coinbase | Whether this is a coinbase (miner revenue) transaction |
| output_total | Value of the transaction in satoshi (sat) |
| output_total_usd | Value of the transaction in USD |
| fee | Transaction fee in satoshi (sat) |
| fee_usd | Transaction fee in USD |
| details | Additional details about the transaction (e.g., input_total, lock_time, fee_per_kwu) |
Database schema:
CREATE TABLE transactions (
time TIMESTAMPTZ,
block_id INT,
hash TEXT,
size INT,
weight INT,
is_coinbase BOOLEAN,
output_total BIGINT,
output_total_usd DOUBLE PRECISION,
fee BIGINT,
fee_usd DOUBLE PRECISION,
details JSONB
);
SELECT create_hypertable('transactions', 'time');
This schema vastly simplifies analyzing the dataset but still leaves room for future changes in case you find the data fields in details useful; they are still available through the JSONB column.
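If you do end up needing one of those extra fields, PostgreSQL's JSONB operators let you pull it out directly. A quick sketch, assuming the Blockchair field name input_total is present in details:

-- Extract a field that only lives in the details JSONB column.
-- ->> returns text, so cast it before doing numeric work.
SELECT
    time,
    hash,
    (details->>'input_total')::BIGINT AS input_total
FROM transactions
WHERE time > NOW() - INTERVAL '1 day'
ORDER BY time DESC;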
The create_hypertable function transforms the regular PostgreSQL table into a TimescaleDB hypertable, partitioning the table based on the timestamp column and thus optimizing the ingestion and querying of time-series data, while behaving just like a regular PostgreSQL table.
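By default, a hypertable is partitioned into seven-day chunks. If you expect very high transaction volume, you can pick a different chunk interval when creating the hypertable (the one-day interval below is purely illustrative):

-- Optional: set an explicit chunk size instead of the seven-day default.
SELECT create_hypertable(
    'transactions',
    'time',
    chunk_time_interval => INTERVAL '1 day'
);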
To make your database “blockchain-scale,” you need to apply some optimization strategies in PostgreSQL and TimescaleDB to enjoy a satisfying experience while maintaining and querying the data.
Using a hypertable to store transactions lays the groundwork for simple and scalable time-series storage. TimescaleDB, being a PostgreSQL database, lets you use all the PostgreSQL optimization strategies (e.g., indexes) along with TimescaleDB features (like continuous aggregates) to get the best of both tools.
In sum, you can achieve the best experience in your database if you use the things that make PostgreSQL exceptional in the relational database world and TimescaleDB exceptional in the time-series world.
Let’s start with indexes.
When you create a hypertable, TimescaleDB automatically adds a B-tree index on the timestamp column of the table. This already improves most of the queries where you filter by the time column.
For example, this is what you can do to find the most recent transactions from the past two weeks in which the transaction fee was higher than the transaction value:
SELECT
time,
hash,
fee*0.00000001 as fee_btc,
output_total*0.00000001 as value_btc,
output_total_usd as value_usd
FROM transactions
WHERE "time" > NOW() - INTERVAL '2 weeks'
  AND fee > output_total
ORDER BY time DESC;
Just by using TimescaleDB, this query immediately performs better than it would on regular PostgreSQL. Under the hood, TimescaleDB speeds up this query by using the index and excluding chunks that fall outside the defined time period. You can also add more indexes based on the kind of queries you plan to write.
For example, if you plan to build a data application that analyzes individual Bitcoin transactions, you can achieve better query performance by adding an index to the hash column.
Each transaction on the blockchain has a hash value, a 64-character long string that acts as the transaction’s unique identifier. If your application users search for a specific transaction to find out more about it, this is the column they should use to find the transaction in the table.
SELECT * FROM transactions
WHERE hash = '94fb53b99dd2cb7ccf26745c908eebd7c01cb8c19fb84819b796beb7750b0cb8';
You can speed up this query by adding a hash index on the hash column.
CREATE INDEX hash_idx ON public.transactions USING HASH (hash);
If you store more than one hundred million rows of transactions and want to filter by the hash column, this index is essential. It has the power to turn a query that runs for 10 or more seconds into a query that finishes in less than a second.
You can also ensure that you don’t accidentally insert duplicate records by adding a UNIQUE INDEX on the time and hash columns.
CREATE UNIQUE INDEX time_hash_idx ON public.transactions (time, hash);
Continuous aggregates are materialized views for time-series data. They were designed with time-series data in mind and provide an efficient way to refresh and query the aggregated values, and they can even include the latest raw data in results (real-time aggregation). After setting up proper indexes, this is the most important optimization you can make to get lightning-fast queries.
When it comes to time-series data, users usually aggregate data based on time, creating so-called “time buckets.” Time buckets enable you to analyze how things change over time. In the case of the Bitcoin blockchain, it allows you to see how transaction fees, transaction sizes, block weights, etc., change over time.
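For example, here's a time-bucketed query against the schema above that computes the daily average transaction fee:

-- Average transaction fee per day, using TimescaleDB's time_bucket().
SELECT
    time_bucket('1 day', time) AS bucket,
    avg(fee_usd) AS avg_fee_usd
FROM transactions
GROUP BY bucket
ORDER BY bucket;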
These aggregated time-series queries are often recurring queries done by a dashboard or some user-facing application. Hence, it’s important to make these queries as fast as possible to provide a smooth user experience. This is precisely what TimescaleDB’s continuous aggregates will do.
You can set up continuous aggregates to pre-aggregate and materialize information within the database and go from seconds or even minutes to sub-seconds in query execution time.
Even when analyzing a large blockchain database with more than 150 million rows, you can achieve great performance and sub-second query execution speed just by using indexes and continuous aggregates.
Let’s see examples of continuous aggregate setups to store and serve blockchain data.
Continuous aggregates, set up properly, can speed up your queries and make those queries much easier to write. Let's choose three aspects of the data as foundations for the continuous aggregate definitions: transactions, blocks, and miners.
Blockchains are designed to record information without ever updating historical data. When it comes to Bitcoin, the blockchain is just a digital public ledger containing information about transactions. Simply put, a transaction tells you who sent what amount to whom, plus the technical details.
Each transaction has at least one input and output field that indicates which address(es) were used to send BTC to which address(es). You can have multiple “senders” and “receivers” in the same transaction.
Transactions are not mined separately; they are mined (added to the blockchain) in groups. These groups are called blocks.
A blockchain is made up of blocks. These blocks are important from a data analytics perspective for many reasons. Miners get their revenue based on the block they add to the blockchain. Each block has a unique identifier called block_id. Before SegWit, a block had a size limit of 1 MB. After SegWit, a new measure was added to the network called weight unit, and the new limit for a block became four million weight units.
Miners are essential for any blockchain. They keep the network up and running. Miners are incentivized with BTC to maintain and validate the order of blocks on the blockchain. In the case of Bitcoin, miners get their revenue from two sources: block rewards and transaction fees. The block reward is currently 6.25 BTC, and it gets halved every 210,000 blocks mined (about every four years). Blockchain users pay the transaction fee every time they submit a transaction.
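To make this concrete, here is a minimal sketch of one such continuous aggregate built on the transactions table: hourly transaction stats (the view name and the exact aggregates are illustrative):

-- Hourly transaction stats, materialized as a continuous aggregate.
CREATE MATERIALIZED VIEW transactions_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS bucket,
    count(*) AS tx_count,
    sum(fee) AS total_fee_sat,
    avg(fee_usd) AS avg_fee_usd,
    sum(output_total_usd) AS total_volume_usd
FROM transactions
GROUP BY bucket;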
With continuous aggregates and automation policies, you can simplify your work going forward: refresh policies keep the materialized data up to date automatically, and real-time aggregation folds the latest raw data into query results (so you don't need a UNION in all of your queries).

After setting up indexes and continuous aggregates as optimizations, you can focus on analyzing the data and generating insights while having a fast and simple querying experience.
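Setting up the refresh automation is a single function call; the offsets and schedule below are illustrative:

-- Refresh the aggregate every hour, covering data between one day
-- and one hour old; newer rows are served by real-time aggregation.
SELECT add_continuous_aggregate_policy('transactions_hourly',
    start_offset      => INTERVAL '1 day',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');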
Now that you know which features we used to analyze our data using PostgreSQL and TimescaleDB, let’s look at the thought process behind it.
Of course, we think our database is the best option. Give us a chance to explain why with this brief architectural overview.
TimescaleDB is built on PostgreSQL. We are not PostgreSQL-like, and we are not a PostgreSQL fork. We are 100 percent PostgreSQL because we're an extension, which means our storage layer is the same one you'll find in PostgreSQL.
However, we modified PostgreSQL’s schema management by introducing auto-partitioning. Very simply, we partition your data based on the time column, providing long-term scalability and some other significant advantages.
Let's say, for example, that you want to drop one or two years of data. In a plain relational database, you'd run a DELETE statement, deleting rows one by one. With auto-partitioning, or what we call the chunk mechanism, you can set data retention policies and ask the database to delete only data that is, say, one year old or older. The database then drops entire chunk tables rather than individual rows, making the process faster and more efficient.
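In TimescaleDB, that retention rule is one function call (the one-year interval is just an example):

-- Automatically drop chunks once all of their data is older than one year.
SELECT add_retention_policy('transactions', INTERVAL '1 year');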
We also modified the query planner with time-series workloads in mind: if you're filtering by time or aggregating by time, which is something you'd normally do when analyzing blockchain data, whether in one-day or five-minute time buckets, we have features that make those queries flow seamlessly and perform better.
We are full SQL, meaning everything that works in a PostgreSQL database will also work in TimescaleDB. We don’t take anything away from PostgreSQL; we just add a lot of new functionality, like time-series functions.
These will help you analyze time-series data and blockchain data. For example, we have functions that help you create the values needed to build candlestick charts, so you can generate the open, high, low, and close values within the database from raw data.
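As a sketch, reusing the hypothetical btc_usd_rates(time, rate) table from earlier, daily open/high/low/close values can be computed with TimescaleDB's first() and last() aggregates:

-- Daily OHLC values for candlestick charts, straight from raw rate data.
SELECT
    time_bucket('1 day', time) AS day,
    first(rate, time) AS open,
    max(rate)         AS high,
    min(rate)         AS low,
    last(rate, time)  AS close
FROM btc_usd_rates
GROUP BY day
ORDER BY day;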
We also provide a bunch of other statistical functions that can be useful for analyzing blockchain data.
One last piece of advice: if you’re working with blockchain data, try continuous aggregates.
They’re relatively simple to implement and will bring enormous benefits, especially if you’re building a customer-facing analytics platform or maybe software that is not a traditional data application but provides analytics for the users.
We hope this inspired you to do your own research with PostgreSQL and TimescaleDB. The easiest way to get started is to sign up for a free Timescale account and complete the tutorial that walks you through sourcing, ingesting, storing, and analyzing blockchain data.
If you have questions for the community or Timescale team members, feel free to join the Timescale Forum and Slack Community, where we'd be delighted to help you out!