Mar 10, 2021
Posted by
Matvey Arye
By utilizing the properties of time-series data, can we make the maintenance of materialized views efficient enough so they act like indexes?
Materialized views have been part of the data ecosystem for decades. From OLAP data cubes to continuous queries, the idea of pre-computing information needed to execute common queries is an obvious and common one. It is driven by the need for users to see low-latency results for their queries. This need is only expanding as we become increasingly connected and are driven to understand and analyze our devices in near real-time.
Indexing is another technique often used for the same purpose: to allow users to get results quicker. Yet indexes and materialized views feel very different: indexes are viewed as part of the table, while materialized views are seen as their own separate entity. Often those running analyses don’t even think about the indexes, interacting solely with the raw data while indexing decisions are made by the database operator. Indexing is understood as a pure optimization technique; in a sense they are transparent to the data consumer.
In contrast, the mental model for materialized views is that they are a separate thing from the underlying data. Materialized views are built from the underlying data but maintained and consumed separately, and users query these views separately from the table they are based on.
We argue that the reason that these two techniques are treated so differently is that indexes are transparent: queries using an index always return the same data as querying without an index, while querying a materialized view often returns stale data. An index can be “hidden”, while the choice of using a materialized view must be left to the user.
But why does a materialized view need to return stale data? We contend that returning stale data is the result of limitations in implementation and workload characteristics, not a fundamental characteristic of a materialized view in itself.
With traditional OLTP or OLAP workloads, random parts of your data space are written by each transaction. This makes maintaining live aggregates over your data unwieldy. Recomputing an aggregate each time an included row is modified is very expensive if those aggregates are spread out all over your disk, and you can’t amortize the cost. Imagine changing one row in an aggregate of 40,000 rows and having to recompute the entire aggregate. How wasteful! To get around this problem, these systems use a batch model that re-computes all (or a large swath) of the aggregates all at once (often overnight) and incorporates many changes in the same run. Instead of incorporating the change due to a single row, these systems incorporated all the changes that happened that day.
However, time-series data is different.
With time-series data, changes aren’t randomly spread across the data space, but rather mostly occur at the current time. This property can be used to optimize the maintenance of materialized views that aggregate over time intervals. When recomputing aggregates because of data changes, our recomputation can batch many changes in a single run because those changes are close together in time in both your raw data and in the resulting aggregates.
Our vision is that by utilizing the properties of time-series data, we can make the maintenance of materialized views efficient enough that materialized views act like indexes. That is, we can make them up-to-date and guarantee that you’ll get the same answer from querying the materialized view as you would get from querying the underlying data.
An always-correct, always-current materialized view.
Eventually, we could even hide the existence of the view from the user.
Imagine that.
We create a materialization threshold that represents a point in time. Everything before this threshold has been materialized and is up-to-date. Everything afterwards has not yet been materialized and/or is incomplete.
We have a materialization job that runs on a schedule and is responsible for materializing more recent data, and moving the threshold forward in time. Yet the threshold is always kept behind the point-in-time where data changes are common.
When data that has already been materialized changes, the materialization is updated as part of the insert operation, keeping this materialization current. This does incur write amplification overhead, but since the materialization threshold lags the currently-changing area of data, this cost is rarely incurred.
During queries, we combine data from before the materialization threshold with a fresh aggregation of data after the materialization threshold. This gives you a completely up-to-date and correct result for all time ranges whether it is before or after the materialization threshold.
What we described above is our vision. We are making progress, but we aren’t quite there yet.
The release of TimescaleDB 1.3 includes a materialized view feature we call continuous aggregates. It includes a system to materialize data up until a threshold in the background. It also tracks changes to the underlying data that occurred before the threshold, and corrects the materialized aggregates next time the materialization process runs. Data is selected explicitly from the materialized view.
In upcoming releases we will add:
With these changes, we could even hide the existence of the materialized view from the user, and achieve our vision of treating materialized views like indexes.
If you want to help make this vision a reality please read the tutorial, try out this feature, and help us by submitting GitHub issues to let us know how we can improve. Also, always feel free to get in touch via email or Slack.
Finally, if you want to read more posts like this one, sign up for our mailing list below.