Sep 11, 2023
Insert performance up to 260% higher, 54x faster queries, and simpler implementation when using TimescaleDB vs. MongoDB for time-series data.
Time-series data has exploded in popularity, and the value of tracking and analyzing how things change over time has become evident in every industry: DevOps and IT monitoring, industrial manufacturing, financial trading and risk management, sensor data, ad tech, application eventing, smart home systems, autonomous vehicles, and more.
But 2020 has provided us with the most personal example of how time-series data collection and analysis affects our daily lives, with billions of people across the globe becoming relentless consumers of time-series data, demanding accurate and timely information about the daily trend of various COVID-19 statistics.
The biggest challenge with storing time-series data? Scale, both in collecting data and storing it. Data accumulates quickly and requires a database that can keep up with a relentless stream of data from the systems you care about. We’ve shown previously that SQL and relational databases can reach petabyte-scale and beyond, but many developers' first inclination still goes to using a NoSQL database for their time-series data when scale is a requirement (perhaps due to the breakthroughs made by NoSQL databases in the early 2000s?).
Enter MongoDB as a time-series solution. MongoDB is among the best-known NoSQL databases, emerging at the end of the last decade to become the face of NoSQL and the foundation of a nearly $21 billion company (as of writing). MongoDB became popular as a simple document store for quickly prototyping and easily scaling web apps.
But, over the years, developers have started using MongoDB to fill all sorts of database needs across various domains, including using MongoDB to store and analyze time-series data at scale.
Here are a few examples of posts we’ve found on the topic of storing time-series data in MongoDB, with sources ranging from the official MongoDB blog to popular technical how-to sites, like Dev.to and Quora:
But is MongoDB really the right solution for time-series data? We decided to evaluate it for ourselves, with the obvious caveat that we are the creators of a competing product. Yet, you’ll see we try to keep our analysis as fair as possible, trying multiple approaches to storing time-series data in MongoDB.
Our conclusion is that while MongoDB’s JSON-like document store may make it a jack-of-all-trades type of database and perhaps a master of some (e.g., web applications), time series is not one of them. You’re better off going with a purpose-built time-series database for both performance and ease of use.
For this analysis, we evaluated MongoDB vs. TimescaleDB, the leading open-source relational database for time-series data (and our own product). Engineered atop PostgreSQL, TimescaleDB is designed for fast ingest, complex queries, and ease of use, with powerful tools for analysis, retention, and management of time-series data. Given its PostgreSQL foundation, it inherits the rock-solid reliability, tooling, and vast ecosystem of Postgres, as well as SQL, the commonly known, well-documented query language that’s popular for data analysis.
We evaluated two methods of using MongoDB as a time-series database:
We compared MongoDB and TimescaleDB across several dimensions:
Here’s a summary of our results. We go into much more detail about our methodology later in the post:
The first method, which we’ll call “Mongo-naive” throughout this post, has fast writes and is extremely simple to implement but offers dismal query performance, even on simple analytical queries.
The second method, which we’ll call “Mongo-recommended” throughout this post, demonstrates 550-6,800% of the query performance of Mongo-naive (method 1). This performance difference makes it easy to see why it is the recommended method for storing time-series data in MongoDB, despite having slower ingest performance (1.5x slower) and higher disk usage (16% more) than Mongo-naive.
However, this method has worse write performance, higher storage footprint, and higher implementation complexity than both Mongo-naive and TimescaleDB. Mongo-recommended also fails to deliver good query performance on more complex time-series queries compared to TimescaleDB.
More specifically, compared to MongoDB, TimescaleDB exhibits:
Create a free account to get started with a fully managed TimescaleDB instance (100% free for 30 days).
Want to host TimescaleDB yourself? Visit our GitHub to learn more about options, get installation instructions, and more (and, as always, ⭐️ are appreciated!)
(P.S. If you’re looking for more comparisons of database performance for time-series data, read our studies comparing AWS Timestream vs. TimescaleDB and InfluxDB vs. TimescaleDB.)
In the remainder of this post, we’ll outline the methodology and results used to conduct a detailed set of benchmarks that compare TimescaleDB vs. MongoDB across inserts, queries, and ease of use.
As mentioned, we tested two methods for storing time-series data in MongoDB, and before diving into write and read performance numbers, let’s take a moment to examine each method in a bit more detail.
We’ll use the common time-series scenario of DevOps metrics (in this case, storing CPU metrics) to demonstrate how to store time-series data in MongoDB.
The first method, which we’ll refer to as “Mongo-naive,” is straightforward: each time-series reading is stored as a document. This is the first approach that would come to mind for most developers when trying to store time-series data in MongoDB, hence the name "naive".
So, for our given use case (see setup below) of monitoring CPU metrics, the JSON document looks like this:
{
"measurement" : "cpu",
"timestamp_ns" : NumberLong("1451606420000000000"),
"fields" : {
"usage_user" : 98.15664166391042,
"usage_guest_nice" : 85.53066998716453,
...
},
"tags" : {
"hostname" : "host_2019",
"datacenter" : "us-east-1b",
...
}
}
Conceptually and in implementation, this method is very simple, so it seems like a tempting route to go: batch all measurements that occur at the same time into one document along with their associated tags and store them as one document.
Indeed, this approach yields very good write performance, and it is fairly easy to implement (in our tests, we saw noticeably better results than InfluxDB reports in their comparison). However, we’ll see later that, even with indexes, the query performance with this method leaves much to be desired.
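To make the method concrete, here is a minimal sketch in Python of how a naive reading document might be assembled (the helper name and sample values are illustrative, not benchmark code; with a live MongoDB, you would batch these dicts into pymongo's insert_many):

```python
def make_naive_doc(measurement, timestamp_ns, fields, tags):
    """Build one document per reading, mirroring the schema shown above."""
    return {
        "measurement": measurement,
        "timestamp_ns": timestamp_ns,   # nanosecond-precision timestamp
        "fields": fields,               # metric readings taken at this time
        "tags": tags,                   # metadata identifying the source
    }

doc = make_naive_doc(
    "cpu",
    1451606420000000000,
    {"usage_user": 98.15664166391042, "usage_guest_nice": 85.53066998716453},
    {"hostname": "host_2019", "datacenter": "us-east-1b"},
)
# With a live connection: collection.insert_many(batch_of_docs)
```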
This second method is one that MongoDB users and the company itself recommend when it comes to time series, which we call “Mongo-recommended.” The engineering idea behind this method is clever: create a document for each device, for each hour, which contains a 60 by 60 matrix representing every second of every minute in that hour. This document is then updated each time a new reading comes in, rather than doing a new document insert:
{
"measurement" : "cpu",
"doc_id" : "host_15_20160101_00", // for quick update lookup
"key_id" : "20160101_00", // YYYYMMDD_hh
"tags" : {
"hostname" : "host_6",
"datacenter" : "ap-southeast-1b",
...
},
"events" : [
[
{
"timestamp_ns" : NumberLong("1451606420000000000"),
"usage_user" : 98.15664166391042,
"usage_guest_nice" : 85.53066998716453,
...
},
... // (59 elements elided)
],
... // (59 elements elided)
]
}
This method makes it possible to do some efficient filtering when it comes to queries but comes with a more cumbersome implementation and decreased (albeit not terrible) write performance. For example, to manage writes efficiently, the client must keep a cache of which documents have already been created so that a more costly “upsert” (i.e., insert if it doesn’t exist; otherwise, update) pattern is not needed.
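Here is a sketch of that client-side bookkeeping in Python, assuming per-device, per-hour documents identified by a hostname plus YYYYMMDD_hh key as in the example above (the class and method names are our own illustration, not part of MongoDB's API):

```python
from datetime import datetime, timezone

NS_PER_SECOND = 1_000_000_000

def hour_key(timestamp_ns):
    """YYYYMMDD_hh bucket for a nanosecond timestamp (UTC)."""
    ts = datetime.fromtimestamp(timestamp_ns // NS_PER_SECOND, tz=timezone.utc)
    return ts.strftime("%Y%m%d_%H")

class HourDocCache:
    """Remembers which per-device, per-hour documents already exist, so the
    client can issue a plain update instead of a more costly upsert."""

    def __init__(self):
        self._created = set()

    def operation_for(self, hostname, timestamp_ns):
        doc_id = f"{hostname}_{hour_key(timestamp_ns)}"
        if doc_id in self._created:
            return ("update", doc_id)  # e.g., $set on events.<minute>.<second>
        self._created.add(doc_id)
        return ("insert", doc_id)      # create the 60x60 matrix document first

cache = HourDocCache()
op1, doc_id = cache.operation_for("host_6", 1451606420000000000)  # first reading
op2, _ = cache.operation_for("host_6", 1451606421000000000)       # same hour
```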
Further, while queries for this method are typically more performant, we found that designing the query in the first place requires more effort than method 1, especially when reasoning about which aggregate documents can be filtered/pruned.
Finally, this approach limits the granularity of your data. In particular, if you wanted to support millisecond precision, you would have to change the design to aggregate on a minute basis, as the max document size in MongoDB (16MB) does not lend itself to further nesting.
Going beyond millisecond precision is probably infeasible: MongoDB’s document size limit means you would likely need to create a document per device per second, and the nesting required would make the query construction process extremely complex.
Because the data in our evaluation was at the granularity of seconds, not milliseconds, and given the query performance we saw (as detailed in the next section), we ultimately decided that this method is probably the best method for comparison against TimescaleDB. (We include Mongo-naive write and read performance numbers to show how we reached this conclusion.)
Once again, we’ll use the common time-series scenario of DevOps metrics to benchmark the performance of the two MongoDB methods and TimescaleDB.
Note: We've released all the code and data used for the benchmarks below as part of the open-source Time Series Benchmark Suite (TSBS).
For comparing both insert and read latency performance, we used the following setup:
To benchmark both write and read performance, we used the following dataset: the cpu-only use case, with indexes created on the tags_id and hostname columns (ordered by time descending) and on the usage_user column.
Because NoSQL databases typically trade off some guarantees of relational databases, one might expect MongoDB to achieve better write performance/throughput, making it an inviting choice for ingesting time-series data, which can arrive at a rate of thousands of readings per second (or more).
While it’s true that plain PostgreSQL does tend to lose write throughput as the dataset size grows (See our PostgreSQL vs. TimescaleDB benchmark results for more), TimescaleDB’s unique chunking mechanism keeps write performance high (learn more about chunks).
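The intuition behind chunking, as a toy illustration in Python (this is not TimescaleDB's implementation): rows are routed to time-bounded partitions, so new inserts only touch the small, recent chunk and its indexes rather than one ever-growing table and index.

```python
NS_PER_DAY = 24 * 3600 * 1_000_000_000  # assume one chunk per day for illustration

def chunk_range(timestamp_ns, interval_ns=NS_PER_DAY):
    """Map a timestamp to the [start, end) time range of the chunk that owns it."""
    start = timestamp_ns - timestamp_ns % interval_ns
    return (start, start + interval_ns)

start, end = chunk_range(1451606420000000000)  # a reading on 2016-01-01
```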
As a result, TimescaleDB outperformed both MongoDB configurations by a significant margin: TimescaleDB saw 169% better insert performance compared to the Mongo-naive method and 260% better write performance compared to Mongo-recommended.
All three setups achieve write performance of over 1 million metrics per second. However, only TimescaleDB seems suitable for performance-critical, time-series use cases, as it achieved very high ingest rates on the high cardinality benchmark dataset, with an average insert rate of 2.7 million metrics per second. The sluggishness of the Mongo-recommended method’s ingest rate is likely due to the extra cost of occasionally creating new, larger documents (e.g. when a new hour or device is encountered).
Given that time-series data piles up quickly and the costs associated with storing large amounts of data, it’s also worth exploring the disk storage footprint of each database configuration:
TimescaleDB uses 9% of the disk space of the Mongo-naive method and 8% of the disk space of the Mongo-recommended method. This is thanks to TimescaleDB’s novel hybrid row/columnar storage approach, which uses Gorilla compression for floats; delta-of-delta and simple-8b with run-length encoding for timestamps and integer-like types; whole-row dictionary compression for a few repeating values, with LZ compression on top; and LZ-based array compression for all other types.
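To give a flavor of why these encodings shrink regular time-series data so dramatically, here is an illustrative delta-of-delta round trip in Python (a simplified sketch of the idea, not TimescaleDB's actual encoder): timestamps arriving at a fixed cadence reduce to runs of zeros, which run-length encoding then stores almost for free.

```python
def delta_of_delta_encode(timestamps):
    """Store the first value, the first delta, then deltas-of-deltas."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + dods

def delta_of_delta_decode(encoded):
    """Invert the encoding by re-accumulating deltas."""
    if len(encoded) < 2:
        return list(encoded)
    out = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        out.append(out[-1] + delta)
    return out

ts = [1451606420, 1451606430, 1451606440, 1451606450]  # regular 10 s cadence
enc = delta_of_delta_encode(ts)  # -> [1451606420, 10, 0, 0]
```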
(Interested readers can learn more about how TimescaleDB’s native compression works, as well as this explanation of time-series compression algorithms and how they work.)
Moreover, for the ~1 billion metric benchmark dataset, the Mongo-recommended method used more disk space than both the Mongo-naive method and TimescaleDB, in addition to being slower than Mongo-naive on inserts.
That said, we recommend doing an honest analysis of your insert needs. If your insert performance is far below these benchmarks (e.g., if it is 2,000 rows/second), then insert performance will not be your bottleneck, and this comparison becomes moot.
Before we compared MongoDB against TimescaleDB, we first evaluated the query performance between the two MongoDB methods.
By this point, Mongo-naive had demonstrated better write performance with simpler implementation and lower disk usage, but we suspected that Mongo-recommended would outperform Mongo-naive for query performance, justifying its recommendation by the MongoDB team and users.
And, if there were a clear winner between the two methods for simple queries, we could save ourselves some time by not implementing our full query set against both methods.
So, we first compared the two MongoDB methods using three “single rollup” queries (single groupby on time), one “double rollup” query (double groupby on time and device hostname), and one “aggregate” query (max reading over a time period). All queries were run with 1000 different parameter permutations (i.e., with random time ranges and hosts), from which we recorded the mean time for each database respectively.
Latencies in this chart are all shown in milliseconds, with an additional column showing the relative performance of MongoDB-recommended compared to MongoDB-naive.
Here are the results:
Mongo-recommended outperforms Mongo-naive in all of the queries above, demonstrating 550-6,800% of the performance of Mongo-naive. This performance difference makes it easy to see why it is the recommended method for storing time-series data in MongoDB, despite having slower ingest performance (1.5x slower) and higher disk usage (16% more) than Mongo-naive.
Given this significant difference in query performance, compared to the modest difference in write performance and disk storage footprint, we decided that this approach was probably the best setup for storing time-series data in MongoDB. Thus, for the remainder of this post and our analysis, we use the “Mongo-recommended” setup whenever benchmarking MongoDB.
(Note to readers: If all you want to get out of this post is the best way to store time-series data in MongoDB, here’s your answer: use the “Mongo-recommended” method. If ingest performance is more important to you than query performance, use “Mongo-naive.” But, if you want even better performance than either Mongo method, you may want to continue reading.)
Having settled on which MongoDB method is best, let’s evaluate MongoDB vs. TimescaleDB for querying time-series data.
The best way to benchmark read latency is to do it with the actual queries you plan to execute. For this case, we use a broad set of queries to mimic the most common query patterns. The results shown below are the average from 1000 queries with different parameter combinations (time range, device ID) for each query type.
Latencies in this chart are all shown in milliseconds, with an additional column showing the relative performance of TimescaleDB compared to MongoDB (highlighted in orange when TimescaleDB is faster, in green when MongoDB is faster).
Let’s unpack the results for each query type below:
For simple rollups (i.e., groupbys), when aggregating one metric across a single host for 1 or 12 hours or multiple metrics across one or multiple hosts (either for 1 hour or 12 hours), TimescaleDB performs comparably to or outperforms MongoDB.
In particular, when aggregating one or more metrics on a single device for a single hour, the two databases show fairly equal performance. However, when aggregating one or more metrics across multiple devices for multiple hours, TimescaleDB shows between 208% and 302% of the performance of MongoDB.
For calculating a simple aggregate (i.e., finding the maximum value) for metrics from one or more devices, TimescaleDB outperforms MongoDB. In our benchmark, TimescaleDB demonstrates 396% of the performance of MongoDB when aggregating 8 metrics across 4,000 devices and 195% when aggregating one metric across 4,000 devices.
While single rollups and aggregates are somewhat comparable across the two databases, other more complex queries are not. For double rollups aggregating metrics by time and another dimension (e.g., GROUP BY time, deviceId), TimescaleDB shows large gains.
When aggregating one metric per device, per hour, for some 24-hour window, TimescaleDB showed 1,507% (or 15x) of the performance of MongoDB.
When aggregating 10 metrics, TimescaleDB showed 1,327% (or 13x) of the performance of MongoDB.
The largest performance difference came when aggregating 5 metrics per device, per hour, for some 24-hour window, where TimescaleDB achieved 2,149% (or 21x) of the performance of MongoDB.
Notice that these queries take on the order of tens of seconds (rather than milliseconds), so a 13-21x performance gain is very noticeable.
For complex queries that go beyond simple rollups or aggregates, the comparison is much more clear-cut: TimescaleDB vastly outperforms MongoDB here (in some cases, more than 50 times faster).
The first complex query (‘lastpoint’) finds the latest reading for every device in the dataset. This query type is commonly used in IoT and DevOps for analysis and monitoring.
While our dataset has all devices reporting at consistent intervals, this query can be troublesome to implement in the general case because it could be that some devices have not reported in quite a long time, potentially causing a lot of documents (MongoDB) or rows (TimescaleDB) to be scanned.
We are able to do some clever query construction in both to get a list of distinct devices, which allows both setups to stop searching when every device has a point associated with it.
We utilize JOINs in both systems. However, while MongoDB does support JOINs, they are not as natural to work with or as "feature-full" as they are in relational databases like TimescaleDB. For the more complex lastpoint query, TimescaleDB shows 5,399% (or 54x) of the performance of MongoDB.
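To make the lastpoint semantics concrete, here is an in-memory sketch in Python of what the query computes (illustrative only; in the benchmark, this work happens inside each database):

```python
def lastpoint(readings):
    """readings: iterable of (hostname, timestamp_ns, value) tuples.
    Returns the most recent reading per device."""
    latest = {}
    for host, ts, value in readings:
        if host not in latest or ts > latest[host][0]:
            latest[host] = (ts, value)
    return latest

rows = [
    ("host_1", 100, 0.5),
    ("host_2", 120, 0.7),
    ("host_1", 130, 0.9),  # newer reading supersedes the first for host_1
]
result = lastpoint(rows)
```

The database-side challenge is computing this without scanning every row, which is where the distinct-device list and index-aware JOINs come in.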
The second complex query (‘groupby-orderby-limit’) does a single rollup on time to get the MAX reading of a CPU metric on a per-minute basis for the last 5 intervals for which there are readings before a specified end time. This is another type of query common in DevOps and IT monitoring workloads.
This is tricky because those last 5 intervals could be the 5 minutes before the end time, or if there is no data for some minute periods (i.e., “gaps”), they could be spread out, potentially needing a search from the beginning just to find all 5.
In fact, MongoDB required a full collection scan for this query, while TimescaleDB’s planner was able to use its indexes intelligently. The result: TimescaleDB showed 2,108% (or 21x) of the performance of MongoDB.
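An in-memory sketch of the groupby-orderby-limit semantics in Python (illustrative, not either database's execution plan) shows why gaps matter: the query wants the five most recent minute buckets that contain data, which may be spread far apart in time.

```python
NS_PER_MINUTE = 60 * 1_000_000_000

def groupby_orderby_limit(readings, end_time_ns, limit=5):
    """Max reading per minute bucket, for the last `limit` buckets that
    actually contain data before end_time_ns."""
    buckets = {}
    for ts, value in readings:
        if ts >= end_time_ns:
            continue
        minute = ts - ts % NS_PER_MINUTE  # truncate to the minute bucket
        buckets[minute] = max(value, buckets.get(minute, float("-inf")))
    # Newest buckets first; only buckets with data count toward the limit.
    return sorted(buckets.items(), reverse=True)[:limit]

M = NS_PER_MINUTE
readings = [(0, 1.0), (M + 10, 2.0), (3 * M, 5.0), (3 * M + 5, 4.0)]  # minute 2 is a gap
result = groupby_orderby_limit(readings, end_time_ns=4 * M)
```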
Performance is not the only aspect to consider when deciding which database to use to store time-series data. Developer experience and implementation complexity are also important factors.
With that in mind, we compared the query language differences between TimescaleDB and MongoDB, using the complex query “groupby-orderby-limit” from our performance analysis.
Once again, we may be biased, but we found SQL (5 lines of code) much simpler than MongoDB’s query language (72 lines of code). This is a crucial criterion for sustainable software development since humans create, maintain, and use these systems at the end of the day. This is another point in favor of SQL in the ongoing NoSQL vs. SQL debate.
TimescaleDB query (SQL):
SELECT date_trunc('minute', time) AS minute, max(usage_user)
FROM cpu
WHERE time < '2016-01-01 19:47:52.646325 -7:00'
GROUP BY minute
ORDER BY minute DESC
LIMIT 5
And here’s that same query expressed in MongoDB.
MongoDB query:
[
{
$match: {
measurement: "cpu",
key_id: {
$in: ["20160101_00", "20160101_01", "20160101_02", "20160101_03", "20160101_04", "20160101_05", "20160101_06", "20160101_07", "20160101_08", "20160101_09", "20160101_10", "20160101_11", "20160101_12", "20160101_13", "20160101_14", "20160101_15", "20160101_16", "20160101_17", "20160101_18", "20160101_19", "20160101_20", "20160101_21", "20160101_22", "20160101_23", "20160102_00", "20160102_01", "20160102_02", "20160102_03", "20160102_04", "20160102_05", "20160102_06", "20160102_07", "20160102_08", "20160102_09", "20160102_10", "20160102_11", "20160102_12", "20160102_13", "20160102_14", "20160102_15", "20160102_16", "20160102_17", "20160102_18", "20160102_19", "20160102_20", "20160102_21", "20160102_22", "20160102_23", "20160103_00", "20160103_01", "20160103_02", "20160103_03", "20160103_04", "20160103_05", "20160103_06", "20160103_07", "20160103_08", "20160103_09", "20160103_10", "20160103_11", "20160103_12", "20160103_13"]
}
}
},
{
$project: {
_id: 0,
key_id: 1,
tags: "$tags.hostname",
events: 1
}
},
{$unwind: "$events"},
{
$project: {
key_id: 1,
tags: 1,
events: {
$filter: {
input: "$events",
as: "event",
cond: {
$and: [
{$gte: ["$$event.timestamp_ns", 1451606400000000000]},
{$lt: ["$$event.timestamp_ns", 1451827606646325489]}
]
}
}
}
}
},
{$unwind: "$events"},
{
$project: {
time_bucket: {
$subtract: [
"$events.timestamp_ns",
{$mod: ["$events.timestamp_ns", 60000000000]}
]
},
field: "$events.usage_user"
}
},
{
$group: {
_id: "$time_bucket",
max_value: {$max: "$field"}
}
},
{$sort: {_id: -1}},
{$limit: 5}
]
Two things jump out besides the verbose JSON syntax: the list of key_id values to scan must be computed on the client and embedded directly in the query, and the nested events arrays force multiple $unwind and $project stages just to get the data into a shape that can be grouped.
In this example, the SQL query is far shorter and simpler, making it easier to comprehend and debug. This is not the only advantage of using SQL (and TimescaleDB) to query time-series data: SQL has a rich tradition and history, including familiarity among millions of developers and a vibrant ecosystem of tutorials, training, and community leaders.
In short, as the third-most commonly used programming language among developers, choosing SQL means you’re never alone.
TimescaleDB delivers 260% higher insert performance, up to 54x faster queries, and a better developer experience vs. MongoDB
Is it surprising that TimescaleDB, a database purpose-built for time-series data, outperforms MongoDB, a general-purpose document store, when it comes to time-series data?
Not necessarily—but there are enough blogs, talks, and other material out there about using MongoDB for time-series data that we felt we needed to do an evaluation.
We understand that for many users, MongoDB offers the benefit of being easy to learn and quick to set up. Yet, for time-series data, setting up MongoDB to deliver the performance your workloads require is not simple, as we’ve shown in our analysis, and requires careful thought about your schema design.
If you simply dump each reading into a new document, you’re in for longer and longer wait times as more data accumulates and you want to query it. Moreover, implementing MongoDB's recommended time-series approach requires writing client-side code and using fairly verbose queries.
We may be biased, but we'd contend that rather than doing all of the above and forcing MongoDB—a general-purpose document store—to behave like a time-series database, opting for a database built specifically for time-series data is a better use of time and resources.
And, as we've shown, when it comes to time-series workloads, TimescaleDB—a purpose-built time-series database—delivers significantly better results on every dimension. If you're weighing your options, based on our analysis, TimescaleDB is the clear choice.
Join our Slack community to ask questions, get advice, and connect with other developers (our co-founders, engineers, and passionate community members are active on all channels).