Mar 10, 2021
Posted by
Rob Kiefer
Taking the BS out of benchmarking with a new framework released by TimescaleDB engineers to generate time-series datasets and compare read/write performance of various databases.
As engineers look to open-source databases to help them collect, store, and analyze their abundance of time-series data, they often realize that picking the right solution is harder than they originally thought.
And with time-series databases being the fastest-growing database category of the past 24 months, picking the right solution matters to more people than ever before.
It can be difficult to choose between databases because 1) each database offers a different set of “original” features, and 2) there hasn’t been a general-purpose tool to compare the performance of one system against another on time-series workloads¹.
Until now.
Similar to the reasoning behind Yahoo!’s introduction of the Yahoo! Cloud Serving Benchmark (YCSB) for comparing cloud datastores in 2010, we thought it was time for a new standard of benchmarking for the time-series database market. And with that we give you the Time Series Benchmark Suite (TSBS).
TSBS is a collection of Go programs used to generate time-series datasets and then benchmark the read and write performance of various databases. Beyond providing a defined framework for testing databases against each other, the goal of this project is to make TSBS extensible, so that a variety of use cases (e.g., devops, finance), query types, and databases can be included and benchmarked.
We thought TSBS was important to build for a few reasons, chief among them the maintainability and extensibility issues we found with existing benchmarking tools².
The initial release supports benchmarking with TimescaleDB, MongoDB, InfluxDB, and Cassandra.
We’ve also made it pretty simple to write a new interface layer in order to benchmark a new database. (If you are the developer of a time-series database and want to include your database in the TSBS, feel free to open a pull request to add it!)
(To see TSBS in action, check out our blog posts comparing TimescaleDB vs. Cassandra and vs. MongoDB for time-series data.)
Using TSBS for benchmarking involves three phases: data and query generation, data loading/insertion, and query execution.
(Note: TSBS benchmarks bulk load performance and query execution performance, but it does not currently measure concurrent insert and query performance. We are working on this feature and will update users when it becomes available.)
At the time of publishing this post, TSBS supports one use case, DevOps, in two forms. The full form is used to generate, insert, and measure data from 9 “systems” that could be monitored in a real-world DevOps scenario (e.g., CPU, memory, disk). The alternate form focuses solely on CPU metrics for a simpler, more streamlined use case.
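To make phase one concrete, here is a sketch of generating a dataset and a matching query set for the CPU-only form, serialized for TimescaleDB. The flags below follow the TSBS README at the time of writing and may differ between versions; the seed, scale, time range, and file paths are arbitrary example values:

# Phase 1a: generate time-series data for the CPU-only form of the DevOps use case
tsbs_generate_data --use-case="cpu-only" --seed=123 --scale=4000 \
    --timestamp-start="2016-01-01T00:00:00Z" \
    --timestamp-end="2016-01-04T00:00:00Z" \
    --log-interval="10s" --format="timescaledb" \
    | gzip > /tmp/timescaledb-data.gz

# Phase 1b: generate a matching set of 1,000 queries of a single type
tsbs_generate_queries --use-case="cpu-only" --seed=123 --scale=4000 \
    --timestamp-start="2016-01-01T00:00:00Z" \
    --timestamp-end="2016-01-04T00:00:01Z" \
    --queries=1000 --query-type="single-groupby-1-1-1" \
    --format="timescaledb" \
    | gzip > /tmp/timescaledb-queries.gz

Reusing the same seed and scale when generating data for another database's --format is what keeps comparisons apples-to-apples: every system is tested against an identical dataset and query set.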
Example of insert:
time,per. metric/s,metric total,overall metric/s,per. row/s,row total,overall row/s
… # many lines before this
1518741528,914996.143291,9.652000E+08,1096817.886674,91499.614329,9.652000E+07,109681.788667
1518741548,1345006.018902,9.921000E+08,1102333.152918,134500.601890,9.921000E+07,110233.315292
1518741568,1149999.844750,1.015100E+09,1103369.385320,114999.984475,1.015100E+08,110336.938532
Summary (how many metrics and, where applicable, rows were inserted; the wall time it took; and the average insertion rate):
loaded 1036800000 metrics in 936.525765sec with 8 workers (mean rate 1107070.449780/sec)
loaded 103680000 rows in 936.525765sec with 8 workers (mean rate 110707.044978/sec)
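For reference, periodic output and summaries like the above come from one of the per-database bulk loaders (phase two), which read the pre-generated file on stdin. A minimal sketch targeting TimescaleDB, with connection flags omitted; exact options vary by database and TSBS version:

# Phase 2: bulk load the pre-generated data into TimescaleDB
cat /tmp/timescaledb-data.gz | gunzip \
    | tsbs_load_timescaledb --workers=8 --batch-size=10000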
Example of query execution:
run complete after 1000 queries with 8 workers:
TimescaleDB max cpu all fields, rand 8 hosts, rand 12hr by 1h:
min: 51.97ms, med: 757.55ms, mean: 2527.98ms, max: 28188.20ms, stddev: 2843.35ms, sum: 5056.0sec, count: 1000
all queries:
min: 51.97ms, med: 757.55ms, mean: 2527.98ms, max: 28188.20ms, stddev: 2843.35ms, sum: 5056.0sec, count: 1000
wall clock time: 633.936415sec
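Output like this comes from the corresponding per-database query runner (phase three), fed the query file produced in phase one; it reports per-query-type and overall latency statistics (min/med/mean/max/stddev), as seen above. A minimal sketch, reusing the query file generated earlier (flags again may vary by version):

# Phase 3: execute the pre-generated queries against TimescaleDB
cat /tmp/timescaledb-queries.gz | gunzip \
    | tsbs_run_queries_timescaledb --workers=8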
Over the next few months, we will be creating new use cases and adding more databases for comparison, and we certainly hope users will take advantage of this tool as well. Ultimately, we want to help prospective time-series database administrators and engineers find the best solution for their needs and their workloads.
Interested in benchmarking time-series databases? We encourage you to take a look at the resources we have available: check out the GitHub page, read our benchmarking blog posts on Cassandra and MongoDB (both of which use TSBS), and learn more about TimescaleDB and how to get started.
[1] While TSBS is based on a tool from InfluxData, that tool was never marketed as a general-purpose comparison tool for a wide audience, which is one of our goals. For full disclosure, the main similarities and differences are as follows: the usage methodology (i.e., pre-creating data sets and query sets, with separate load and query stages) originates from InfluxData’s comparison write-up. While most of the common loaders follow the same basic scaffold as InfluxData’s original tool, they have been refactored to share much more code and extended with extra measurement options. A similar refactoring was done for the query runners. However, due to the addition of 4 new query-type groups and different approaches to storage, a lot of code was changed or added in the Cassandra binaries, the MongoDB code is almost wholly rewritten, and the TimescaleDB code is completely new. Most of the InfluxDB-specific code remains the same.
[2] The issues we found with the existing InfluxDB benchmarker revolved around maintainability and extensibility, as well as the fact that it overlooked common patterns for certain databases. For example, when we benchmarked against MongoDB, we implemented a much faster time-series data structure (similar to one suggested by MongoDB itself) than the one InfluxData provided.