Jan 09, 2024
I am currently the chief data scientist at Rasgo. Before joining Rasgo, I worked as a data scientist and data science leader for over 10 years across various fields. I received my Ph.D. in Systems and Industrial Engineering with a focus on decision-making under uncertainty and probability modeling from the University of Arizona. After my Ph.D., I was an assistant professor of Applied Mathematics at Towson University, where I began working on data science problems.
As my career progressed, I found myself doing less and less technical work in data science and more time spent on team building, interacting with management and customers, and basically talking about data science instead of doing data science. This culminated in 2016 when I took a role with DataRobot to help their prospects and customers use DataRobot’s machine learning platform to implement their data science initiatives successfully.
While this was, and still is, fulfilling work, I missed the technical work that originally drew me into data science, and DataRobot encouraged us to keep doing that kind of work (and to leverage their product while doing it). In short, I became the DataRobot Marketing team's go-to content writer on sports analytics and betting. As part of this, I would build machine learning models to predict the outcomes of sporting events. I regularly wrote blogs on my predictions for Grand Slam tennis, the Super Bowl (and NFL playoffs), the Champions League knockout stage, and the Premier League.
Because of this, and the contacts I have developed over the years as a data scientist, multiple people have asked me to advise on, and even build, proofs of concept of machine learning pipelines that automate both sports betting predictions and market models (from financial markets to sports betting markets), with the aim of monetizing these sports predictions.
In my spare time, I have been building proofs of concept of machine learning pipelines to pull relevant data (this could be sports statistics, historical sports betting odds, or financial market pricing data), use machine learning to make predictions, and share those predictions back to the organization to help them make decisions or bets.
Using PostgreSQL, I would write the original data, the engineered features for modeling, and the final predictions into the database for model monitoring and future model development.
In all of these cases, the underlying data is event data and has time-series characteristics. This means that as a data scientist, I was interested in capturing how the data changed over time. For example, when modeling a sports betting market, I was interested in how the betting line or odds would change over time. Was it rising, falling, oscillating, etc.? Capturing this information as machine learning features was key to my models performing well. I typically used a Python library called tsfresh to build these features.
As the amount of data grew, most of the pipeline's time went to pulling data from the database, building the features, and pushing the results back into the database.
As I was first working on these sorts of problems in my spare time, DataRobot asked me to lead their go-to-market efforts in the sports and entertainment space. In this new role, I began talking with several Formula 1 teams about possibly using DataRobot in their workflows.
“During the investigation of TimescaleDB, I realized I could use PostgreSQL’s support of custom extensions built in Python to allow me to call tsfresh functions directly within the database without the expense of moving the data out or pulling it back in”
As part of these conversations, I learned about Kdb+ from KX Systems. This is a high-performance columnar time-series database used by F1 teams to store and process sensor data collected from their cars (and simulations). Financial institutions also use it to store and process market data. More intriguing from my point of view was that Kdb+ included a reimplementation of tsfresh within the database.
I understood the power of Kdb+, but its price was well out of my budget for these proofs of concept. Still, I liked the idea of a database that supported time-series data and could process it efficiently.
time_bucket() and custom extensions for easier time-series aggregations

I have been using PostgreSQL for years and have included it in the technology stack for these proofs of concept. While other databases supported time series, TimescaleDB's combination of time-series capabilities with the familiarity of PostgreSQL made it the clear choice.
During the investigation of TimescaleDB, I realized I could use PostgreSQL’s support of custom extensions built in Python to allow me to call tsfresh functions directly within the database without the expense of moving the data out or pulling it back in.
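To make that concrete, here is a minimal sketch of the idea, not pgetu's actual code: the wrapper name py_absolute_sum_of_changes is hypothetical, but absolute_sum_of_changes is a real tsfresh calculator, and plpython3u is PostgreSQL's untrusted Python language.

-- Illustrative sketch only: wrap a tsfresh calculator in a PL/Python function.
-- Assumes tsfresh is installed in the database server's Python environment.
CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION py_absolute_sum_of_changes(vals double precision[])
RETURNS double precision AS $$
    from tsfresh.feature_extraction.feature_calculators import absolute_sum_of_changes
    # vals arrives as a Python list; the calculator sums the absolute
    # differences between consecutive values
    return absolute_sum_of_changes(vals)
$$ LANGUAGE plpython3u;

Paired with array_agg(amount ORDER BY date) inside a GROUP BY, a wrapper like this already lets the feature run where the data lives, with no round trip through a client-side Python process.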
In addition, TimescaleDB's time_bucket() function would allow me to perform my time-series aggregations over arbitrary time intervals (e.g., daily or weekly) instead of estimating the time period from a set number of rows. This was a huge improvement over using either tsfresh in Python (which works on a set number of rows) or a window function in PostgreSQL.
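To see the difference, here is a quick, simplified contrast using the transactions table from the example further down; the feature here is just a sum, to keep the focus on the grouping.

-- Row-count framing: "the last 7 rows" only matches "the last week" when the
-- data arrives exactly once per day, with no gaps or duplicates.
SELECT date,
       sum(amount) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_sum_7_rows
FROM transactions;

-- Interval bucketing with TimescaleDB: true calendar weeks, regardless of row counts.
SELECT time_bucket(interval '1 week', date) AS week,
       sum(amount) AS weekly_amount
FROM transactions
GROUP BY week
ORDER BY week;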
I then implemented all of the tsfresh functions as custom PL/Python extensions, building custom types to pass the necessary information into each function and custom aggregates to prepare the data for each time-series feature.
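As a rough sketch of the aggregate side of this (again with hypothetical names, and a plain array as the state instead of the custom types described above), a transition function can collect the values and a final function can compute the feature:

-- Illustrative sketch only: a custom aggregate whose transition function
-- collects values into an array and whose final function (the PL/Python
-- wrapper defined earlier) computes the tsfresh feature.
CREATE OR REPLACE FUNCTION asoc_transition(state double precision[], val double precision)
RETURNS double precision[] AS $$
    SELECT array_append(state, val)
$$ LANGUAGE sql IMMUTABLE STRICT;

CREATE AGGREGATE absolute_sum_of_changes_py(double precision) (
    SFUNC     = asoc_transition,
    STYPE     = double precision[],
    INITCOND  = '{}',
    FINALFUNC = py_absolute_sum_of_changes
);

-- Order matters for this feature, so pass it explicitly when calling:
-- SELECT absolute_sum_of_changes_py(amount ORDER BY date) FROM transactions;

This is the same general shape as the pgetu aggregates shown in the examples below.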
As expected, there were significant benefits to moving this processing into the database. In addition to eliminating the time to extract and insert the data, the database handled running these calculations in parallel much more simply than trying to manage them within Python.
“By releasing this as open source, I hope to help anyone [...] working on any sort of time-series problem to be able to fully leverage the power of TimescaleDB to manage their time-series data”
It was not without its drawbacks, however. First, PL/Python is an untrusted language in PostgreSQL, which limits which users can call these functions. More importantly, each time a function was called, the Python interpreter had to start up and load the tsfresh library and its dependencies. This meant that most of the processing time went into preparing to perform the calculations rather than the calculations themselves, and the speedup was significantly less than expected.
At this point, I saw promise in this approach but needed to make it faster. This led me to reimplement the tsfresh functions in C and create custom C extensions for PostgreSQL and TimescaleDB. With these C extensions, I saw performance improvements of 10 to 100 times over the PL/Python version.
When I saw this performance, I felt I needed to clean up my code and make it available to the community as open source. This led to two packages: etu, which contains the reimplementation of tsfresh feature calculators in C, and pgetu, which includes the C extension that wraps etu and makes it available as functions in PostgreSQL and TimescaleDB.
As an example, these functions can be called in a SELECT statement in TimescaleDB as:
SELECT
    time_bucket(interval '1 month', date) AS month,
    absolute_sum_of_changes(amount, date) AS amount_asoc,
    approximate_entropy(amount, date, 1, 0.1) AS amount_ae,
    fft_aggregated(amount, date, ARRAY['centroid', 'variance', 'skew', 'kurtosis']) AS amount_fft_agg,
    median(amount) AS median_amount,
    mode(amount) AS mode_amount
FROM transactions
GROUP BY month
ORDER BY month;
Here's another example, this time looking at the power output of a windmill based on wind speed.
SELECT
    time_bucket(interval '1 hour', time) AS hour,
    autocorrelation(windspeed, time, 1) AS windspeed_autocorrelation_1,
    autocorrelation(output, time, 1) AS output_autocorrelation_1,
    count_above_mean(output) AS output_cam,
    count_below_mean(output) AS output_cbm,
    energy(windspeed, time) AS windspeed_energy,
    fft_coefficient(windspeed, time, ARRAY[1, 1, 2, 2], ARRAY['real', 'imag', 'real', 'imag']) AS windspeed_fft_coeff_json,
    number_cwt_peaks(windspeed, time, 5) AS windspeed_number_wavelet_peaks_5,
    number_cwt_peaks(output, time, 5) AS output_number_wavelet_peaks_5
FROM sensor_data
GROUP BY hour
ORDER BY hour;
This makes it really easy to perform time-series feature engineering directly in the database.
By releasing this as open source, I hope to help anyone working as a data scientist on time-series-like problems quickly generate features directly within their database, and anyone else working on any sort of time-series problem fully leverage the power of TimescaleDB to manage their time-series data.
Now that the project is open source, I hope to see people benefit from it in their personal projects and, ideally, find enough value that they are interested in helping improve both of these libraries going forward.
This project is the culmination of about one year of part-time development work. I spent the first third building a version in Python and the next third learning how to build C extensions, writing proof-of-concept versions of some of the functions, and testing them to determine whether the speedup was worthwhile (it was: roughly 10x, and up to 100x in extreme cases).
Regarding feedback, I have just released it and received a small amount of positive press on the PostgreSQL subreddit but limited feedback from others. I have also shared it with a number of the data scientists in my network, and the feedback has been overwhelmingly positive.
Currently, etu and pgetu support most of the feature calculators in tsfresh. My next step is to implement the approximately 15 calculators that I have not yet finished. Once that is done, I would like to expand the set of functions these libraries support.
If you need more information about this project or want to discuss it, I can be reached and followed on LinkedIn and Twitter.
We’d like to thank Andrew for sharing his story on how he is creating proofs of concept of machine learning pipelines for time-series forecasting using TimescaleDB.
We’re always keen to feature new community projects and stories on our blog. If you have a project you’d like to share, reach out on Slack (@Ana Tavares), and we’ll go from there.