How Flowkey Scaled Its AWS Redshift Data Warehouse With Timescale
This is an installment of our “Community Member Spotlight” series, where we invite our customers to share their work, shining a light on their success and inspiring others with new ways to use technology to solve problems.
In this edition, Nikola Chochkov, lead data scientist at flowkey, shares how his team migrated from Amazon Redshift to TimescaleDB and is driving rapid growth and experimentation by analyzing the users’ behavioral data using TimescaleDB along with Metabase or Jupyter/Rmarkedown Notebooks.
About the Company
Flowkey is a leading app for learning to play the piano, with over 10 million registered users in more than 100 countries. The company was launched in 2015 and quickly became one of the global leaders in its category.
Here is a video from our founder, Jonas Gößling, explaining how it all started.
About the Team
We are a team of around 40 people, with more than 10 of us working in Data and Engineering.
We have a Marketing team (responsible for user acquisition, customer relationship management, collaborations, etc.), a Creative team (building all of our visual content, our design, and advertising), Course and Song teams (creating our in-app learning content—e.g., the courses series and the piano renditions of the songs in our library). We also have Customer Support, Product, Engineering, Data, and Operations teams.
Many of us take on more than one role, and for many of us, flowkey has been the first significant career step. Not for me personally, though, but still. 🙂
About the Project
We are a business that depends heavily on analytics for business decision-making. Our experimentation—a major driver of our growth—is powered by the data analysis of user behavioral data (app usage). This data consists of user events, which we track from our product.
“We realized we needed to scale our previous Amazon Redshift data warehouse model into a more suitable solution for categorical, time-series data analysis. We found out about TimescaleDB from being part of the PostgreSQL community”
For example, a Learn Session starts at a given timestamp for a user, and we record this event.
When we launch a new feature, we typically A/B test it and evaluate its impact based on measuring key performance indicators (KPIs), which are predefined for the test. Every day, we receive millions of incoming events, tracked around our product, in the format:
(user_id, event, timestamp, properties)
The properties
field is a schemaless object that depends on the particular event type that we track from the app. For example, a Learn Session event would have a songId
and a learnMode
, while a Subscription Offer interaction event would have a productId
, etc.
Choosing (and Using!) TimescaleDB
With time, we realized we needed to scale our previous Amazon Redshift data warehouse model into a more suitable solution for categorical, time-series data analysis.
We found out about TimescaleDB from being part of the PostgreSQL community, and when we were faced with the problem at hand, it was a natural way forward for us.
After doing our research, we realized that TimescaleDB suited our needs perfectly. Here's a list of our arguments:
- Our data analysts are well-versed in SQL and PostgreSQL.
- The events’ raw data is available in a PostgreSQL schema alongside all our other business intelligence data.
- TimescaleDB is an actively developed, open-source solution. It allowed us to deploy it on our self-hosted PostgreSQL data warehouse.
- A TimescaleDB hypertable model would allow us to accommodate the schemaless JSON structure of our events.
select event, platform, time, jsonb_pretty(data) from events_today limit 5;
event | platform | time | jsonb_pretty
------------------------------+----------+-------------------------+---------------------------------------------------------
SONG_OPEN_UI_ELEMENT_CLICKED | ios | 2022-11-03 00:00:00.034 | { +
| | | "songId": "E2kCpqHCwB2xYf7LL", +
| | | "context": "song-cover", +
| | | "listIndex": 6, +
| | | "routeName": "Songs", +
| | | "currentTab": "SongsTab" +
| | | }
ONBOARDING_QUESTION_VIEWED | web | 2022-11-03 00:00:00.145 | { +
| | | "context": "preferredCategories" +
| | | }
SONG_PLAYER_SCROLLED | ios | 2022-11-03 00:00:00.157 | { +
| | | "level": 1, +
| | | "songId": "Lui27TDJ4vZBevxzc", +
| | | "direction": "backwards", +
| | | "songGroupId": "vhisAJBkPwvn6tdoq", +
| | | "loopBoundsMs": null, +
| | | "finalPositionMs": 0, +
| | | "initialPositionMs": 24751 +
| | | }
AB_TEST_VARIANT_ASSIGNED | ios | 2022-11-03 00:00:00.249 | { +
| | | "variant": "CONTROL", +
| | | "experimentName": "player_onboarding_video_09_2022"+
| | | }
ONBOARDING_ANSWER_SUBMITTED | web | 2022-11-03 00:00:00.314 | { +
| | | "answers": [ +
| | | "dont-know" +
| | | ], +
| | | "context": "learningGoals" +
| | | }
- TimescaleDB offers a great set of SQL analytical functions.
- TimescaleDB offers continuous aggregates, which integrate very well with how we do analytics and real-time data monitoring.
- Data migration and update (e.g., renaming of events or JSON properties) are available.
“We use compression, which has cut our disk space usage by 28 percent”
Current Deployment & Future Plans
Our data warehouse is deployed on self-hosted machines and works well for us. We employ other PostgreSQL extensions that aren't currently supported by the Timescale cloud offering, which was important to us when we launched. These include Mongo FDW and Adjust’s iStore extension for cohort analysis data storage.
We employ TimescaleDB's awesome data retention (automated through a user action), and thanks to that, our most recent (and more relevant to our analytics) data is available to us on SSD chunks, while historical data is kept on HDDs.
Furthermore, we use compression, which has cut our disk space usage by 28 percent. Our data contains JSONB fields, which are difficult to be compressed. We are pretty happy with it, though, so it's a win. 🙂
When we do business analytics, we employ Metabase or Jupyter/Rmarkdown Notebooks to derive insights. We established a workflow of writing custom continuous aggregates for the duration of experiments, which are then easy to keep and fully deploy or discard, depending on the decision made for the experiment.
This allows us to iterate our experiments quickly and increase the bandwidth of change, which we can successfully bring to the product.
Roadmap
We just finished migrating our setup to a more powerful cluster of machines, which allowed us to benefit from the data tiering options mentioned above. Right now, our system is scalable, and we don't expect any major upgrades to this system to come up soon.
Advice & Resources
We recommend the Timescale documentation as well as the Slack Community.
Want more insider tips? Sign up for our newsletter for more Developer Q&As, technical articles, and tutorials to help you do more with your data. We deliver it straight to your inbox twice a month.
We’d like to thank Nikola and all of the folks at flowkey for sharing their story on how they’re improving their online piano lessons by analyzing millions of user events daily using TimescaleDB.
We’re always keen to feature new community projects and stories on our blog. If you have a story or project you’d like to share, reach out on Slack (@Ana Tavares), and we’ll go from there.