Dev Q&A

Dec 01, 2024

How Hopthru Powers Real-Time Transit Analytics From a 1 TB Table

How Hopthru Powers Real-Time Transit Analytics From a 1 TB Table

This is an installment of our “Community Member Spotlight” series, in which we invite our customers to share their work, spotlight their success, and inspire others with new ways to use technology to solve problems.

In this edition, Brock Haywood, chief technical officer (CTO) at Hopthru, explains how his team uses Timescale to help transit agencies analyze and make data-driven decisions to improve public transportation. As the leader of a small team tackling such a massive endeavor and powering real-time analytics from a 1 TB data table, Brock also shares how managed services, like Timescale Cloud, relieve the load for developers.

About Hopthru and APC Data

Hopthru's mission is to help make public transport the most compelling mode of urban travel. The main product we offer is an analysis suite for reviewing ridership data. The ridership data we ingest is called APC, which is automated passenger counting. 

Most of our customers are governmental agencies located in the US. At any time, but particularly during COVID, when resources were limited, the efficiency of routes and the overall transit system was crucially important for these agencies.

About the Team

Hopthru is a small team of five people based in Seattle, Washington State. I joined the organization about three years ago; it will be three years this June. I’m the company's CTO, but since we’re a five-person team, it’s predictably a flat organization. There are three of us on the engineering team, and two people are working in sales.


Collecting and Analyzing APC Data

At the very least, most agencies in North America will have some subset of their fleet of buses and trains fitted with what we call APCs. If you get on a bus and look up at the door, there’ll be this little sensor counting the number of people on and off the bus. 

When you correlate that data—so that timestamped on and off data (I think you can probably see where Timescale comes into this)—with another system that is on virtually all buses in North America, which is called the automated vehicle location (AVL) system, we have these two datasets. They often come as already pre-merged data, with a location, the time, and the number of people who got on or off the vehicle.

With this data, we can paint a picture for the agency: "This is what people were doing on your network at any given time.” The planners at that agency can use that to understand which routes are underserved because they have particularly high bus loads. A load refers to the number of people in the vehicle at any given time. 

What we show these planners in agencies is, for example, that a vehicle running a particular route has a really high load, whereas another vehicle running a different route has a low load but runs at a higher frequency. This is a good example of a case where agencies may want to optimize their network to best utilize their resources.

When users log into the Hopthru portal, they have a complete system view of the data from the last four weeks. They can see their average daily boardings and drill into specific routes. A given route will have a collection of trips (an instance where a vehicle ran that route), each route’s number of stops, its schedule, the number of people who got on and off the vehicles, and at which stop, among other information. 

Hopthru's dashboard page, showing a map with routes and stats in bar graphs, such as average daily boardings by day and period
Hopthru's portal

Part of this data is visually demonstrated on the application’s map with different colors, for example, darker colors, where most of the traffic is. Users can also see data like total boardings and the average load and make comparisons to the previous period to see the delta between the two (the percentage change). This allows agencies to see how their network has changed over time and historically analyze changes in their schedule or other variables.

The Data Pipeline

The data ingestion process is simply due to the nature of the industry. Transit agencies, as a whole, don’t usually have large IT teams, which means that our data ingesting pattern tends to be pretty bespoke. 

However, the most common approach is to install software that runs on their network and query their databases. For example, their vehicles will run their day, return to the barn at night, and connect to the Wi-Fi. They will upload all the data they have on their board computer, which will go into some database. We then query that database and upload it to Hopthru. This approach certainly has variations, but this is the most common one.

We ingest this data into what we call our boardings tables, which is where Timescale comes into play. This is our primary hypertable storing very classic time-series data, and it’s about 1 TB.

Choosing (and Using!) TimescaleDB

I joined Hopthru after Timescale became part of the stack. Before, the team stored the time-series data in a straight PostgreSQL table (Heroku Postgres) and had issues with data aggregations that were incredibly slow. Migrating to Timescale was pretty straightforward.

“Timescale immediately offered a performance boost for these [time-series] queries that are run over quite large ranges.”


The fact that we could run plain old PostgreSQL queries on top of the time-series data was probably the critical factor in selecting Timescale. For us, it was purely a performance thing. Our stack comprised Python, Django, PostgreSQL, and a little bit of Redis as a cache layer. With a small team that was not the most aware of how PostgreSQL and Django queries work under the covers, Timescale immediately offered a performance boost for these queries that are run over quite large ranges. 

Two of the top features we use are hypertables and continuous aggregates. We don’t have hard numbers, but implementing continuous aggregates reduced many mission-critical queries from minutes to seconds.

We have two hypertables and five continuous aggregates that sit on top of the boardings table and are focused on different ways customers will typically query it. Our customers can, for example, query an individual trip, and those continuous aggregates are structured so that they narrow the scope to the most commonly used sets of filters.

SELECT
   sum(hopthru_ca_boarding_routes_trips_time_ranges.ons),
   sum(hopthru_ca_boarding_routes_trips_time_ranges.offs),
   max(hopthru_ca_boarding_routes_trips_time_ranges.maxload),
   sum(hopthru_ca_boarding_routes_trips_time_ranges.load) / sum(hopthru_ca_boarding_routes_trips_time_ranges.load_count) AS avg_load,
   sum(hopthru_ca_boarding_routes_trips_time_ranges.passenger_distance_travelled),
   count(DISTINCT hopthru_ca_boarding_routes_trips_time_ranges.service_date) AS total_days
FROM
 hopthru_ca_boarding_routes_trips_time_ranges
 JOIN hopthru_routemaster ON hopthru_ca_boarding_routes_trips_time_ranges.route_master_id = hopthru_routemaster.id
WHERE
 (
   (
     NOT hopthru_routemaster.hidden
     AND hopthru_ca_boarding_routes_trips_time_ranges.trip_master_id IS NOT NULL
   )
   OR hopthru_routemaster.non_gtfs
 )
 AND hopthru_ca_boarding_routes_trips_time_ranges.route_master_id IN ($1 :: uuid)
 AND hopthru_ca_boarding_routes_trips_time_ranges.service_date >= $2 :: date
 AND hopthru_ca_boarding_routes_trips_time_ranges.service_date <= $3 :: date
 AND hopthru_ca_boarding_routes_trips_time_ranges.agency_id = $4 :: uuid
 AND hopthru_ca_boarding_routes_trips_time_ranges.stop_hour >= $5 :: time
 AND hopthru_ca_boarding_routes_trips_time_ranges.stop_hour < $6 :: time;


So, we have one for routes and another for trips. The stop-level data tends to be the most intensive in terms of query latency because there’s no hierarchy to a stop, and they can be part of several routes. A more extensive network can have thousands and thousands of stops, which is why we use a continuous aggregate.

In terms of Timescale’s native features, those are the ones that we certainly use the most. We don’t make heavy use of compression because our current schema doesn’t work very well with it, but that’s on my to-do list. 

“[...] We also use compression and tiered storage to keep our storage costs way down. Tiered storage is incredibly effective.”

We actually have a second Timescale database for ingesting data from a different pipeline where customers upload data from their network to ours. We also run a second service for a very specific use case for agencies that have a style of sensor that can communicate directly to the Internet. We operate an MQTT broker that ingests data into a standalone database, which also runs on top of a hypertable. Then, our main pipeline queries that system to get the data, acting as an intermediary. 

On that particular database, which is a separate instance altogether, we made use of more features because we built it with Timescale in mind, so I was able to structure the schema to be more effective. We have a single hypertable there, but we also use compression and tiered storage to keep our storage costs way down. Tiered storage is incredibly effective.

Editor’s Note: You can now change your compression settings on the fly.

In this database, we have about 5 GB of data: 2 GB of uncompressed data, another 2 GB of compressed data, and ~1 GB in the low-cost storage tier. Our access pattern here is such that we really only query data that came in the last month, so compression and tiered storage are super helpful because we don't have a data archival strategy. Also, a customer, as it currently stands, could request data from the start of time.

I’d say we leaned more on the Timescale Support team to help optimize queries where we’ve seen poor performance. We work with them to understand what the database is doing under the covers to improve the performance. I’ve actually used Insights a lot over the last three months, mostly looking at execution time.

“In the 12 years of my startup career, I have almost exclusively worked with PostgreSQL. Having something like Timescale in the cloud allows us to focus on building the application rather than building the infrastructure. Because we're such a small team, we don't have the resources to spend time managing services.”

I look at the union of execution time and P95 time to identify the poorest performers. But, given the nature of our analytics platform, that doesn’t necessarily help me narrow down how the user got to that case, right? So, manually marrying the Timescale Insights data with New Relic data, which has more of that user path, makes it clear how to reproduce the problematic case and drill into it.

I have been very pleased working with Timescale. In the 12 years of my startup career, I have almost exclusively worked with PostgreSQL. Having something like Timescale in the cloud allows us to focus on building the application rather than building the infrastructure. Because we're such a small team, we don't have the resources to spend time managing services.  

I have no experience with other time-series data providers, so I couldn't effectively recommend one or another. But I can say that our experience has been fantastic, both in terms of the support and the low-touch nature of the cloud offering.

The Hopthru Roadmap

The main thing we are focused on for our roadmap is building out a product we call Hopthru Cleanse, where we take more of the raw APC data.

So, typically, the APC data we get from agencies will be pre-correlated to route, trip, and stop. Our product’s focus right now is to sort of do away with that pre-correlated data and correlate it ourselves because a lot of the data we get from agencies from their third-party vendors does not meet the thresholds required to get government funding. So, the thresholds they must meet to be NTD-certified (NTD is the National Transit Database) imply that the data needs to be 95 % accurate within a manually sampled set of trips. 

And the rules are different based on the organization's size: a smaller agency will have to sample 15 trips, and then you need to be accurate within what we call unlinked passenger trips.

But a larger organization, like San Diego or Carlsbad in California, has a higher threshold to get the NTD funding, so they need to be accurate. I think they need to sample 50 trips and be accurate within 95 % of both the ons and offs of the unlinked passenger trips. Also, the passenger miles traveled need to be 95 % or better accurate. 

What we have found is that the current solutions for collecting and collating that data are not able to get these agencies NTD-certified. So that's where we're spending our resources: ingesting that raw data almost directly from the sensors, so to speak, and working to get agencies certified.

Advice and Resources

We've been invested in looking at new features on Timescale. For example, we worked with the Support team to identify the best approach for tiered storage. And then, clearly coupling that with reading the documentation, which is pretty comprehensive and something I have definitely leaned on.

We’d like to thank Brock and all of the folks at Hopthru for sharing their story of how they power real-time dashboards to help government transit agencies make informed decisions about their networks.

We’re always keen to feature new community projects and stories on our blog. If you have a story or project you’d like to share, reach out on Slack (@Ana Tavares), and we’ll go from there.

Originally posted

May 15, 2024

Last updated

Dec 01, 2024

Share

Subscribe to the Timescale Newsletter

By submitting you acknowledge Timescale's Privacy Policy.