PostgreSQL

Jan 28, 2025

Hacking NFL Data With PostgreSQL, TimescaleDB, and SQL

Hacking NFL Data With PostgreSQL, TimescaleDB, and SQL

Learn how to use time-series data provided by the NFL to uncover valuable insights into many player performance metrics—and ways to apply the same methods to improve your fantasy league team, your knowledge of the game, or your viewing experience—all with PostgreSQL, standard SQL, and freely available extensions.

Time-series data is everywhere, including, much to our surprise, the world of professional sports. At Timescale, we're always looking for fun ways to showcase the expanding reach of time-series data. Stock, cryptocurrency, IoT, and infrastructure metrics data are relatively common and widely understood time-series data scenarios. Head to Twitter on any given day, search for #timeseries or #TimescaleDB, and you're sure to find questions about high-frequency trading or massive-scale observability data with tools like Prometheus.

You can imagine our excitement, then, when we happened upon the NFL Big Data Bowl, an annual competition that encourages the data science community to use historical player position and play data to create machine learning models.

Did the NFL really give access to 18+ million rows of detailed play data from every regular season NFL game?

via GIPHY

For background, the National Football League (NFL) is the US professional sports league for American football, and the NFL season is followed by tens of millions of people, culminating in the annual Super Bowl (which attracts 100M+ global viewers, whether for the game or for the commercials).

Each NFL game takes place as a series of “plays,” in which the two teams try to score and prevent the other team from scoring. There are approximately 200 plays per game, with up to 15 games a week during the regular season. A healthy amount of data, but nothing unmanageable.

So, at first glance, football game metrics might not immediately jump out as anything special.

But then the NFL did something pretty ambitious and amazing.

All NFL players are equipped with RFID chips that track players’ position, speed, and various other metrics, which teams use to identify trends, mitigate risks, and continuously optimize. The NFL started tracking and storing data for every player on the field, for every play, for every game.

As a result, we now have access to a very detailed analysis of exactly how a play unfolded, how quickly various players accelerated during each play, and the play’s outcome. A traditional view of play-by-play metrics is “down and distance” and the result of the play (yards gained, whether or not there was a score, and so on). With the NFL’s dataset, we're able to mine approximately 100 data points at 100-millisecond intervals throughout the play to see speed, distance, involved players, and much more.

This isn’t ordinary data. This is time-series data. Time-series data is a sequence of data points collected over time intervals, giving us the ability to track changes over time. In the case of the NFL’s dataset, we have time-series data that represents how a play changes, including the locations of the players on the field, the location of the ball, the relative acceleration of players in the field of play, and so much more.

Time-series data comes at you fast, sometimes generating millions of data points per second (read more about time-series data). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built TimescaleDB, a petabyte-scale, relational database for time series.

We couldn't pass up the opportunity to look at the NFL dataset with TimescaleDB, exploring ways we could peer deeper into player performance in hopes of providing insights about overall player performance in the coming season.

Read on for more information about the NFL’s dataset and how you can start using it, plus some sample queries to jumpstart your analysis. They may help you get more enjoyment out of the game.

If you’d like to get started with NFL data, you can spin up a fully managed TimescaleDB service: create an account to try it for free for 30 days. The instructions later in this post will take you through how to ingest the data and start using it for analysis.

If you’re new to time-series data or just have some questions you’d like to ask about the dataset, join our public Slack community, where you’ll find Timescale team members and thousands of time-series enthusiasts, and we’ll be happy to help you.

The NFL Time Series Dataset


Over the last few years, the NFL and Kaggle have collaborated on the NFL Big Data Bowl. The goal is to use historical data to answer a predetermined genre of questions, typically producing a machine learning model that can help predict the outcome of certain plays during regular season games.

Although the 2020/2021 contest is over, the sample dataset they provided from a prior season is still available for download and analysis. The 2020/2021 competition focused on pass-play defense efficiency; therefore, only the tracking data for offensive and defensive "playmakers" is available in the dataset. No offensive or defensive linemen data is included. (You can read more about last year’s winners.)

(Keep watching the NFL website for more information on the next Big Data Bowl.)

Accessing the Data


For the purposes of this blog post and accompanying tutorial, we will use the sample data provided by the NFL. This data is from the 2018 NFL season and is available as CSV files, including game-specific data and week-by-week tracking data for each player involved in the "offensive" part of the pass play. Contest participants in the next season of the contest will have access to new weekly game data.

This data is also very relational in nature, which means that SQL is a great medium to start gleaning value – without the need for Jupyter notebooks, other data science specific languages (like Python or R), or additional toolsets.

If you want to follow along - or recreate! - the queries we go through below, follow our tutorial to set up the tables, ingest data, and start analyzing data in TimescaleDB. For those unfamiliar with TimescaleDB, it’s built on PostgreSQL, so you’ll find that all of our queries are standard SQL. If you know SQL, you’ll know how to do everything here. (Some of the more advanced query examples we provide require our new, advanced hyperfunctions, which come pre-installed with any Timescale instance.)

Let's Start Exploring Time-Series Insights!


We've provided the steps needed to ingest the dataset into TimescaleDB in the accompanying tutorial, so we won’t go into that here.

The NFL dataset includes the following data:

  • Games: all relevant data about each game of the regular season, including date, teams, time, and location
  • Players: information on each player, including what team they play for and their originating college
  • Plays: a wealth of data about each pass play in the game. Helpful fields include the down, description of the play that happened, line of scrimmage, and total offensive yardage, among other details.
  • Week [1-17]: for each week of the season, the NFL provides a new CSV file with the tracking data of every player, for every play (pass plays for this data). Interesting fields include X/Y position data (relative to the football field) every few hundred milliseconds throughout each play, player acceleration, and the "type" of a route that was taken. (In our tutorial, this data is imported into the tracking table and totals almost 20 million rows of time-series data.)

In addition to the NFL dataset, we also provide some extra data from Wikipedia that includes game scores and stadium conditions for each game, which you can load as part of the tutorial. With other time-series databases, it can be difficult to combine your time-series data with any other data you may have on hand (see our TimescaleDB vs. InfluxDB comparison for reference).

Because TimescaleDB is PostgreSQL with time-series superpowers, it supports JOINS, so any extra relational data you want to add for deeper analysis is just a SQL query away. In our case, we’re able to combine the NFL’s play-by-play data along with weather data for each stadium.

Once you have the data ready, the world of NFL playmakers is at your fingertips, so let’s get started!

The Power of SQL

Year after year, we see SQL listed as one of the most popular languages among developers on the Stack Overflow survey. Sometimes, however, we can be lured into thinking that the only way to gain insights from relational data is to query it with powerful data analytics tools and languages, create data frames, and use specialized regression algorithms before we can do anything productive.

SQL, it often feels, is only useful for getting and storing data in applications and that we need to leave the "heavy lifting" of analysis to more mature tools.

Not so! SQL can data munge with the best of them! Let's look at a first, quick example.

Average yards per position, per game

For this first example, we'll query the tracking table (the player movement data from all 17 weeks of games) and join to the game table to determine the number of yards per player position, per game.

The results give you a quick overview of how many yards different positions ran throughout each game. You could use this later to compare specific players to see how they compared, more or less yards, to that total.

WITH total_position_yards AS (
	SELECT sum(dis) position_yards, POSITION, gameid FROM tracking t 
	GROUP BY POSITION, gameid)
SELECT avg(position_yards), position, game_date
FROM game g
INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid
WHERE POSITION IN ('QB','RB','WR','TE')
GROUP BY game_date, POSITION;


Number of plays by offensive player

As a season progresses and players get injured (or traded), it's helpful to know which of the available players have more playing experience, rather than those that have been sitting on the sideline for most of the season. Players with more playing time are often able to contribute to the outcome of the game.

This query finds all players that were on the offense for any play and counts how many total passing plays they have been a part of, ordered by total passing plays descending.

WITH snap_events AS (
-- Create a table that filters the play events to show only snap plays
-- and display the players team information
 SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
   CASE
     WHEN t.team = 'away' THEN g.visitor_team
     WHEN t.team = 'home' THEN g.home_team
     ELSE NULL
     END AS team_name
 FROM tracking t
 LEFT JOIN game g ON t.gameid = g.game_id
 WHERE t.event IN ('snap_direct','ball_snap')
)
-- Count these events & filter results to only display data when the player was
-- on the offensive
SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name
FROM snap_events a
LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
LEFT JOIN player pl ON a.player_id = pl.player_id
WHERE a.team_name = p.possessionteam
GROUP BY a.player_id, pl.display_name, a.team_name
ORDER BY play_count DESC;
player_iddisplay_nameplay_countteam_name
2506109Ben Roethlisberger725PIT
2558149JuJu Smith-Schuster691PIT
2533031Andrew Luck683IND
2508061Antonio Brown679PIT
310Matt Ryan659ATL
2506363Aaron Rodgers656GB
2505996Eli Manning639NYG
2543495Davante Adams630GB
2540158Zach Ertz629PHI
2532820Kirk Cousins621MIN
79860Matthew Stafford619DET
2504211Tom Brady613NE

If you’re familiar with American football, you might know that players are substituted in and out of the game based on game conditions. Stronger, larger players may play in some situations, while faster, more agile players may play in others.

Quarterbacks, however, are the most “important” players on the field, and tend to play more than others. However, by omitting quarterbacks, we can get a deeper insight into players across all other positions.

WITH snap_events AS (
-- Create a table that filters the play events to show only snap plays
-- and display the players team information
 SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
   CASE
     WHEN t.team = 'away' THEN g.visitor_team
     WHEN t.team = 'home' THEN g.home_team
     ELSE NULL
     END AS team_name
 FROM tracking t
 LEFT JOIN game g ON t.gameid = g.game_id
 WHERE t.event IN ('snap_direct','ball_snap')
)
-- Count these events & filter results to only display data when the player was
-- on the offensive
SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name, pl."position"
FROM snap_events a
LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
LEFT JOIN player pl ON a.player_id = pl.player_id
WHERE a.team_name = p.possessionteam AND pl."position" != 'QB'
GROUP BY a.player_id, pl.display_name, a.team_name, pl."position"
ORDER BY play_count DESC;

So, now we can see the non-quarterbacks who are on offense the most in a season:

player_iddisplay_nameplay_countteam_nameposition
2558149JuJu Smith-Schuster691PITWR
2508061Antonio Brown679PITWR
2543495Davante Adams630GBWR
2540158Zach Ertz629PHITE
2541785Adam Thielen612MINWR
2543468Mike Evans610TBWR
2555295Sterling Shepard610NYGWR
2540169Robert Woods604LAWR
2552600Nelson Agholor604PHIWR
2543488Jarvis Landry592CLEWR
2540165DeAndre Hopkins587HOUWR
2543498Brandin Cooks581LAWR

Sack percentage by quarterback on passing plays

We can start to go a little deeper by extracting specific data from the tracking table and layering queries on top of it to make correlations. One piece of information that might be helpful in your analysis is knowing which quarterbacks are sacked most often during passing plays. In football, a “sack” is a negative play for the offense, and quarterbacks who get sacked more often tend to be lower performers overall.

Once you know those players, you could expand your analysis to see if they are sacked more on specific types of plays (shotgun formation) or maybe if sacks occur more often in a specific quarter of the game (maybe the fourth quarter because the offensive line is more tired, or the team tends to be behind late in games and must pass more often).

Queries like this can quickly show you quarterbacks that are more likely to get sacked, particularly when they play a strong defensive team.

To get started, we wanted to find the sack percentage of each quarterback based on the total number of pass plays they were involved in during the regular season. To do that we approached the tracking data by layering on Common Table Expressions so that each query could build upon previous results.

First, we select the distinct list of all plays, for each quarterback (qb_plays). The reason we do a SELECT DISTINCT… is because the tracking table holds multiple entries for each player, for each play. We just need one row for each play, for each quarterback.

With this result, we can then count the number of total plays per quarterback (total_qb_plays), the total number of games each quarterback played (qb_games) and then finally the number of pass plays the quarterback was a part of that resulted in a sack (sacks).

With that data in hand, we can finally query all of the values, do a percentage calculation, and order it by the total sack count.

WITH qb_plays AS (
	SELECT DISTINCT ON (POSITION, playid, gameid) POSITION, playid, player_id, gameid 
	FROM tracking t 
	WHERE POSITION = 'QB'
),
total_qb_plays AS (
	SELECT count(*) play_count, player_id FROM qb_plays
	GROUP BY player_id
),
qb_games AS (
	SELECT count(DISTINCT gameid) game_count, player_id FROM qb_plays 
	GROUP BY player_id
),
sacks AS (
	SELECT count(*) sack_count, player_id 
	FROM play p
	INNER JOIN qb_plays ON p.gameid = qb_plays.gameid AND p.playid = qb_plays.playid
	WHERE p.passresult = 'S'
	GROUP BY player_id
)
SELECT play_count, game_count, sack_count, (sack_count/play_count::float)*100 sack_percentage, display_name FROM total_qb_plays tqp
INNER JOIN qb_games qg ON tqp.player_id = qg.player_id
LEFT JOIN sacks s ON s.player_id = qg.player_id
INNER JOIN player ON tqp.player_id = player.player_id
ORDER BY sack_count DESC NULLS last;

If you're an ardent football fan, the results from 2018 probably don't surprise you.

play_countgame_countsack_countsack_percentagedisplay_name
579166511.23Deshaun Watson
60216559.14Dak Prescott
61116538.67Derek Carr
65616497.47Aaron Rodgers
462154810.39Russell Wilson
63916477.36Eli Manning
448144510.04Josh Rosen
65916436.53Matt Ryan
386144311.14Marcus Mariota
61916416.62Matthew Stafford
62115386.12Kirk Cousins
324113711.42Ryan Tannehill
44711368.05Carson Wentz

Of course, there are a few quarterbacks that always seem to have a way of avoiding a sack.

play_countgame_countsack_countsack_percentagedisplay_name
72516253.45Ben Roethlisberger
68216223.23Andrew Luck
61316213.43Tom Brady

Now, let’s try some more “advanced” queries and analyses.

Faster Insights With PostgreSQL and TimescaleDB

So far, the queries we've shown are interesting and help provide insights to various players throughout the season – but if you were looking closely, they're all regular SQL statements.

Examining a season of NFL tracking data isn't like typical time-series data, however. Most of the queries we want to perform need to examine all 20 million rows in some way.

This is where a tool that's been built for time-series analysis, even when the data isn't typical time-series data, can significantly improve your ability to examine the data and save money at the same time.

Faster Queries With TimescaleDB Continuous Aggregates

We noticed that we often needed to build queries that started with the tracking table, filtering data by specific players, positions, and games. Part of the reason is that the play table doesn't list all of the players who were involved in a particular play. As a result, we need to cross-reference the tracking table to identify the players who were involved in any given play.

The first example query we demonstrated - “average yards per position, per game” - is a good example of this. The query begins by summing all yards, by position, for each game.

This means that every row in tracking has to be read and aggregated before we can do any other analysis. Scanning those 20 million rows is pretty boring, repetitive, and slow work – especially compared to the analysis we want to do!

On our small test instance, the "average yards" query takes about 8 seconds to run. We could increase the size of the instance (which will cost us more money), or we could be smarter about how we query the data (which will cost us more time).

Instead, we can use continuous aggregates to pre-aggregate the data we're querying over and over again, which reduces the amount of work TimescaleDB needs to do every time we run the query. (Continuous aggregates are like PostgreSQL materialized views. For more info, check out our continuous aggregates docs.)

CREATE MATERIALIZED VIEW player_yards_by_game_
WITH (timescaledb.continuous) AS
SELECT player_id, position, gameid,
 time_bucket(INTERVAL '1 day', "time") AS bucket,
 SUM(dis) AS yards
FROM tracking t
GROUP BY player_id, position, gameid, bucket;

After running this query and creating a continuous aggregate, we can modify that first query just slightly, using this as our basis table.

WITH total_position_yards AS (
	SELECT sum(yards) position_yards, POSITION, gameid 
FROM player_yards_by_game t 
	GROUP BY POSITION, gameid)
SELECT avg(position_yards), position, game_date
FROM game g
INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid
WHERE POSITION IN ('QB','RB','WR','TE')
GROUP BY game_date, POSITION
ORDER BY game_date, position;

We get the same result, but now the query runs in 100ms - 800x faster!

Advanced SQL Data Analysis With TimescaleDB Hyperfunctions

Finally, the more we dug into the data, the more and more we found we needed (or wanted) functions specifically tuned for time-series data analysis to answer the types of questions we wanted to ask.

It is for this kind of analysis that we built TimescaleDB hyperfunctions, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code.

Grouping data into percentiles

The NFL dataset is a great use case for percentiles. Being able to quickly find players that perform better or worse than some cohort is really powerful.

As an example, we'll use the same continuous aggregate we created earlier (total yards, per game, per player) to find the median total yards traveled by position for each game.

WITH sum_yards AS (
--Add position to the table to allow for grouping by it later
 SELECT a.player_id, display_name, SUM(yards) AS yards, p.position, gameid
 FROM player_yards_by_game a
 LEFT JOIN player p ON a.player_id = p.player_id
 GROUP BY a.player_id, display_name, p.position, gameid
)
--Find the mean and median for each position type
SELECT position, mean(percentile_agg(yards)) AS mean_yards, approx_percentile(0.5, percentile_agg(yards)) AS median_yards
FROM sum_yards
WHERE POSITION IS NOT null
GROUP BY position
ORDER BY mean_yards DESC;
positionmean_yardsmedian_yards
FS595.583433048431626.388099960848
CB572.3336749867212592.2175990890378
WR552.6508570179277555.5030569048633
S530.6436781609186550.5961518474892
SS522.5604103343453551.1296628916651
MLB462.70229007633407490.77906906009343
ILB402.7882871125599403.3779668359464
OLB393.40014271151847390.6742117791442
QB334.7025466893028352.1192705472368
LB328.9812527472519257.72003396053884
TE327.9515596330271257.72003396053884

Finding extreme outliers

Finally, we can build upon this percentile query to find players at each position that run more than 95% of all other players at that position. For some positions, like wide receiver or free safety, this could help us find the “outlier” players that are able to travel the field consistently throughout a game – and make plays!

WITH sum_yards AS (
--Add position to the table to allow for grouping by it later
 SELECT a.player_id, display_name, SUM(yards) AS yards, p.position
 FROM player_yards_by_game a
 LEFT JOIN player p ON a.player_id = p.player_id
 GROUP BY a.player_id, display_name, p.position
),
position_percentile AS (
	SELECT POSITION, approx_percentile(0.95, percentile_agg(yards)) AS p95
	FROM sum_yards 
	GROUP BY position
)
SELECT a.POSITION, a.display_name, yards, p95
	FROM sum_yards a
	LEFT JOIN position_percentile pp ON a.POSITION = pp.position
	WHERE yards >= p95
AND a.POSITION IN ('WR','FS','QB','TE')
ORDER BY position;
positiondisplay_nameyardsp95
FSEric Weddle13869.75999999999712320.288323166456
FSAdrian Amos12989.43999999996612320.288323166456
FSTyrann Mathieu12565.21999999995612320.288323166456
QBAaron Rodgers7422.359999999956667.51452813257
QBPatrick Mahomes6985.9899999999526667.51452813257
QBMatt Ryan6759.9599999999696667.51452813257
TEZach Ertz13124.5899999999510667.986199523099
TEJimmy Graham12693.67999999998210667.986199523099
TETravis Kelce12218.12999999995710667.986199523099
TEDavid Njoku11502.15999999996510667.986199523099
TEGeorge Kittle11058.09999999997510667.986199523099
TEKyle Rudolph10761.94999999996810667.986199523099
TEJared Cook10678.2299999999810667.986199523099
WRAntonio Brown16877.55999999996514271.23409723974
WRBrandin Cooks15510.0199999999514271.23409723974
WRJuJu Smith-Schuster15492.7699999999614271.23409723974
WRRobert Woods15253.17999999995814271.23409723974
WRNelson Agholor15180.3299999999714271.23409723974
WRTyreek Hill15106.60999999997314271.23409723974
WRZay Jones14790.58999999996714271.23409723974
WRSterling Shepard14673.7999999999614271.23409723974
WRMike Evans14620.12999999998314271.23409723974
WRDavante Adams14574.50999999995114271.23409723974
WRKenny Golladay14354.49999999997314271.23409723974
WRJarvis Landry14281.50999999997114271.23409723974

Where Can the Data Take You?

As you’ve seen in this example, time-series data is everywhere. Being able to harness it gives you a huge advantage, whether you’re working on a professional solution or a personal project.

We’ve shown you a few ways that time-series queries can unlock interesting insights, give you a greater appreciation for the game and its players, and (hopefully) inspired you to dig into the data yourself.

To get started with the NFL data:

  • Spin up a fully managed TimescaleDB service: create an account to try it for free for 30 days.
  • Follow our complete tutorial for step-by-step instructions for preparing and ingesting the dataset, along with several more queries to help you glean insights from the dataset.

If you’re new to time-series data or just have some questions about how to use TimescaleDB to analyze the NFL’s dataset, join our public Slack community. You’ll find Timescale engineers and thousands of time-series enthusiasts from around the world – and we’ll be happy to help you.

🙏 We’d like to thank the NFL for making this data available and the millions of passionate fans around the world who make the NFL such an exciting game to watch.

And, Geaux Saints 🏈!

Originally posted

Jul 27, 2021

Last updated

Jan 28, 2025

Share

Subscribe to the Timescale Newsletter

By submitting you acknowledge Timescale's Privacy Policy.