Brian Kane March 22, 2021
Snowflake recently launched the Data Marketplace Challenge, a competition to answer a data-driven challenge question by leveraging third-party data made available through Snowflake. At SeekWell, we love Snowflake and also love exploring data, so it was a no-brainer to submit an entry. Submissions are now closed, but the four questions you could choose from were in the categories of Financial Services, Healthcare, Media/Entertainment, and Retail/CPG. I chose the media question, which was based on the ThoughtSpot Fantasy Football dataset.
The question I tried to answer is the following:
Which professional football quarterback is least likely to throw an interception during the 2021 regular season and why?
The dataset provided by Snowflake to answer this question contains information on every play in the 2020 season—like whether it was a sack, completion, interception, etc. To answer the challenge question I first broke it up into two parts:
1) For each NFL quarterback, what is the chance of throwing an interception on any given pass?
2) For each NFL quarterback, how many passes are they expected to throw in the 2021 season?
Once I have these two data points for every quarterback, I can calculate their chance of throwing at least one interception (p) with a fairly simple formula:
p = 1 - (1 - interceptionRate)^n
In this formula, interceptionRate is their chance of throwing an interception on a given pass, and n is the (expected) number of passes that QB will make. Say their expected interception rate is 2% and they are expected to make 300 passes next season. Then p evaluates to a 99.77% chance of throwing at least one interception (1 - (.98)^300).
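As a quick check of that arithmetic, here is a minimal Python sketch of the same calculation. The function name is my own, and it assumes every pass is an independent event with the same interception rate, which is the simplification the formula itself makes.

```python
def chance_of_interception(interception_rate: float, passes: float) -> float:
    """Probability of at least one interception across `passes` independent passes."""
    return 1.0 - (1.0 - interception_rate) ** passes


# Worked example from the text: a 2% rate over 300 expected passes.
p = chance_of_interception(0.02, 300)
print(f"{p:.2%}")  # 99.77%
```

Because the exponent grows with every pass, even a very low per-pass rate pushes p toward 100% for anyone who plays a full season.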
The quarterback with the lowest value of p (chance of throwing at least one interception) is the answer to the question. Others may approach the question differently, but this is how I decided to go about it. Now to calculate those two values for all 131 quarterbacks in the dataset to get my answer!
Expected interception rate for each NFL quarterback
The interception rate for each quarterback with a significant number of passes (I'll use a minimum of 30 passes as the cutoff) is simple to calculate: the number of interceptions divided by the number of passes.
A query for this rate for each quarterback in the ThoughtSpot dataset would look like:
-- Aggregate every passing play in 2020 per quarterback
with passing as (
    select
        r.team,
        r.full_name,
        r.status,
        r.gsis_id,
        -- a dropback counts as an attempt unless it ended in a sack or lost fumble
        sum(case when p.sack < 1 and p.fumble_lost < 1 then 1 else 0 end) as attempts,
        count(p.*) as dropbacks,
        sum(p.sack) as sacks,
        sum(p.interception) as interceptions,
        sum(p.touchdown) as touchdowns,
        sum(p.incomplete_pass) as incompletions,
        sum(p.yards_gained) as yards_gained,
        sum(p.fumble_lost) as fumbles_lost
    from nfl2020.roster as r
    inner join nfl2020.pbp_passing as p
        on r.gsis_id = p.gsis_id
    where r.position = 'QB'
    group by 1, 2, 3, 4
)
select
    team,
    full_name,
    attempts,
    interceptions / attempts as interception_rate
from passing
where attempts > 30
And results would look like this:
It's no surprise that Aaron Rodgers and Patrick Mahomes have the lowest interception rates among QBs who played the whole season. The only issue here is that three quarterbacks threw no interceptions despite throwing over 30 passes, giving them an interception rate of 0%. This throws a wrench into the analysis, since a 0% rate predicts zero interceptions no matter how many passes they throw, which is clearly unrealistic.
Without diving too deep down the Bayesian rabbit hole, a simple way to normalize the interception rate for these QBs is to average their interception rate with the overall league-wide interception rate (.022), weighted by the proportion of average season attempts (561) they took. If they took 140 attempts (about 1/4 of the average attempts for a quarterback in the season), then their own interception rate gets a weight of 1/4 and the league-average rate gets the other 3/4. This gives roughly a 2% interception rate for Chad Henne and Ben DiNucci, and a 1.8% interception rate for C.J. Beathard. Still better than average, but not zero.
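The weighted average described above can be sketched in a few lines of Python. The league rate (.022) and average attempts (561) come from the text. The function name and the cap on the weight at 1.0 (for anyone who throws more than 561 passes) are assumptions of this sketch, not something the analysis spells out.

```python
LEAGUE_RATE = 0.022   # league-wide interception rate quoted in the text
AVG_ATTEMPTS = 561    # average season attempts quoted in the text


def blended_rate(interceptions: int, attempts: int) -> float:
    """Blend a QB's own rate with the league rate, weighted by share of a full season."""
    # Assumption: cap the weight at 1.0 so heavy passers keep their own rate.
    weight = min(attempts / AVG_ATTEMPTS, 1.0)
    own_rate = interceptions / attempts
    return weight * own_rate + (1.0 - weight) * LEAGUE_RATE


# Hypothetical QB: zero interceptions on 140 attempts (about 1/4 of 561).
print(round(blended_rate(0, 140), 4))  # 0.0165
```

The fewer passes a quarterback threw, the closer the blended rate sits to the league average, which is why a 0% raw rate on a small sample ends up near 2%.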
Now that we've got a non-zero interception rate for every quarterback with at least 30 passes thrown, it's time to estimate the interception rate for everyone else. This is tricky since we don't have as much data on the other QBs, and intuitively, they should have a worse interception rate than the QBs who played more, simply because better QBs tend to play more.
The QBs throwing < 30 passes had an average interception rate of 3.5%, so I'll start there. To simplify further, I'll lump all the QBs who never played with the QBs who threw fewer than 30 attempts. Maybe the QBs who never played should actually have a worse interception rate than the QBs who threw up to 29 passes, but I'll give them the benefit of the doubt.
Since these QBs don't have enough passes for their own interception rate to be a useful predictor, I looked at other data points in the ThoughtSpot dataset like age, weight, and height. Height seemed most predictive of interceptions, so I ran a linear regression on height (in inches) and interception rate and found a slope of -.00092. This means that a quarterback who is 6'0" will have an interception rate about .55 percentage points higher (6 × .00092) than a quarterback who is 6'6". That's not crazy (we've seen successful short quarterbacks before), but height clearly has an effect on interception rate, so this seems reasonable.
To do this, I used the REGR_SLOPE() and REGR_INTERCEPT() functions in Snowflake. The full query for running the regression is below (after storing the interception rates in a CTE called "over_30_interception_rates"):
-- Parse 'feet-inches' height strings into a single numeric height per QB
with player_heights as (
    select
        split_part(height, '-', 1)::int as feet,
        split_part(height, '-', 2)::int / 12 as inches,
        split_part(height, '-', 1)::int + split_part(height, '-', 2)::int / 12 as height,
        r.full_name,
        r.team,
        r.gsis_id
    from nfl2020.roster as r
    where r.position = 'QB'
),
-- Pair each qualifying QB's height with their interception rate
height_interception_rate as (
    select
        r.height,
        ps.interception_rate as interception_rate,
        r.full_name,
        r.team
    from player_heights as r
    inner join over_30_interception_rates as ps
        on ps.gsis_id = r.gsis_id
)
select
    regr_slope(h.interception_rate, h.height) as height_interception_slope,
    regr_intercept(h.interception_rate, h.height) as height_interception_int
from height_interception_rate as h
Adjusting the intercept so the average for <30 attempt QBs is 3.5% and applying that slope, we get our interception rate for all the <30 attempt QBs. This chart shows the rough relationship between height (left axis) and expected interception rate (right axis).
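For readers without Snowflake handy, here is a rough Python sketch of the two steps: an ordinary least squares fit (which is what REGR_SLOPE() and REGR_INTERCEPT() compute) and the intercept shift that pins the low-attempt group's average at 3.5%. The function names and the sample heights and rates are hypothetical. Only the -.00092 slope and the 3.5% target come from the text.

```python
def regr_slope_intercept(xs, ys):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def recentered_rate(height, group_heights, slope, target_mean=0.035):
    """Apply the slope around the group's mean height so the group averages target_mean."""
    mean_height = sum(group_heights) / len(group_heights)
    return target_mean + slope * (height - mean_height)


# Hypothetical qualifying QBs: heights in inches and interception rates.
slope, intercept = regr_slope_intercept([72, 74, 76, 78], [0.030, 0.028, 0.026, 0.024])

# Hypothetical low-attempt QBs, using the slope quoted in the text.
backups = [72, 74, 76]
rates = [recentered_rate(h, backups, -0.00092) for h in backups]
print([round(r, 5) for r in rates])
```

Shifting the intercept keeps the height effect from the regression while anchoring the group at its observed 3.5% average instead of the lower average of the regular starters.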
Combining that with the interception rate calculated for the rest of the QBs, the chart of all expected interception rates is shown below.
The highest interception rates on the far left all had enough passes last season to qualify and threw a lot of interceptions. Most of the plateau in the middle are players who didn't pass enough last season, so those average around 3.5% and are higher or lower depending on the player's height. It's not perfect, but it's a good approximation given we only have the 2020 regular season data to work with.
Expected number of passes for each quarterback
Now that we have expected interception rate, we need expected number of passes thrown in the 2021 season to complete the equation. A simple way to do this would be to assume everyone will have the same proportion of attempts for their team in 2021 as they did in 2020. But this falsely assumes players who didn't get a chance last season definitely won't get to play this season. It also unfairly "punishes" players who got injured last season, even if there's no reason to believe that injury would persist or lead to other injuries this season (e.g. Dak Prescott, who played less than half of the Cowboys' snaps last season, but will likely play the majority of their snaps this season).
So instead, for all quarterbacks with > 30 attempts, I'm going to predict the likelihood they get injured as well as the likelihood they get benched for performance reasons. Then I'll multiply the complements of those two values (the chance of staying healthy and the chance of keeping the job) to get the proportion of snaps they're likely to take as quarterback. The equation to calculate expected snaps (e) for qualifying quarterbacks, expressed as a proportion of the team's snaps, is as follows:
e = (1 - chanceInjured) × (1 - chanceBenched)
The quarterbacks who didn't have at least 30 attempts will get the rest of the team's snaps for next season, split evenly among them. One issue with this strategy is that it glosses over the fact that "leftover" snaps are more likely to go to the backups higher up on the depth chart. But there's no depth chart ranking available in the dataset so I have to assume all backups who didn't play last season are equally likely to play next.
Now to calculate chance of being injured and chance of being benched for each quarterback!
Predicting chance of being injured
I found that whether a player gets injured (where status = "Injured Reserve" in the Roster table of the dataset) can best be predicted by the number of times they were sacked. Age, hits, and previous injuries also probably play a role but I don't have access to injury data from prior seasons and, well, Tom Brady proves that age isn't all that important anymore.
Predicting injured status with sacks using the REGR_SLOPE() function in Snowflake, you get a .57% higher chance of getting injured for each additional sack, which seems reasonable. Roughly, every 10 sacks makes you about 5.7 percentage points more likely to get injured. To simplify, I'll assume zero sacks means you have a 0% chance of being injured (obviously you can still get injured without getting sacked, but it makes the calculations cleaner).
This chart shows average sacks per attempt for each quarterback, and their resulting chance of getting injured (assuming they play all of their team's pass plays).
Ryan Finley got sacked on 28% of his attempts, which if he kept that up for a full season, would give him an 89% chance of being injured with this formula. Since that seems extreme, I'll max out chance of injury at 50%. Ben Roethlisberger has the lowest sack rate and so only has a 6.8% chance of getting injured.
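Put together, the injury estimate is a straight line through zero with a ceiling. The Python sketch below uses the .57% per sack slope and the 50% cap from the text. Scaling the sack rate up to 561 pass plays (the team attempt count used elsewhere in this post) is my assumption about what "all of their team's pass plays" means, and the function name is hypothetical.

```python
INJURY_PER_SACK = 0.0057    # slope quoted in the text
MAX_INJURY_CHANCE = 0.50    # cap chosen in the text
TEAM_PASS_PLAYS = 561       # assumption: same figure used for team attempts


def injury_chance(sack_rate: float, pass_plays: int = TEAM_PASS_PLAYS) -> float:
    """Chance of injury for a QB who takes every pass play at the given sack rate."""
    expected_sacks = sack_rate * pass_plays
    return min(INJURY_PER_SACK * expected_sacks, MAX_INJURY_CHANCE)


# A 28% sack rate lands well above the cap, so the cap applies.
print(injury_chance(0.28))  # 0.5
```

Under these assumptions the uncapped value for a 28% sack rate comes out just under 90%, in line with the figure quoted for Ryan Finley.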
Predicting chance of being benched for performance
To predict the probability of being benched for each QB, I averaged performance rank for qualifying quarterbacks (attempts>30) using three categories of performance: touchdowns per attempt, yards per attempt, and turnovers per attempt. The query to calculate this from the ThoughtSpot dataset is below.
-- Aggregate every passing play in 2020 per quarterback
with passing as (
    select
        r.team,
        r.full_name,
        r.status,
        r.gsis_id,
        sum(case when p.sack < 1 and p.fumble_lost < 1 then 1 else 0 end) as attempts,
        count(p.*) as dropbacks,
        sum(p.sack) as sacks,
        sum(p.interception) as interceptions,
        sum(p.touchdown) as touchdowns,
        sum(p.incomplete_pass) as incompletions,
        sum(p.yards_gained) as yards_gained,
        sum(p.fumble_lost) as fumbles_lost
    from nfl2020.roster as r
    inner join nfl2020.pbp_passing as p
        on r.gsis_id = p.gsis_id
    where r.position = 'QB'
    group by 1, 2, 3, 4
),
-- Per-attempt performance stats for qualifying, non-injured QBs
passing_stats as (
    select
        p.team,
        p.full_name,
        touchdowns / attempts as td_rate,
        (fumbles_lost + interceptions) / dropbacks as turnover_rate,
        yards_gained / attempts as yards_per_attempt,
        interceptions / dropbacks as interception_rate,
        p.gsis_id,
        attempts
    from passing as p
    inner join nfl2020.roster as r
        on p.gsis_id = r.gsis_id
    where attempts > 30
      and (r.status <> 'Injured Reserve')
),
-- Rank QBs in each category (rank 1 = highest value)
passing_ranks as (
    select
        full_name,
        row_number() over (order by td_rate desc) as td_rank,
        row_number() over (order by turnover_rate desc) as turnover_rank,
        row_number() over (order by yards_per_attempt desc) as yards_per_attempt_rank,
        attempts
    from passing_stats
)
select
    *,
    -- turnover_rank is worst-first, so 53 - turnover_rank flips it to best-first
    (td_rank + (53 - turnover_rank) + yards_per_attempt_rank) / 3 as avg_rank
from passing_ranks
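The ranking logic in that query can be mirrored in plain Python as a rough sketch. The three quarterbacks and their stats below are hypothetical, and `n + 1` stands in for the hard-coded 53 in the query, on the assumption that 53 is the number of qualifying quarterbacks plus one.

```python
def rank_desc(values):
    """1-based rank of each value when sorted highest to lowest (like ROW_NUMBER() ... DESC)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def average_ranks(qbs):
    """Average of TD rank, flipped turnover rank, and yards-per-attempt rank (lower is better)."""
    n = len(qbs)
    td = rank_desc([q["td_rate"] for q in qbs])
    turnovers = rank_desc([q["turnover_rate"] for q in qbs])
    ypa = rank_desc([q["yards_per_attempt"] for q in qbs])
    return {
        q["name"]: (td[i] + (n + 1 - turnovers[i]) + ypa[i]) / 3
        for i, q in enumerate(qbs)
    }


# Hypothetical stats, for illustration only.
sample = [
    {"name": "QB A", "td_rate": 0.08, "turnover_rate": 0.01, "yards_per_attempt": 8.5},
    {"name": "QB B", "td_rate": 0.05, "turnover_rate": 0.02, "yards_per_attempt": 7.0},
    {"name": "QB C", "td_rate": 0.03, "turnover_rate": 0.04, "yards_per_attempt": 6.0},
]
print(average_ranks(sample))
```

Turnovers are ranked worst-first and then flipped so that, in all three categories, a lower number means a better quarterback before the ranks are averaged.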
It's not quite as precise as the QB rating (which unfortunately is not in the dataset), but it's a decent summary of a QB's performance. The highest-ranking QBs using this method are below, with Aaron Rodgers, Deshaun Watson, and Patrick Mahomes coming out on top:
Using REGR_SLOPE() again, I found a slope of -.82 between percentile rank (with 0 being the best) and percent of team snaps taken. In other words, a quarterback ranked at the 10th percentile of all QBs (i.e., better than 90% of QBs) will likely play about 8 percentage points more of the team's offensive plays than a quarterback ranked at the 20th percentile, all else being equal. This seems logical, but clearly isn't perfect since your chance of being benched also depends on how good or bad your backup is. Regardless, I think this is a decent estimate.
The top 20% of starters all get essentially a 0% chance of being benched for performance. Meanwhile, someone who averages close to the bottom 10% of quarterbacks, like Ben DiNucci, has a 65% chance of being benched for performance, and should therefore only expect to play 35% of team snaps in a season if they start. Expected snap proportion and average performance percentile can be seen in the chart below:
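One way to turn those numbers into a function is sketched below in Python. This is an assumption on my part rather than the exact mapping used in the analysis: it takes the .82 slope from the regression, treats the 20th percentile as the point where the chance of being benched reaches zero, and clamps the result between 0 and 1. The regression intercept is not quoted in the text, so the true line may sit slightly differently.

```python
SNAP_SLOPE = 0.82        # magnitude of the slope quoted in the text
SAFE_PERCENTILE = 0.20   # assumption: top 20% get a 0% chance of being benched


def benched_chance(percentile: float) -> float:
    """Chance of being benched, where percentile 0.0 is the best QB and 1.0 the worst."""
    raw = SNAP_SLOPE * (percentile - SAFE_PERCENTILE)
    return min(max(raw, 0.0), 1.0)


for pct in (0.10, 0.50, 1.00):
    print(pct, round(benched_chance(pct), 3))
```

With these assumptions, a quarterback at the very bottom of the rankings gets a chance of being benched close to the 65% quoted for Ben DiNucci, and anyone in the top 20% gets zero.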
Total Expected Snaps
Multiplying each quarterback's chance of staying healthy by their chance of not being benched gives us the expected proportion of snaps for each quarterback who had > 30 attempts last season. To translate the player's expected percent of snaps into the full distribution of the team's snaps, the starting quarterback (which I just assume to be the quarterback with the highest expected snaps) gets their full expected snaps, whereas a qualifying backup will get their expected proportion of snaps multiplied by the remaining proportion of snaps. Then, the rest of the backups get the rest of the snaps split between them.
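Here is a rough Python sketch of that allocation for a single team. The function name, player names, and proportions are hypothetical. Handling more than one qualifying backup by repeating the "proportion of what remains" step in order is my assumption, since the text only describes a starter and one qualifying backup.

```python
def allocate_snaps(qualifying, n_other_backups):
    """Split a team's snaps among qualifying QBs, then evenly among the other backups.

    qualifying: dict of name -> expected proportion of snaps if that QB were the starter.
    Returns (shares for qualifying QBs, share for each remaining backup).
    """
    ordered = sorted(qualifying.items(), key=lambda item: item[1], reverse=True)
    shares = {}
    remaining = 1.0
    for name, proportion in ordered:
        # Each QB takes their expected proportion of whatever is still unclaimed.
        shares[name] = proportion * remaining
        remaining -= shares[name]
    per_backup = remaining / n_other_backups if n_other_backups else 0.0
    return shares, per_backup


# Hypothetical team: a starter at 80%, a qualifying backup at 50%, two other backups.
shares, per_backup = allocate_snaps({"Starter": 0.80, "Backup": 0.50}, 2)
print(shares, per_backup)
```

In this example the starter takes 80% of snaps, the backup takes half of the remaining 20%, and the last 10% is split evenly between the two other backups, so the team total is 100%.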
This table shows the percent of snaps the top quarterbacks are likely to take. Drew Brees is highest, which points to a clear flaw in this methodology: it ignores teams strategically using other quarterbacks even when the starter is healthy and much better. Drew Brees (notwithstanding that he's retiring) is really good and doesn't get sacked often, so he should be unlikely to miss playing time. But in practice, Drew Brees is sometimes swapped out for Taysom Hill as part of the Saints' strategy. Nonetheless, these estimates still seem fair. If you're a backup for any of these quarterbacks, and they don't retire, don't expect to play all that much next season:
After taking the first and second string expected snaps for each team, and then dividing the remaining snaps between the rest of the backups, you get a chart like the one below. The tall bars are the starters, and the smaller bars are the expected snaps left over for the rest of the QBs. Each team's expected snap percentage adds up to 100%:
Answering the Question
For simplicity, I assumed each team would have the same number of passing attempts next season (561). After plugging in each QB's expected pass attempts and expected interception rate into the "interception chance" equation from the beginning, I get the table below showing the chance of throwing an interception for each player.
The three backups in Pittsburgh are the least likely to play next season, owing to both Ben Roethlisberger and Mason Rudolph being expected to play a lot, leaving very few expected snaps for everyone else. Joshua Dobbs and Nick Schuessler get the advantage because they are taller and therefore slightly less likely to throw an interception than Devlin Hodges.
The table below shows how the expected snaps and chance of interception were calculated for the Pittsburgh quarterbacks. Both Roethlisberger and Rudolph are good quarterbacks who don't get sacked much, so they are expected to play 84% and 83% of snaps respectively. Since Roethlisberger is the starter, he gets 84% of total expected snaps, while Rudolph, being the second string, gets 83% of the remaining snaps, or about 13% of total snaps. That leaves only ~3% of Pittsburgh's snaps for the other 3 backups, so each gets about 1% of total snaps, or about 5 snaps. Given their expected interception rates, Dobbs and Schuessler get the lowest chance of throwing an interception next year: 16.1%!
The backups in Kansas City also have a really low chance of throwing an interception due to the very small probability of Patrick Mahomes getting benched or hurt. I guess the takeaway is: if you don't want to throw an interception in the NFL, be a backup for good and injury-proof QBs, and stay as far down the depth chart as you can.
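The Pittsburgh arithmetic can be retraced with a short Python sketch. The 84%, 83%, and 561 figures come from the text. The 3.5% interception rate is an assumption: it is the average for low-attempt QBs rather than the height-adjusted rate of any particular player, so the result only approximates the 16.1% above.

```python
TEAM_ATTEMPTS = 561   # assumed pass attempts per team, as stated in the text

starter_share = 0.84                                   # Roethlisberger
backup_share = 0.83 * (1 - starter_share)              # Rudolph: 83% of what remains
leftover_share = 1 - starter_share - backup_share      # left for the other three QBs
third_string_share = leftover_share / 3
third_string_attempts = third_string_share * TEAM_ATTEMPTS


def chance_of_interception(interception_rate: float, passes: float) -> float:
    """Probability of at least one interception across `passes` independent passes."""
    return 1 - (1 - interception_rate) ** passes


# Assumption: 3.5% is the group average, not a specific player's adjusted rate.
p = chance_of_interception(0.035, third_string_attempts)
print(round(backup_share, 3), round(third_string_attempts, 1), round(p, 3))
```

Each third-string quarterback ends up with roughly 5 expected attempts, which is why their chance of throwing at least one interception is so much lower than any starter's.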
The Data Marketplace Challenge from Snowflake gave me a great opportunity to explore a fun dataset from ThoughtSpot and also do a postmortem on the failed QB situation of the Philadelphia Eagles, my personal favorite team. Carson Wentz, our now former franchise quarterback, has about a 94% chance of throwing an interception next year. It will be bittersweet to see him throw interceptions (and hopefully touchdowns too) in a Colts uniform next year.
With more data (and time) it would be interesting to explore how other factors relate to injuries in QBs, like age, weight, prior injuries, and lifetime sacks and hits. I'm looking forward to seeing what datasets get added to Snowflake's Data Marketplace in the future.
If you're interested in the SQL code used for this analysis, you can download it here: