National Football League Team Statistics

linear regression
principal component analysis
missing data
variable selection
regularization
NFL team statistics from 1999 to 2022
Author

Ron Yurko

Published

May 24, 2023

Motivation

The National Football League (NFL) is the top professional American football league in the world. While a team’s record ultimately determines whether or not it makes the playoffs, its score differential (points scored minus points allowed) is often a better indicator of the team’s ability. But what aspects of a team’s performance are related to its score differential? Is passing more important than rushing? What about offense in comparison to defense? The NFL records a variety of statistics, and the public NFL analytics community has developed advanced metrics such as expected points added (EPA) that provide deeper insight into a team’s performance. With this dataset of statistics dating back to 1999, you can explore variation between teams as well as which types of statistics are relevant predictor variables of record and score differential.

Data

This dataset contains statistics about the regular season performance of each NFL team from 1999 to 2022. The data was collected using the nflreadr package in R.

Each row in the dataset corresponds to a single NFL team in a single regular season. There are a total of 765 team-seasons, with 56 total columns. The column names are organized below by the type of information they contain, with the first set of columns being self-explanatory:

nfl-team-statistics.csv
Variable Description
season Regular season year of team’s statistics
team NFL team three letter abbreviation

There are also columns with season level outcomes:

Variable Description
points_scored Total number of points scored by the team
points_allowed Total number of points allowed by the team
wins Number of games the team won
losses Number of games the team lost
ties Number of games the team tied
score_differential Points scored minus points allowed
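
The outcome columns are tied together by simple arithmetic, which makes for a quick sanity check. The following is a minimal sketch rather than part of the dataset's tooling: the helper names are hypothetical, the season totals are made up for illustration, and counting a tie as half a win is an assumed convention for summarizing a record.

```python
def score_differential(points_for: int, points_against: int) -> int:
    """Points scored minus points allowed, as in the score_differential column."""
    return points_for - points_against


def win_percentage(wins: int, losses: int, ties: int) -> float:
    """Share of games won, counting each tie as half a win (assumed convention)."""
    games = wins + losses + ties
    if games == 0:
        raise ValueError("a team-season needs at least one game")
    return (wins + 0.5 * ties) / games


# Made-up season totals, for illustration only (not taken from the dataset).
example = {"points_for": 400, "points_against": 350,
           "wins": 10, "losses": 5, "ties": 1}

diff = score_differential(example["points_for"], example["points_against"])
pct = win_percentage(example["wins"], example["losses"], example["ties"])
print(diff, round(pct, 3))
```

Applying the same two functions to every row is one way to confirm that the stored score differential agrees with the points columns before modeling.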

There are also several columns corresponding to offensive and defensive summaries of the team’s performance in the season separated by play type (either pass or run):

Variable Description
offense/defense_completion_percentage Passing completion percentage either for (offense) or against (defense)
offense/defense_total_yards_gained_pass/run Total number of yards gained (offense) or allowed (defense) by play type (pass or run)
offense/defense_ave_yards_gained_pass/run Average number of yards gained (offense) or allowed (defense) per play by play type (pass or run)
offense/defense_total_air_yards Total number of air yards gained (offense) or allowed (defense), where air yards are the perpendicular yards traveled from the line of scrimmage to the location of the catch on passing plays
offense/defense_ave_air_yards Average number of air yards gained (offense) or allowed (defense) per passing play
offense/defense_total_yac Total number of yards after catch gained (offense) or allowed (defense)
offense/defense_ave_yac Average number of yards after catch gained (offense) or allowed (defense) per passing play
offense/defense_n_plays_pass/run Total number of plays by the team (offense) or against (defense) by play type (pass or run)
offense/defense_n_interceptions Total number of interceptions thrown (offense) or caught (defense)
offense/defense_n_fumbles_lost_pass/run Total number of fumbles lost (offense) or forced (defense) by play type (pass or run)
offense/defense_total_epa_pass/run Total expected points added (offense) or allowed (defense) by play type (pass or run)
offense/defense_ave_epa_pass/run Average expected points added (offense) or allowed (defense) per play by play type (pass or run)
offense/defense_total_wpa_pass/run Total win probability added (offense) or allowed (defense) by play type (pass or run)
offense/defense_ave_wpa_pass/run Average win probability added (offense) or allowed (defense) per play by play type (pass or run)
offense/defense_success_rate_pass/run Proportion of plays with positive expected points added (offense) or allowed (defense) by play type (pass or run)

The EPA variables are advanced NFL statistics that convey how much value a team adds over the average team in a given situation. EPA is on a points scale rather than the more commonly used yards, because not all yards are created equal in American football (a 10-yard gain on 3rd and 15 is much less valuable than a 2-yard gain on 4th and 1). For offensive statistics, the higher the EPA the better; for defensive statistics, the lower (more negative) the EPA the better. The WPA variables are similar, except that they measure play value in terms of win probability.
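
To make the total, average, and success rate definitions concrete, here is a small sketch of how per-play EPA values could be collapsed into the three season-level summaries described in the table. The function name `summarize_epa` and the per-play values are hypothetical; the dataset itself only provides the aggregated columns.

```python
from statistics import fmean


def summarize_epa(play_epa: list[float]) -> dict[str, float]:
    """Collapse per-play EPA values into total, per-play average, and success rate."""
    if not play_epa:
        raise ValueError("need at least one play")
    return {
        "total_epa": sum(play_epa),
        "ave_epa": fmean(play_epa),
        # Success rate: proportion of plays with strictly positive EPA.
        "success_rate": sum(epa > 0 for epa in play_epa) / len(play_epa),
    }


# Made-up EPA values for five offensive pass plays (illustration only).
pass_plays = [1.5, -0.5, 0.25, -1.0, 2.0]
summary = summarize_epa(pass_plays)
print(summary)
```

The same aggregation applies on the defensive side, where lower totals and averages indicate better performance.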

Questions

  1. Build a linear regression model to predict score differential as a function of the team’s statistics. Which variables, if any, are predictive of score differential in a season? Are certain types of variables more predictive (e.g., passing versus rushing)? Describe their relationships.

  2. Using principal component analysis, project the team-seasons into a lower dimensional space using only the team statistics that do not include points or record. Choose an appropriate number of principal components, and describe which statistics contribute to your selected number of components. Based on your results, which team statistics explain the most variation between team-seasons in the dataset? Which team statistics are independent of each other?
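
As a possible starting point for both questions, the sketch below fits a least-squares regression and runs PCA using NumPy. It is an illustration under stated assumptions, not a prescribed solution: the data are synthetic stand-ins generated at random (not the real team statistics), the variable names are hypothetical, and the choice to standardize columns before fitting is an assumption made so that coefficients and loadings are comparable across statistics measured on different scales.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit sample variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def fit_ols(X, y):
    """Least-squares fit of y on X with an intercept; returns (intercept, slopes)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]


def pca(X):
    """PCA via SVD of standardized data; returns (loadings, explained variance ratio)."""
    Z = standardize(X)
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    variance = singular_values**2 / (len(Z) - 1)
    return Vt.T, variance / variance.sum()


# Synthetic stand-in for team statistics (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
pass_stat = rng.normal(size=n)
run_stat = rng.normal(size=n)
noisy_copy = pass_stat + 0.1 * rng.normal(size=n)  # nearly redundant with pass_stat
X = np.column_stack([pass_stat, run_stat, noisy_copy])
y = 3.0 * pass_stat + 1.0 * run_stat + rng.normal(scale=0.5, size=n)

# Regression on the two non-redundant predictors, standardized for comparability.
intercept, slopes = fit_ols(standardize(X[:, :2]), y)

# PCA on all three columns; each column of `loadings` is one principal component.
loadings, ratio = pca(X)
print(slopes, ratio)
```

In this synthetic setup, two of the three columns are nearly identical, so the first principal component absorbs most of the variation. Many of the real statistics (for example, totals and per-play averages of the same quantity) are likely to be similarly correlated, which is where variable selection or regularization becomes relevant for the regression question.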

References

Ho T, Carl S (2022). nflreadr: Download ‘nflverse’ Data. R package version 1.3.1, https://CRAN.R-project.org/package=nflreadr.