Predicting NFL Team Performance with Elo Rating, Home-Field Advantage, and QB Rating

Exploratory data analysis

library(readr)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ dplyr   1.1.2
✔ tibble  3.2.1     ✔ stringr 1.5.0
✔ tidyr   1.2.1     ✔ forcats 0.5.2
✔ purrr   1.0.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.6     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/

library(scales)
library(dplyr)
library(lubridate)

Loading required package: timechange

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

library(tidyr)

Research question(s)

How have individual team performances (Elo ratings) fluctuated over time? Which NFL teams have been the most or least successful throughout NFL history?
How well does a team’s Elo rating predict regular-season and postseason success?
How much does a quarterback’s Elo score contribute to a team’s overall Elo rating?
Does having the home field really give a team an advantage? Is this advantage different or more pronounced in the playoffs?

Data collection and cleaning

#| label: data-collection
nfl_data <- read_csv("data/nfl_elo.csv")

Rows: 17379 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (5): playoff, team1, team2, qb1, qb2
dbl  (27): season, neutral, elo1_pre, elo2_pre, elo_prob1, elo_prob2, elo1_p...
date  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

active_teams <- c(
  "ARI", "ATL", "BAL", "BUF", "CAR", "CHI", "CIN",
  "CLE", "DAL", "DEN", "DET", "GB", "HOU", "IND",
  "JAX", "KC", "MIA", "MIN", "NE", "NO", "NYG",
  "NYJ", "LV", "PHI", "PIT", "LAC", "SF", "SEA",
  "LAR", "TB", "TEN", "WAS"
)

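nfl_data_clean <- nfl_data |>
  # keep games from 1950 onward between currently active franchises
  filter(season >= 1950, team1 %in% active_teams, team2 %in% active_teams) |>
  mutate(
    # tied games have no winner, so winning_team and home_win are NA
    winning_team = case_when(
      score1 > score2 ~ team1,
      score2 > score1 ~ team2
    ),
    # regular-season games have no playoff code in the raw data
    playoff = if_else(is.na(playoff), "R", playoff),
    home_win = winning_team == team1
  ) |>
  # drop sparsely recorded columns that the analysis does not use
  select(-total_rating, -importance) |>
  group_by(team1) |>
  arrange(team1)

nfl_data_clean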

# A tibble: 13,056 × 33
# Groups:   team1 [30]
   date       season neutral playoff team1 team2 elo1_pre elo2_pre elo_prob1
   <date>      <dbl>   <dbl> <chr>   <chr> <chr>    <dbl>    <dbl>     <dbl>
 1 1950-09-24   1950       0 R       ARI   PHI      1554.    1632.     0.482
 2 1950-10-29   1950       0 R       ARI   NYG      1525.    1529.     0.587
 3 1950-11-05   1950       0 R       ARI   CLE      1547.    1679.     0.405
 4 1950-11-23   1950       0 R       ARI   PIT      1536.    1504.     0.636
 5 1950-12-03   1950       0 R       ARI   CHI      1503.    1668.     0.360
 6 1951-09-30   1951       0 R       ARI   PHI      1506.    1585.     0.479
 7 1951-10-07   1951       0 R       ARI   CHI      1492.    1596.     0.444
 8 1951-10-28   1951       0 R       ARI   PIT      1489.    1467.     0.623
 9 1951-11-04   1951       0 R       ARI   CLE      1454.    1678.     0.286
10 1951-11-25   1951       0 R       ARI   NYG      1456.    1601.     0.387
# ℹ 13,046 more rows
# ℹ 24 more variables: elo_prob2 <dbl>, elo1_post <dbl>, elo2_post <dbl>,
#   qbelo1_pre <dbl>, qbelo2_pre <dbl>, qb1 <chr>, qb2 <chr>,
#   qb1_value_pre <dbl>, qb2_value_pre <dbl>, qb1_adj <dbl>, qb2_adj <dbl>,
#   qbelo_prob1 <dbl>, qbelo_prob2 <dbl>, qb1_game_value <dbl>,
#   qb2_game_value <dbl>, qb1_value_post <dbl>, qb2_value_post <dbl>,
#   qbelo1_post <dbl>, qbelo2_post <dbl>, score1 <dbl>, score2 <dbl>, …

# write_csv() does not add a row-name column, unlike write.csv()
write_csv(nfl_data_clean, "data/nfl_data_clean.csv")

Filter out teams that are no longer active in the NFL.
Add new variables for the winning team (winning_team) and for whether the home team won (home_win).
Filter years so that only seasons with complete values are included. This is done by keeping only seasons from 1950 onward.
Replace all NA values in the playoff column with “R” (regular season).
Remove the columns total_rating and importance, as these statistics were only recorded starting in 2021 and are not relevant to our analysis.
Break the main dataset into smaller datasets, one per research question. This will allow us to select only the columns that are relevant to each question.
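The printed tibble reports 30 groups for team1, while active_teams lists 32 abbreviations, so it may be worth checking whether every abbreviation actually occurs in the raw file. The sketch below shows one way to do that check. It uses a small toy table in place of nfl_data, and the names toy_games and wanted, as well as the team codes in the toy table, are illustrative assumptions rather than values read from the dataset.

```r
library(dplyr)

# Toy stand-in for nfl_data; these rows and codes are illustrative only.
toy_games <- tibble(
  team1 = c("ARI", "OAK", "WSH", "GB"),
  team2 = c("PHI", "DEN", "DAL", "CHI")
)

# A short stand-in for active_teams.
wanted <- c("ARI", "GB", "LV", "WAS")

# Abbreviations we asked for that never appear as a home or away team.
missing_codes <- setdiff(wanted, union(toy_games$team1, toy_games$team2))
missing_codes
```

On the real data, the same call with active_teams and nfl_data in place of the toy objects would list any abbreviations that the filter silently drops.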

Data description

We analyzed the NFL Elo dataset provided by FiveThirtyEight, which contains historical data on NFL games and Elo ratings for each team. The dataset covers games from the beginning of the league in 1920 through the 2021 season. Elo ratings are a measure of a team’s strength and are used to predict the outcome of games based on each team’s relative skill level.
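To make the link between ratings and predictions concrete, here is a minimal sketch of the standard Elo win-probability curve. It assumes the usual 400-point logistic scale and a fixed home-field bonus of 65 points. Both the function name elo_win_prob and the 65-point value are assumptions for illustration, not values documented in the dataset, although they come close to reproducing the elo_prob1 values printed in the cleaned data.

```r
# Win probability for the home team (team1) under a standard Elo model.
# Assumption: 400-point logistic scale with a fixed home-field bonus.
elo_win_prob <- function(elo_home, elo_away, home_adv = 65) {
  rating_diff <- elo_home - elo_away + home_adv
  1 / (1 + 10^(-rating_diff / 400))
}

# Ratings from the second printed row (ARI vs NYG, 1950-10-29).
# The printed ratings are rounded, so the result is approximate.
elo_win_prob(1525, 1529)
```

With these inputs the function returns roughly 0.587, close to the elo_prob1 value shown for that game. Setting home_adv = 0 would model a neutral-site game.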

In addition to the Elo ratings, the dataset includes information about the game date, season, teams involved, where the game is played, playoff status, and the actual scores for each team. The dataset also contains advanced statistics such as quarterback Elo ratings.

Our focus is to analyze how well a team’s Elo rating predicts its probability of winning during the regular season, the playoffs, and the Super Bowl. In doing so, we will consider additional factors such as the location of the game and the quarterback ratings. By incorporating these elements, we aim to gain a deeper understanding of the factors influencing game outcomes, allowing us to draw more informed and nuanced conclusions.

Data limitations

The dataset includes Elo values and the other information needed to address most of the research questions. The exception is the question about how much a quarterback’s Elo score contributes to a team’s overall Elo rating. The data make it straightforward to establish a correlation between the two, but much harder to quantify how much of the team rating is actually attributable to the quarterback. That question can therefore be answered only approximately. The remaining research questions appear answerable with the dataset.
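One partial workaround, sketched below, is to compare how well the plain Elo probabilities and the quarterback-adjusted probabilities predict outcomes, for example with a Brier score (the mean squared error of the predicted probability). This is a suggested approach rather than an analysis already performed here, and it measures predictive value, not causal contribution. The table toy_probs and its values are hypothetical; only the column names elo_prob1 and qbelo_prob1 come from the dataset.

```r
library(dplyr)

# Toy stand-in for three games; probabilities and outcomes are hypothetical.
toy_probs <- tibble(
  elo_prob1   = c(0.60, 0.45, 0.70),
  qbelo_prob1 = c(0.65, 0.40, 0.75),
  home_win    = c(TRUE, FALSE, TRUE)
)

# Brier score: mean squared gap between forecast and outcome (lower is better).
brier <- toy_probs |>
  summarize(
    brier_elo   = mean((elo_prob1 - home_win)^2),
    brier_qbelo = mean((qbelo_prob1 - home_win)^2)
  )
brier
```

A consistently lower Brier score for the quarterback-adjusted forecast on the real data would suggest that the quarterback information adds predictive value beyond the team rating alone.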

Exploratory data analysis

nfl_data_homewins <- nfl_data_clean |>
  group_by(season) |>
  # share of decided games won by the home team (NA outcomes are ignored)
  summarize(pct_home_win = mean(home_win, na.rm = TRUE))

ggplot(nfl_data_homewins, aes(x = season, y = pct_home_win)) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Season",
    y = "Percentage of Home Team Wins",
    title = "How have home teams performed over time?"
  ) +
  scale_y_continuous(labels = label_percent())
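The same summary can be split by game type to compare home-field advantage in the regular season and the playoffs. The sketch below assumes that neutral == 1 marks neutral-site games and that any playoff value other than "R" is a postseason game. The table toy_clean and its values are hypothetical stand-ins for nfl_data_clean.

```r
library(dplyr)

# Toy stand-in for nfl_data_clean; all values are hypothetical.
toy_clean <- tibble(
  playoff  = c("R", "R", "R", "R", "w", "d", "s"),
  neutral  = c(0, 0, 0, 0, 0, 0, 1),
  home_win = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
)

home_by_type <- toy_clean |>
  # neutral-site games have no true home team, so leave them out
  filter(neutral == 0) |>
  mutate(game_type = if_else(playoff == "R", "Regular season", "Playoffs")) |>
  group_by(game_type) |>
  summarize(
    games = n(),
    pct_home_win = mean(home_win, na.rm = TRUE)
  )
home_by_type
```

Reporting the number of games next to each percentage helps, because the playoff sample is much smaller than the regular-season one.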

Questions for reviewers

Do the four research questions adequately cover the overall research goal?
How can we minimize the impact of the dataset’s limitations on our conclusions?