Predicting NFL Team Performance with Elo Rating, Home-Field Advantage, and QB Rating

Proposal

library(tidyverse) # includes dplyr, ggplot2, and readr
library(skimr)     # for quick data summaries

Data 1

Introduction and data

  • Identify the source of the data.

https://github.com/fivethirtyeight/data/tree/master/nfl-elo 

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

For several years now, FiveThirtyEight has been quantifying professional sports teams’ current and historical strength, mostly using Elo rating systems. Their football ratings date back to 1920, and the data is updated after every NFL game, regular season and postseason.

  • Write a brief description of the observations.

Each observation is a single NFL game played since 1920. The columns include the date, the two teams, their scores, the starting quarterbacks, and each team’s Elo (overall) rating before and after the game. Because the Elo rating system takes strength of schedule into account, FiveThirtyEight’s ratings provide a more accurate depiction of a team’s skill level before and after every game in its history.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
  • How have individual team performances (ratings) fluctuated over time? Which NFL teams have been the most or least successful throughout NFL history?
  • How have scores changed over time in the NFL? Are games higher scoring now than they were in 1920? Which eras saw the highest or lowest scoring games? What can explain this?
  • Is home-field advantage really that effective? How has one team in particular fared at home vs. away? Does its margin of victory/defeat differ?

These research questions are relevant for purposes such as sports betting, personal interest, and evaluating rule changes; a rough sketch of how they could be explored follows.
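The sketch below assumes the game-level data has been read into `nfl` as in the Glimpse of data section, and that team1 denotes the home team; the Browns’ team code of "CLE" is an assumption to verify against the data.

# 1. One team's rating over time (team code "CLE" is an assumption)
nfl |>
  filter(team1 == "CLE" | team2 == "CLE") |>
  mutate(team_elo = if_else(team1 == "CLE", elo1_post, elo2_post)) |>
  ggplot(aes(x = date, y = team_elo)) +
  geom_line() +
  labs(x = "Date", y = "Post-game Elo rating")

# 2. Scoring over time: average combined points per game by season
nfl |>
  group_by(season) |>
  summarise(avg_total_points = mean(score1 + score2)) |>
  ggplot(aes(x = season, y = avg_total_points)) +
  geom_line() +
  labs(x = "Season", y = "Average combined points per game")

# 3. Home-field advantage: average home-team margin in non-neutral-site games
nfl |>
  filter(neutral == 0) |>
  summarise(avg_home_margin = mean(score1 - score2))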

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

The research topic is NFL games. For people who are not familiar with it, the NFL is the professional American football league founded in 1920. It comprises 32 teams across the country that each play 17 regular-season games, and the season concludes with a single-elimination playoff bracket. This dataset uses the Elo rating system. Elo is a simple measure of team strength that is updated game by game based on results, providing a running estimate of a team’s rating before and after every game it has played; a generic version of the Elo update is sketched after this list. For my research questions, I would hypothesize that teams such as the Browns have had the least success throughout their history and that their Elo rating has remained quite low. For the others, I would hypothesize that games have become higher scoring over time and that playing on one’s home field is an important factor in determining the outcome of a given game.

  • Identify the types of variables in your research question. Categorical? Quantitative?

Most of the variables in these research questions are quantitative. The teams are categorical, while variables such as points scored, Elo rating, and date are quantitative and continuous.
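For intuition, here is a minimal sketch of a generic Elo update. This is not FiveThirtyEight’s exact NFL formula, which also adjusts for factors such as home field and margin of victory; the K-factor of 20 below is an illustrative assumption.

elo_expected <- function(rating, opp_rating) {
  # expected win probability for a team rated `rating` against `opp_rating`
  1 / (1 + 10^((opp_rating - rating) / 400))
}

elo_update <- function(rating, opp_rating, result, k = 20) {
  # `result` is 1 for a win, 0.5 for a tie, 0 for a loss
  rating + k * (result - elo_expected(rating, opp_rating))
}

# example: a 1550-rated team beats a 1500-rated team
elo_update(1550, 1500, result = 1)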

Glimpse of data

# read in FiveThirtyEight's game-level NFL Elo data
nfl <- read_csv("data/nfl_elo.csv")
Rows: 17379 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (5): playoff, team1, team2, qb1, qb2
dbl  (27): season, neutral, elo1_pre, elo2_pre, elo_prob1, elo_prob2, elo1_p...
date  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(nfl)
Data summary
Name nfl
Number of rows 17379
Number of columns 33
_______________________
Column type frequency:
character 5
Date 1
numeric 27
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
playoff 16763 0.04 1 1 0 4 0
team1 0 1.00 2 3 0 101 0
team2 0 1.00 2 3 0 108 0
qb1 2162 0.88 7 18 0 648 0
qb2 2162 0.88 7 18 0 670 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 1920-09-26 2023-02-12 1989-11-19 3622

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
season 0 1.00 1984.86 26.32 1920.00 1968.00 1989.00 2006.00 2022.00 ▂▂▆▇▇
neutral 0 1.00 0.01 0.08 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
elo1_pre 0 1.00 1503.13 104.83 1119.60 1430.26 1504.84 1578.27 1839.66 ▁▃▇▅▁
elo2_pre 0 1.00 1499.47 104.15 1156.55 1426.66 1500.86 1576.08 1849.48 ▁▅▇▅▁
elo_prob1 0 1.00 0.58 0.17 0.07 0.46 0.60 0.72 0.97 ▁▅▇▇▃
elo_prob2 0 1.00 0.42 0.17 0.03 0.28 0.40 0.54 0.93 ▃▇▇▅▁
elo1_post 0 1.00 1502.84 107.36 1119.60 1427.69 1504.59 1580.89 1849.48 ▁▃▇▅▁
elo2_post 0 1.00 1499.76 106.34 1153.90 1425.08 1500.92 1577.10 1831.46 ▁▅▇▅▁
qbelo1_pre 2162 0.88 1504.29 99.90 1149.70 1434.58 1505.76 1574.54 1806.39 ▁▃▇▆▁
qbelo2_pre 2162 0.88 1502.86 98.55 1152.47 1434.79 1504.38 1574.63 1814.37 ▁▃▇▅▁
qb1_value_pre 2162 0.88 97.06 58.78 -53.78 54.06 91.48 133.56 329.56 ▂▇▆▂▁
qb2_value_pre 2162 0.88 96.97 58.20 -47.29 54.89 91.51 132.80 327.72 ▂▇▆▂▁
qb1_adj 2162 0.88 -2.06 26.84 -242.49 -9.28 1.76 12.25 119.69 ▁▁▁▇▁
qb2_adj 2162 0.88 -2.09 27.46 -235.05 -9.22 1.92 12.21 116.80 ▁▁▁▇▁
qbelo_prob1 2162 0.88 0.58 0.18 0.06 0.45 0.59 0.71 0.97 ▁▅▇▇▃
qbelo_prob2 2162 0.88 0.42 0.18 0.03 0.29 0.41 0.55 0.94 ▃▇▇▅▁
qb1_game_value 2162 0.88 109.96 133.67 -385.74 16.48 107.37 200.17 713.70 ▁▅▇▂▁
qb2_game_value 2162 0.88 89.17 132.09 -413.97 -4.75 84.97 177.93 605.10 ▁▃▇▃▁
qb1_value_post 2162 0.88 98.35 59.03 -46.33 55.10 92.54 134.82 327.72 ▂▇▆▂▁
qb2_value_post 2162 0.88 96.19 58.39 -53.78 53.42 90.80 132.72 329.56 ▂▇▆▂▁
qbelo1_post 2162 0.88 1504.27 102.32 1164.33 1432.90 1505.08 1578.37 1814.37 ▁▃▇▅▁
qbelo2_post 2162 0.88 1502.88 100.79 1149.70 1432.49 1504.98 1574.87 1806.22 ▁▃▇▆▁
score1 0 1.00 21.68 11.22 0.00 14.00 21.00 28.00 72.00 ▅▇▃▁▁
score2 0 1.00 18.83 10.79 0.00 10.00 18.00 26.00 73.00 ▇▇▃▁▁
quality 2162 0.88 47.90 29.33 0.00 21.00 48.00 73.00 100.00 ▇▆▇▆▆
importance 16810 0.03 49.70 31.18 0.00 20.00 51.00 76.00 100.00 ▇▅▆▆▇
total_rating 16810 0.03 48.01 26.40 0.00 26.00 48.00 68.00 100.00 ▆▆▇▇▃

Data 2

Introduction and data

  • Identify the source of the data:

    The dataset is obtained from NYC Open Data:

    https://data.cityofnewyork.us/Business/Consumer-Services-Mediated-Complaints/nre2-6m2s

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data is collected by the DCA Consumer Services Division as it handles and mediates consumer complaints. This project examines those complaints, with a focus on businesses operating mainly in New York City between 2013 and 2023.

  • Write a brief description of the observations.

    The dataset comprises 23,716 consumer complaints resolved through mediation between 2013 and 2023. Each complaint is described by 17 columns, including Business Name, Industry, Complaint Type, Mediation Start Date, Mediation Close Date, Complaint Result, Satisfaction, Restitution, Business Building, Business Street, Building Address Unit, Business City, Business State, Business Zip, Complainant Zip, Longitude, and Latitude.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    1- What does the distribution of customer complaints and complaint types look like at both the national and regional levels?

    2- How does the number of complaints vary over time?

    3- Are there possible connections between approved restitutions and satisfaction/dissatisfaction?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    This is an analysis of Consumer Complaints Against Businesses Based in New York between 2013 and 2023. It aims to derive a relationship between time and the number of complaints, as well as how this number fluctuates across industries.

    We expect certain industries to have higher rates of complaints than others; a rough sketch of these summaries follows this list.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    The variables are both categorical and quantitative.
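The sketch below assumes `business_complaints` has been read in as in the Glimpse of data section and that the mediation dates are stored as "MM/DD/YYYY" strings, an assumption to verify against the raw file.

# Complaint counts by industry
business_complaints |>
  count(Industry, sort = TRUE)

# Complaints over time, by year in which mediation started
business_complaints |>
  mutate(start_year = lubridate::year(lubridate::mdy(`Mediation Start Date`))) |>
  count(start_year) |>
  ggplot(aes(x = start_year, y = n)) +
  geom_col() +
  labs(x = "Year mediation started", y = "Number of complaints")

# Restitution amounts by reported satisfaction
business_complaints |>
  group_by(Satisfaction) |>
  summarise(mean_restitution = mean(Restitution, na.rm = TRUE), n = n())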

Glimpse of data

business_complaints <- read_csv("data/Consumer_Services_Mediated_Complaints.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 23716 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): Business Name, Industry, Complaint Type, Mediation Start Date, Med...
dbl  (3): Restitution, Longitude, Latitude

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(business_complaints)
Data summary
Name business_complaints
Number of rows 23716
Number of columns 17
_______________________
Column type frequency:
character 14
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Business Name 719 0.97 2 88 0 12466 0
Industry 0 1.00 5 48 0 83 0
Complaint Type 1864 0.92 4 39 0 71 0
Mediation Start Date 1 1.00 10 10 0 1722 0
Mediation Close Date 0 1.00 10 10 0 2290 0
Complaint Result 0 1.00 10 50 0 26 0
Satisfaction 3949 0.83 2 50 0 3 0
Business Building 1670 0.93 1 15 0 4852 0
Business Street 2047 0.91 1 29 0 3338 0
Building Address Unit 19537 0.18 1 10 0 922 0
Business City 1609 0.93 4 20 0 837 0
Business State 1608 0.93 2 17 0 47 0
Business Zip 1608 0.93 2 5 0 1277 0
Complainant Zip 1904 0.92 2 5 0 1816 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Restitution 1 1.00 417.09 2742.08 0.00 0.00 0.00 50.00 132616.00 ▇▁▁▁▁
Longitude 10241 0.57 -73.92 0.08 -74.25 -73.98 -73.93 -73.87 -73.70 ▁▁▇▆▂
Latitude 10241 0.57 40.73 0.08 40.50 40.67 40.73 40.77 40.91 ▁▅▇▇▃

Data 3

Introduction and data

  • Identify the source of the data.

    https://www.gutenberg.org/ebooks/search/?sort_order=downloads

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    Project Gutenberg is a digital library that offers over 60,000 free e-books, mostly in the public domain, meaning their copyrights have expired. The project was launched in 1971 by Michael S. Hart and aims to encourage the creation and distribution of e-books in different formats for a variety of devices.

  • Write a brief description of the observations.

    Since its launch in 1971, Project Gutenberg has tracked the number of downloads for each e-book, which serves as a measure of its popularity. The dataset includes the book title, link, author, author age, number of downloads, date published, and book popularity.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    What is the relationship between book popularity and the number of downloads? Do the author’s age at publication and the number of books published per author influence the popularity of books?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    It is hypothesized that there is a positive relationship between book popularity and the number of downloads: the more popular a book is, the more downloads it will receive. Furthermore, books published by older authors or by authors with a larger number of published works may be more popular than those published by younger authors or authors with fewer works. A sketch of this comparison follows this list.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    Categorical and quantitative.
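Since the scraped table keeps most of the information in generic columns (Field1 through Field4, subtitle, extra), the download counts and author details will first need to be parsed out. The sketch below therefore uses hypothetical cleaned column names (`downloads`, `author_age_at_publication`) as placeholders for that future cleaning step.

plot_downloads_vs_age <- function(cleaned_books) {
  # `cleaned_books` is assumed to contain numeric `downloads` and
  # `author_age_at_publication` columns produced by a future parsing step
  cleaned_books |>
    ggplot(aes(x = author_age_at_publication, y = downloads)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)
}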

Glimpse of data

library_data <- read_csv("data/project-gutenberg-data.csv")
Rows: 100 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Title, Title_link, Thumbnail, subtitle, extra, Headline, Field1, F...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(library_data)
Data summary
Name library_data
Number of rows 100
Number of columns 10
_______________________
Column type frequency:
character 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Title 9 0.91 4 88 0 91 0
Title_link 9 0.91 35 64 0 91 0
Thumbnail 11 0.89 60 66 0 89 0
subtitle 12 0.88 5 33 0 65 0
extra 11 0.89 14 16 0 89 0
Headline 11 0.89 18 83 0 89 0
Field1 11 0.89 4 166 0 87 0
Field2 39 0.61 2 64 0 39 0
Field3 60 0.40 3 55 0 29 0
Field4 11 0.89 9 55 0 67 0

Data 4

Introduction and data

  • Identify the source of the data.

    https://think.cs.vt.edu/corgis/csv/coffee/

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data comes from https://think.cs.vt.edu/corgis/csv/coffee/ and was created on 10/28/2022. Sam Donald compiled the data from the Coffee Quality Database, courtesy of Buzzfeed data scientist James LeDoux. It is now hosted on CORGIS: The Collection of Really Great, Interesting, Situated Datasets.

  • Write a brief description of the observations.

    There are 989 observations, each representing an individual coffee bean sample of a specific type, from a specific region, under a specific processing treatment. The dataset records information such as location, year, and species, and contains scores for each sample evaluating its flavor, sweetness, moisture, and other attributes.

Research Question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • What combinations of location and processing method make a good coffee?

    • How do Arabica coffee bean scores from different regions differ?

    • How do the coffee bean’s location and altitude influence its score?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    The research topic is the factors influencing coffee bean scores. Coffee is a necessity in our lives, especially for college students, because it boosts our energy level. Facing the many types of coffee in a coffee shop, we often have difficulty selecting the “best” coffee for our preferences. The dataset provides a variety of scored categories for each coffee bean, which act as helpful indicators in our daily life, and it also records information such as the type, location, and processing method of each bean. We would like to know how these factors influence the coffee bean score. We hypothesize that location is one of the most important factors influencing a coffee bean’s flavor and score.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    The independent variables include both categorical data (country, species, processing method, etc.) and quantitative data (location altitude, production amount, year, etc.). The dependent variables are the scores: total score, sweetness score, flavor score, and so on. A rough sketch of these comparisons is shown below.
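The sketch below assumes `coffee_data` has been read in as in the Glimpse of data section; the column names are taken from the skim output, and the extreme altitude values visible there may need to be treated as data-entry errors before modeling.

# Average total score by country and processing method
coffee_data |>
  group_by(Location.Country, `Data.Type.Processing method`) |>
  summarise(mean_total = mean(Data.Scores.Total), n = n(), .groups = "drop") |>
  arrange(desc(mean_total))

# Total score versus average altitude
coffee_data |>
  ggplot(aes(x = Location.Altitude.Average, y = Data.Scores.Total)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE)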

Glimpse of data

coffee_data <- read_csv("data/coffee.csv")
Rows: 989 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): Location.Country, Location.Region, Data.Owner, Data.Type.Species, ...
dbl (16): Location.Altitude.Min, Location.Altitude.Max, Location.Altitude.Av...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(coffee_data)
Data summary
Name coffee_data
Number of rows 989
Number of columns 23
_______________________
Column type frequency:
character 7
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Location.Country 0 1 4 28 0 32 0
Location.Region 0 1 3 76 0 278 0
Data.Owner 0 1 3 50 0 263 0
Data.Type.Species 0 1 7 7 0 2 0
Data.Type.Variety 0 1 3 21 0 28 0
Data.Type.Processing method 0 1 3 25 0 6 0
Data.Color 0 1 4 12 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Location.Altitude.Min 0 1 1640.08 9192.52 0 905.00 1300.00 1550.00 190164.00 ▇▁▁▁▁
Location.Altitude.Max 0 1 1675.93 9191.96 0 950.00 1310.00 1600.00 190164.00 ▇▁▁▁▁
Location.Altitude.Average 0 1 1658.00 9192.06 0 950.00 1300.00 1600.00 190164.00 ▇▁▁▁▁
Year 0 1 2013.55 1.66 2010 2012.00 2013.00 2015.00 2018.00 ▁▇▃▃▁
Data.Production.Number of bags 0 1 151.76 125.67 1 15.00 170.00 275.00 600.00 ▇▁▇▁▁
Data.Production.Bag weight 0 1 210.49 1666.71 0 1.00 60.00 69.00 19200.00 ▇▁▁▁▁
Data.Scores.Aroma 0 1 7.57 0.40 0 7.42 7.58 7.75 8.75 ▁▁▁▁▇
Data.Scores.Flavor 0 1 7.52 0.42 0 7.33 7.50 7.75 8.83 ▁▁▁▁▇
Data.Scores.Aftertaste 0 1 7.39 0.43 0 7.25 7.42 7.58 8.67 ▁▁▁▁▇
Data.Scores.Acidity 0 1 7.54 0.40 0 7.33 7.58 7.75 8.75 ▁▁▁▁▇
Data.Scores.Body 0 1 7.51 0.39 0 7.33 7.50 7.67 8.50 ▁▁▁▁▇
Data.Scores.Balance 0 1 7.50 0.43 0 7.33 7.50 7.75 8.58 ▁▁▁▁▇
Data.Scores.Uniformity 0 1 9.82 0.59 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
Data.Scores.Sweetness 0 1 9.83 0.69 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
Data.Scores.Moisture 0 1 0.09 0.04 0 0.10 0.11 0.12 0.28 ▃▇▆▁▁
Data.Scores.Total 0 1 81.97 3.86 0 81.08 82.50 83.58 90.58 ▁▁▁▁▇