Predicting NFL Team Performance with Elo Rating, Home-Field Advantage, and QB Rating

Proposal

library(tidyverse) # includes dplyr, ggplot2, and readr
library(skimr)     # for quick data summaries

Data 1

Introduction and data

  • Identify the source of the data.

https://github.com/fivethirtyeight/data/tree/master/nfl-elo 

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

For several years now, FiveThirtyEight has been quantifying professional sports teams’ current and historical strength, mostly using Elo rating systems. Their football ratings date back to 1920, and the data is updated after every NFL game, regular season and postseason.

  • Write a brief description of the observations.

Each observation is a single NFL game played since 1920. The columns include the date, the two teams, their scores, the starting quarterbacks, and each team’s Elo (overall) rating before and after the game. Because the Elo rating system takes strength of schedule into account, FiveThirtyEight’s ratings provide a more accurate depiction of a team’s skill level before and after every game in its history.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
  • How have individual team performances (ratings) fluctuated over time? Which NFL teams have been the most or least successful throughout NFL history?
  • How have scores changed over time in the NFL? Are games higher scoring now than they were in 1920? Which eras saw the highest or lowest scoring games? What can explain this?
  • Is home-field advantage really that effective? How has one team in particular fared at home vs. away? Does its margin of victory/defeat differ?

These research questions are relevant for purposes such as sports betting, personal interest, and evaluating rule changes; a rough sketch of how they could be explored follows.
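The sketch below assumes the game-level data has been read into `nfl` as in the Glimpse of data section, and that team1 denotes the home team; the Browns’ team code of "CLE" is an assumption to verify against the data.

# 1. One team's rating over time (team code "CLE" is an assumption)
nfl |>
  filter(team1 == "CLE" | team2 == "CLE") |>
  mutate(team_elo = if_else(team1 == "CLE", elo1_post, elo2_post)) |>
  ggplot(aes(x = date, y = team_elo)) +
  geom_line() +
  labs(x = "Date", y = "Post-game Elo rating")

# 2. Scoring over time: average combined points per game by season
nfl |>
  group_by(season) |>
  summarise(avg_total_points = mean(score1 + score2)) |>
  ggplot(aes(x = season, y = avg_total_points)) +
  geom_line() +
  labs(x = "Season", y = "Average combined points per game")

# 3. Home-field advantage: average home-team margin in non-neutral-site games
nfl |>
  filter(neutral == 0) |>
  summarise(avg_home_margin = mean(score1 - score2))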

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

The research topic is NFL games. For people who are not familiar with it, the NFL is the professional American football league founded in 1920. It comprises 32 teams across the country that each play 17 regular-season games, and the season concludes with a single-elimination playoff bracket. This dataset uses the Elo rating system. Elo is a simple measure of team strength that is updated game by game based on results, providing a running estimate of a team’s rating before and after every game it has played; a generic version of the Elo update is sketched after this list. For my research questions, I would hypothesize that teams such as the Browns have had the least success throughout their history and that their Elo rating has remained quite low. For the others, I would hypothesize that games have become higher scoring over time and that playing on one’s home field is an important factor in determining the outcome of a given game.

  • Identify the types of variables in your research question. Categorical? Quantitative?

Most of the variables in these research questions are quantitative. The teams are categorical, while variables such as points scored, Elo rating, and date are quantitative and continuous.
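For intuition, here is a minimal sketch of a generic Elo update. This is not FiveThirtyEight’s exact NFL formula, which also adjusts for factors such as home field and margin of victory; the K-factor of 20 below is an illustrative assumption.

elo_expected <- function(rating, opp_rating) {
  # expected win probability for a team rated `rating` against `opp_rating`
  1 / (1 + 10^((opp_rating - rating) / 400))
}

elo_update <- function(rating, opp_rating, result, k = 20) {
  # `result` is 1 for a win, 0.5 for a tie, 0 for a loss
  rating + k * (result - elo_expected(rating, opp_rating))
}

# example: a 1550-rated team beats a 1500-rated team
elo_update(1550, 1500, result = 1)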

Glimpse of data

# read in FiveThirtyEight's game-level NFL Elo data
nfl <- read_csv("data/nfl_elo.csv")
Rows: 17379 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (5): playoff, team1, team2, qb1, qb2
dbl  (27): season, neutral, elo1_pre, elo2_pre, elo_prob1, elo_prob2, elo1_p...
date  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(nfl)
Data summary
Name nfl
Number of rows 17379
Number of columns 33
_______________________
Column type frequency:
character 5
Date 1
numeric 27
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
playoff 16763 0.04 1 1 0 4 0
team1 0 1.00 2 3 0 101 0
team2 0 1.00 2 3 0 108 0
qb1 2162 0.88 7 18 0 648 0
qb2 2162 0.88 7 18 0 670 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 1920-09-26 2023-02-12 1989-11-19 3622

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
season 0 1.00 1984.86 26.32 1920.00 1968.00 1989.00 2006.00 2022.00 ▂▂▆▇▇
neutral 0 1.00 0.01 0.08 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
elo1_pre 0 1.00 1503.13 104.83 1119.60 1430.26 1504.84 1578.27 1839.66 ▁▃▇▅▁
elo2_pre 0 1.00 1499.47 104.15 1156.55 1426.66 1500.86 1576.08 1849.48 ▁▅▇▅▁
elo_prob1 0 1.00 0.58 0.17 0.07 0.46 0.60 0.72 0.97 ▁▅▇▇▃
elo_prob2 0 1.00 0.42 0.17 0.03 0.28 0.40 0.54 0.93 ▃▇▇▅▁
elo1_post 0 1.00 1502.84 107.36 1119.60 1427.69 1504.59 1580.89 1849.48 ▁▃▇▅▁
elo2_post 0 1.00 1499.76 106.34 1153.90 1425.08 1500.92 1577.10 1831.46 ▁▅▇▅▁
qbelo1_pre 2162 0.88 1504.29 99.90 1149.70 1434.58 1505.76 1574.54 1806.39 ▁▃▇▆▁
qbelo2_pre 2162 0.88 1502.86 98.55 1152.47 1434.79 1504.38 1574.63 1814.37 ▁▃▇▅▁
qb1_value_pre 2162 0.88 97.06 58.78 -53.78 54.06 91.48 133.56 329.56 ▂▇▆▂▁
qb2_value_pre 2162 0.88 96.97 58.20 -47.29 54.89 91.51 132.80 327.72 ▂▇▆▂▁
qb1_adj 2162 0.88 -2.06 26.84 -242.49 -9.28 1.76 12.25 119.69 ▁▁▁▇▁
qb2_adj 2162 0.88 -2.09 27.46 -235.05 -9.22 1.92 12.21 116.80 ▁▁▁▇▁
qbelo_prob1 2162 0.88 0.58 0.18 0.06 0.45 0.59 0.71 0.97 ▁▅▇▇▃
qbelo_prob2 2162 0.88 0.42 0.18 0.03 0.29 0.41 0.55 0.94 ▃▇▇▅▁
qb1_game_value 2162 0.88 109.96 133.67 -385.74 16.48 107.37 200.17 713.70 ▁▅▇▂▁
qb2_game_value 2162 0.88 89.17 132.09 -413.97 -4.75 84.97 177.93 605.10 ▁▃▇▃▁
qb1_value_post 2162 0.88 98.35 59.03 -46.33 55.10 92.54 134.82 327.72 ▂▇▆▂▁
qb2_value_post 2162 0.88 96.19 58.39 -53.78 53.42 90.80 132.72 329.56 ▂▇▆▂▁
qbelo1_post 2162 0.88 1504.27 102.32 1164.33 1432.90 1505.08 1578.37 1814.37 ▁▃▇▅▁
qbelo2_post 2162 0.88 1502.88 100.79 1149.70 1432.49 1504.98 1574.87 1806.22 ▁▃▇▆▁
score1 0 1.00 21.68 11.22 0.00 14.00 21.00 28.00 72.00 ▅▇▃▁▁
score2 0 1.00 18.83 10.79 0.00 10.00 18.00 26.00 73.00 ▇▇▃▁▁
quality 2162 0.88 47.90 29.33 0.00 21.00 48.00 73.00 100.00 ▇▆▇▆▆
importance 16810 0.03 49.70 31.18 0.00 20.00 51.00 76.00 100.00 ▇▅▆▆▇
total_rating 16810 0.03 48.01 26.40 0.00 26.00 48.00 68.00 100.00 ▆▆▇▇▃

Data 2

Introduction and data

  • Identify the source of the data:

    The dataset is obtained from NYC Open Data:

    https://data.cityofnewyork.us/Business/Consumer-Services-Mediated-Complaints/nre2-6m2s

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data is collected by the DCA Consumer Services Division as it handles and mediates consumer complaints. This project examines those complaints, with a focus on businesses operating mainly in New York City between 2013 and 2023.

  • Write a brief description of the observations.

    The dataset comprises 23,716 consumer complaints resolved through mediation between 2013 and 2023. Each complaint is described by 17 columns, including Business Name, Industry, Complaint Type, Mediation Start Date, Mediation Close Date, Complaint Result, Satisfaction, Restitution, Business Building, Business Street, Building Address Unit, Business City, Business State, Business Zip, Complainant Zip, Longitude, and Latitude.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    1- What does the distribution of customer complaints and complaint types look like at both the national and regional levels?

    2- How does the number of complaints vary over time?

    3- Are there possible connections between approved restitutions and satisfaction/dissatisfaction?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    This is an analysis of Consumer Complaints Against Businesses Based in New York between 2013 and 2023. It aims to derive a relationship between time and the number of complaints, as well as how this number fluctuates across industries.

    We expect certain industries to have higher rates of complaints than others; a rough sketch of these summaries follows this list.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    The variables are both categorical and quantitative.
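The sketch below assumes `business_complaints` has been read in as in the Glimpse of data section and that the mediation dates are stored as "MM/DD/YYYY" strings, an assumption to verify against the raw file.

# Complaint counts by industry
business_complaints |>
  count(Industry, sort = TRUE)

# Complaints over time, by year in which mediation started
business_complaints |>
  mutate(start_year = lubridate::year(lubridate::mdy(`Mediation Start Date`))) |>
  count(start_year) |>
  ggplot(aes(x = start_year, y = n)) +
  geom_col() +
  labs(x = "Year mediation started", y = "Number of complaints")

# Restitution amounts by reported satisfaction
business_complaints |>
  group_by(Satisfaction) |>
  summarise(mean_restitution = mean(Restitution, na.rm = TRUE), n = n())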

Glimpse of data

business_complaints <- read_csv("data/Consumer_Services_Mediated_Complaints.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 23716 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): Business Name, Industry, Complaint Type, Mediation Start Date, Med...
dbl  (3): Restitution, Longitude, Latitude

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(business_complaints)
Data summary
Name business_complaints
Number of rows 23716
Number of columns 17
_______________________
Column type frequency:
character 14
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Business Name 719 0.97 2 88 0 12466 0
Industry 0 1.00 5 48 0 83 0
Complaint Type 1864 0.92 4 39 0 71 0
Mediation Start Date 1 1.00 10 10 0 1722 0
Mediation Close Date 0 1.00 10 10 0 2290 0
Complaint Result 0 1.00 10 50 0 26 0
Satisfaction 3949 0.83 2 50 0 3 0
Business Building 1670 0.93 1 15 0 4852 0
Business Street 2047 0.91 1 29 0 3338 0
Building Address Unit 19537 0.18 1 10 0 922 0
Business City 1609 0.93 4 20 0 837 0
Business State 1608 0.93 2 17 0 47 0
Business Zip 1608 0.93 2 5 0 1277 0
Complainant Zip 1904 0.92 2 5 0 1816 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Restitution 1 1.00 417.09 2742.08 0.00 0.00 0.00 50.00 132616.00 ▇▁▁▁▁
Longitude 10241 0.57 -73.92 0.08 -74.25 -73.98 -73.93 -73.87 -73.70 ▁▁▇▆▂
Latitude 10241 0.57 40.73 0.08 40.50 40.67 40.73 40.77 40.91 ▁▅▇▇▃

Data 3

Introduction and data

  • Identify the source of the data.

    https://www.gutenberg.org/ebooks/search/?sort_order=downloads

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    Project Gutenberg is a digital library that offers over 60,000 free e-books, mostly in the public domain, meaning their copyrights have expired. The project was launched in 1971 by Michael S. Hart and aims to encourage the creation and distribution of e-books in different formats for a variety of devices.

  • Write a brief description of the observations.

    Since its launch in 1971, Project Gutenberg has tracked the number of downloads for each e-book, which serves as a measure of its popularity. The dataset includes the book title, link, author, author age, number of downloads, date published, and book popularity.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    What is the relationship between book popularity and the number of downloads? Do the author’s age at publication and the number of books published per author influence the popularity of books?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    It is hypothesized that there is a positive relationship between book popularity and the number of downloads: the more popular a book is, the more downloads it will receive. Furthermore, books published by older authors or by authors with a larger number of published works may be more popular than those published by younger authors or authors with fewer works. A sketch of this comparison follows this list.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    Categorical and quantitative.
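Since the scraped table keeps most of the information in generic columns (Field1 through Field4, subtitle, extra), the download counts and author details will first need to be parsed out. The sketch below therefore uses hypothetical cleaned column names (`downloads`, `author_age_at_publication`) as placeholders for that future cleaning step.

plot_downloads_vs_age <- function(cleaned_books) {
  # `cleaned_books` is assumed to contain numeric `downloads` and
  # `author_age_at_publication` columns produced by a future parsing step
  cleaned_books |>
    ggplot(aes(x = author_age_at_publication, y = downloads)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)
}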

Glimpse of data

library_data <- read_csv("data/project-gutenberg-data.csv")
Rows: 100 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Title, Title_link, Thumbnail, subtitle, extra, Headline, Field1, F...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(library_data)
Data summary
Name library_data
Number of rows 100
Number of columns 10
_______________________
Column type frequency:
character 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Title 9 0.91 4 88 0 91 0
Title_link 9 0.91 35 64 0 91 0
Thumbnail 11 0.89 60 66 0 89 0
subtitle 12 0.88 5 33 0 65 0
extra 11 0.89 14 16 0 89 0
Headline 11 0.89 18 83 0 89 0
Field1 11 0.89 4 166 0 87 0
Field2 39 0.61 2 64 0 39 0
Field3 60 0.40 3 55 0 29 0
Field4 11 0.89 9 55 0 67 0

Data 4

Introduction and data

  • Identify the source of the data.

    https://think.cs.vt.edu/corgis/csv/coffee/

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data comes from https://think.cs.vt.edu/corgis/csv/coffee/ and was created on 10/28/2022. Sam Donald compiled the data from the Coffee Quality Database, courtesy of Buzzfeed data scientist James LeDoux. It is now hosted on CORGIS: The Collection of Really Great, Interesting, Situated Datasets.

  • Write a brief description of the observations.

    There are 989 observations, each representing an individual coffee bean sample of a specific type, from a specific region, under a specific processing treatment. The dataset records information such as location, year, and species, and contains scores for each sample evaluating its flavor, sweetness, moisture, and other attributes.

Research Question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • What combinations of location and processing method make a good coffee?

    • How do Arabica coffee bean scores from different regions differ?

    • How do the coffee bean’s location and altitude influence its score?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    The research topic is the factors influencing coffee bean scores. Coffee is a necessity in our lives, especially for college students, because it boosts our energy level. Facing the many types of coffee in a coffee shop, we often have difficulty selecting the “best” coffee for our preferences. The dataset provides a variety of scored categories for each coffee bean, which act as helpful indicators in our daily life, and it also records information such as the type, location, and processing method of each bean. We would like to know how these factors influence the coffee bean score. We hypothesize that location is one of the most important factors influencing a coffee bean’s flavor and score.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    The independent variables include both categorical data (country, species, processing method, etc.) and quantitative data (location altitude, production amount, year, etc.). The dependent variables are the scores: total score, sweetness score, flavor score, and so on. A rough sketch of these comparisons is shown below.
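The sketch below assumes `coffee_data` has been read in as in the Glimpse of data section; the column names are taken from the skim output, and the extreme altitude values visible there may need to be treated as data-entry errors before modeling.

# Average total score by country and processing method
coffee_data |>
  group_by(Location.Country, `Data.Type.Processing method`) |>
  summarise(mean_total = mean(Data.Scores.Total), n = n(), .groups = "drop") |>
  arrange(desc(mean_total))

# Total score versus average altitude
coffee_data |>
  ggplot(aes(x = Location.Altitude.Average, y = Data.Scores.Total)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE)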

Glimpse of data

coffee_data <- read_csv("data/coffee.csv")
Rows: 989 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): Location.Country, Location.Region, Data.Owner, Data.Type.Species, ...
dbl (16): Location.Altitude.Min, Location.Altitude.Max, Location.Altitude.Av...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(coffee_data)
Data summary
Name coffee_data
Number of rows 989
Number of columns 23
_______________________
Column type frequency:
character 7
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Location.Country 0 1 4 28 0 32 0
Location.Region 0 1 3 76 0 278 0
Data.Owner 0 1 3 50 0 263 0
Data.Type.Species 0 1 7 7 0 2 0
Data.Type.Variety 0 1 3 21 0 28 0
Data.Type.Processing method 0 1 3 25 0 6 0
Data.Color 0 1 4 12 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Location.Altitude.Min 0 1 1640.08 9192.52 0 905.00 1300.00 1550.00 190164.00 ▇▁▁▁▁
Location.Altitude.Max 0 1 1675.93 9191.96 0 950.00 1310.00 1600.00 190164.00 ▇▁▁▁▁
Location.Altitude.Average 0 1 1658.00 9192.06 0 950.00 1300.00 1600.00 190164.00 ▇▁▁▁▁
Year 0 1 2013.55 1.66 2010 2012.00 2013.00 2015.00 2018.00 ▁▇▃▃▁
Data.Production.Number of bags 0 1 151.76 125.67 1 15.00 170.00 275.00 600.00 ▇▁▇▁▁
Data.Production.Bag weight 0 1 210.49 1666.71 0 1.00 60.00 69.00 19200.00 ▇▁▁▁▁
Data.Scores.Aroma 0 1 7.57 0.40 0 7.42 7.58 7.75 8.75 ▁▁▁▁▇
Data.Scores.Flavor 0 1 7.52 0.42 0 7.33 7.50 7.75 8.83 ▁▁▁▁▇
Data.Scores.Aftertaste 0 1 7.39 0.43 0 7.25 7.42 7.58 8.67 ▁▁▁▁▇
Data.Scores.Acidity 0 1 7.54 0.40 0 7.33 7.58 7.75 8.75 ▁▁▁▁▇
Data.Scores.Body 0 1 7.51 0.39 0 7.33 7.50 7.67 8.50 ▁▁▁▁▇
Data.Scores.Balance 0 1 7.50 0.43 0 7.33 7.50 7.75 8.58 ▁▁▁▁▇
Data.Scores.Uniformity 0 1 9.82 0.59 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
Data.Scores.Sweetness 0 1 9.83 0.69 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
Data.Scores.Moisture 0 1 0.09 0.04 0 0.10 0.11 0.12 0.28 ▃▇▆▁▁
Data.Scores.Total 0 1 81.97 3.86 0 81.08 82.50 83.58 90.58 ▁▁▁▁▇