library(tidyverse)
library(skimr)
library(dplyr)
Predicting NFL Team Performance with ELO Rating, Home-Field Advantage, and QB Rating
Proposal
Data 1
Introduction and data
- Identify the source of the data.
https://github.com/fivethirtyeight/data/tree/master/nfl-elo
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
For several years now, FiveThirtyEight has been quantifying professional sports teams’ current and historical strength, mostly using Elo rating systems. Their football ratings date back to 1920. This data is collected following every NFL game, regular and postseason.
- Write a brief description of the observations.
Each observation is a single NFL game played since 1920. The columns record the date, team names, scores, quarterbacks for both teams, and each team’s Elo rating (an overall strength rating) before and after the game. Because the Elo rating system adjusts for the strength of each opponent, FiveThirtyEight takes strength of schedule into account when calculating ratings, which gives a more accurate picture of a team’s skill level before and after every game in its history.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- How have individual team performances (ratings) fluctuated over time? Which NFL teams have been the most or least successful throughout NFL history?
- How have scores changed over time in the NFL? Are games higher scoring now than they were in 1920? Which eras saw the highest or lowest scoring games? What can explain this?
- Is home field advantage really that effective? How has one team in particular fared at home vs away? Does their margin of victory/defeat differ?
These research questions are important for various purposes such as sports betting, personal interest, or rules changes.
- A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic is NFL games. For readers who are not familiar with it, the NFL is the American football league founded in 1920. It comprises 32 teams across the country that each play 17 regular season games, and the season concludes with a single-elimination playoff bracket. This dataset uses the Elo rating system. Elo is a simple measure of team strength and performance that is calculated from game-by-game results, so it describes a team’s rating before and after every game it has played. For my first research question, I hypothesize that teams such as the Browns have had the least success throughout their history and that their Elo rating has remained quite low. For the others, I hypothesize that games have become higher scoring over time and that playing on one’s home field is very important in determining the outcome of a given game.
- Identify the types of variables in your research question. Categorical? Quantitative?
Most of the variables in this research question are quantitative. The teams are categorical. Points scored are quantitative and discrete, Elo ratings are quantitative and continuous, and date is a temporal variable.
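To make the Elo description above concrete, here is a minimal sketch of the textbook Elo calculation in base R. It is a generic version, not necessarily the exact formula FiveThirtyEight uses: the function names, the 400-point scale, and the step size `k = 20` are all assumptions made for illustration.

```r
# Generic Elo win probability for team A against team B.
# The 400-point scale is the conventional default, assumed here.
elo_win_prob <- function(elo_a, elo_b, scale = 400) {
  1 / (1 + 10^(-(elo_a - elo_b) / scale))
}

# Generic Elo update after one game.
# actual is 1 for a win, 0.5 for a tie, 0 for a loss; k = 20 is an assumed step size.
elo_update <- function(elo, expected, actual, k = 20) {
  elo + k * (actual - expected)
}

p_equal <- elo_win_prob(1500, 1500)                  # evenly matched teams
p_fav <- elo_win_prob(1600, 1500)                    # 100-point favorite
new_rating <- elo_update(1500, p_equal, actual = 1)  # rating after a win
```

A team that beats a stronger opponent gains more points than one that beats a weaker opponent, which is how a rating of this kind reflects strength of schedule.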
Glimpse of data
# add code here
nfl <- read_csv("data/nfl_elo.csv")
Rows: 17379 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): playoff, team1, team2, qb1, qb2
dbl (27): season, neutral, elo1_pre, elo2_pre, elo_prob1, elo_prob2, elo1_p...
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(nfl)
Name | nfl |
Number of rows | 17379 |
Number of columns | 33 |
_______________________ | |
Column type frequency: | |
character | 5 |
Date | 1 |
numeric | 27 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
playoff | 16763 | 0.04 | 1 | 1 | 0 | 4 | 0 |
team1 | 0 | 1.00 | 2 | 3 | 0 | 101 | 0 |
team2 | 0 | 1.00 | 2 | 3 | 0 | 108 | 0 |
qb1 | 2162 | 0.88 | 7 | 18 | 0 | 648 | 0 |
qb2 | 2162 | 0.88 | 7 | 18 | 0 | 670 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 1920-09-26 | 2023-02-12 | 1989-11-19 | 3622 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
season | 0 | 1.00 | 1984.86 | 26.32 | 1920.00 | 1968.00 | 1989.00 | 2006.00 | 2022.00 | ▂▂▆▇▇ |
neutral | 0 | 1.00 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
elo1_pre | 0 | 1.00 | 1503.13 | 104.83 | 1119.60 | 1430.26 | 1504.84 | 1578.27 | 1839.66 | ▁▃▇▅▁ |
elo2_pre | 0 | 1.00 | 1499.47 | 104.15 | 1156.55 | 1426.66 | 1500.86 | 1576.08 | 1849.48 | ▁▅▇▅▁ |
elo_prob1 | 0 | 1.00 | 0.58 | 0.17 | 0.07 | 0.46 | 0.60 | 0.72 | 0.97 | ▁▅▇▇▃ |
elo_prob2 | 0 | 1.00 | 0.42 | 0.17 | 0.03 | 0.28 | 0.40 | 0.54 | 0.93 | ▃▇▇▅▁ |
elo1_post | 0 | 1.00 | 1502.84 | 107.36 | 1119.60 | 1427.69 | 1504.59 | 1580.89 | 1849.48 | ▁▃▇▅▁ |
elo2_post | 0 | 1.00 | 1499.76 | 106.34 | 1153.90 | 1425.08 | 1500.92 | 1577.10 | 1831.46 | ▁▅▇▅▁ |
qbelo1_pre | 2162 | 0.88 | 1504.29 | 99.90 | 1149.70 | 1434.58 | 1505.76 | 1574.54 | 1806.39 | ▁▃▇▆▁ |
qbelo2_pre | 2162 | 0.88 | 1502.86 | 98.55 | 1152.47 | 1434.79 | 1504.38 | 1574.63 | 1814.37 | ▁▃▇▅▁ |
qb1_value_pre | 2162 | 0.88 | 97.06 | 58.78 | -53.78 | 54.06 | 91.48 | 133.56 | 329.56 | ▂▇▆▂▁ |
qb2_value_pre | 2162 | 0.88 | 96.97 | 58.20 | -47.29 | 54.89 | 91.51 | 132.80 | 327.72 | ▂▇▆▂▁ |
qb1_adj | 2162 | 0.88 | -2.06 | 26.84 | -242.49 | -9.28 | 1.76 | 12.25 | 119.69 | ▁▁▁▇▁ |
qb2_adj | 2162 | 0.88 | -2.09 | 27.46 | -235.05 | -9.22 | 1.92 | 12.21 | 116.80 | ▁▁▁▇▁ |
qbelo_prob1 | 2162 | 0.88 | 0.58 | 0.18 | 0.06 | 0.45 | 0.59 | 0.71 | 0.97 | ▁▅▇▇▃ |
qbelo_prob2 | 2162 | 0.88 | 0.42 | 0.18 | 0.03 | 0.29 | 0.41 | 0.55 | 0.94 | ▃▇▇▅▁ |
qb1_game_value | 2162 | 0.88 | 109.96 | 133.67 | -385.74 | 16.48 | 107.37 | 200.17 | 713.70 | ▁▅▇▂▁ |
qb2_game_value | 2162 | 0.88 | 89.17 | 132.09 | -413.97 | -4.75 | 84.97 | 177.93 | 605.10 | ▁▃▇▃▁ |
qb1_value_post | 2162 | 0.88 | 98.35 | 59.03 | -46.33 | 55.10 | 92.54 | 134.82 | 327.72 | ▂▇▆▂▁ |
qb2_value_post | 2162 | 0.88 | 96.19 | 58.39 | -53.78 | 53.42 | 90.80 | 132.72 | 329.56 | ▂▇▆▂▁ |
qbelo1_post | 2162 | 0.88 | 1504.27 | 102.32 | 1164.33 | 1432.90 | 1505.08 | 1578.37 | 1814.37 | ▁▃▇▅▁ |
qbelo2_post | 2162 | 0.88 | 1502.88 | 100.79 | 1149.70 | 1432.49 | 1504.98 | 1574.87 | 1806.22 | ▁▃▇▆▁ |
score1 | 0 | 1.00 | 21.68 | 11.22 | 0.00 | 14.00 | 21.00 | 28.00 | 72.00 | ▅▇▃▁▁ |
score2 | 0 | 1.00 | 18.83 | 10.79 | 0.00 | 10.00 | 18.00 | 26.00 | 73.00 | ▇▇▃▁▁ |
quality | 2162 | 0.88 | 47.90 | 29.33 | 0.00 | 21.00 | 48.00 | 73.00 | 100.00 | ▇▆▇▆▆ |
importance | 16810 | 0.03 | 49.70 | 31.18 | 0.00 | 20.00 | 51.00 | 76.00 | 100.00 | ▇▅▆▆▇ |
total_rating | 16810 | 0.03 | 48.01 | 26.40 | 0.00 | 26.00 | 48.00 | 68.00 | 100.00 | ▆▆▇▇▃ |
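The scoring and home-field questions could be approached along the following lines. This is a sketch on a small made-up tibble that reuses the column names `season`, `neutral`, `score1`, and `score2` from the summary above. It assumes that `team1` is the home team, which the proposal does not state, so that should be checked against the data documentation before using the result.

```r
library(tidyverse)

# Toy rows with the same column names as nfl_elo.csv (values are made up).
nfl_toy <- tibble(
  season  = c(1920, 1920, 2022, 2022),
  neutral = c(0, 0, 0, 1),
  score1  = c(7, 0, 27, 38),
  score2  = c(0, 3, 24, 35)
)

# Average combined points per game in each season.
scoring_by_season <- nfl_toy %>%
  mutate(total_points = score1 + score2) %>%
  group_by(season) %>%
  summarise(mean_total = mean(total_points), .groups = "drop")

# Average margin for team1 in games not played at a neutral site
# (assumes team1 is the home team).
home_margin <- nfl_toy %>%
  filter(neutral == 0) %>%
  summarise(mean_margin = mean(score1 - score2))
```

On the full data, the same pipeline would run on `nfl` instead of `nfl_toy`, and `scoring_by_season` could be plotted against `season` to compare eras.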
Data 2
Introduction and data
Identify the source of the data:
The dataset was obtained from NYC Open Data:
https://data.cityofnewyork.us/Business/Consumer-Services-Mediated-Complaints/nre2-6m2s
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
This project examines consumer complaint data handled by the DCA Consumer Services Division, with a focus on businesses operating mainly in New York City between 2013 and 2023.
Write a brief description of the observations.
The original dataset comprises 23,716 consumer complaints that were resolved between 2013 and 2023. Each complaint is described by 17 columns: Business Name, Industry, Complaint Type, Mediation Start Date, Mediation Close Date, Complaint Result, Satisfaction, Restitution, Business Building, Business Street, Building Address Unit, Business City, Business State, Business Zip, Complainant Zip, Longitude, and Latitude.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
1- What do the distributions of customer complaints and complaint types look like at both the national and regional levels?
2- How does the number of complaints vary over time?
3- Are there possible connections between approved restitutions and satisfaction/dissatisfaction?
A description of the research topic along with a concise statement of your hypotheses on this topic.
This is an analysis of consumer complaints against businesses based in New York between 2013 and 2023. It aims to describe the relationship between time and the number of complaints, as well as how this number varies across industries.
We expect certain industries to have higher rates of complaints than others.
Identify the types of variables in your research question. Categorical? Quantitative?
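The variables are both categorical and quantitative. Industry, Complaint Type, and Satisfaction are categorical, Restitution is quantitative, and the mediation dates are temporal.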
Glimpse of data
business_complaints <- read_csv("data/Consumer_Services_Mediated_Complaints.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 23716 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): Business Name, Industry, Complaint Type, Mediation Start Date, Med...
dbl (3): Restitution, Longitude, Latitude
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(business_complaints)
Name | business_complaints |
Number of rows | 23716 |
Number of columns | 17 |
_______________________ | |
Column type frequency: | |
character | 14 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Business Name | 719 | 0.97 | 2 | 88 | 0 | 12466 | 0 |
Industry | 0 | 1.00 | 5 | 48 | 0 | 83 | 0 |
Complaint Type | 1864 | 0.92 | 4 | 39 | 0 | 71 | 0 |
Mediation Start Date | 1 | 1.00 | 10 | 10 | 0 | 1722 | 0 |
Mediation Close Date | 0 | 1.00 | 10 | 10 | 0 | 2290 | 0 |
Complaint Result | 0 | 1.00 | 10 | 50 | 0 | 26 | 0 |
Satisfaction | 3949 | 0.83 | 2 | 50 | 0 | 3 | 0 |
Business Building | 1670 | 0.93 | 1 | 15 | 0 | 4852 | 0 |
Business Street | 2047 | 0.91 | 1 | 29 | 0 | 3338 | 0 |
Building Address Unit | 19537 | 0.18 | 1 | 10 | 0 | 922 | 0 |
Business City | 1609 | 0.93 | 4 | 20 | 0 | 837 | 0 |
Business State | 1608 | 0.93 | 2 | 17 | 0 | 47 | 0 |
Business Zip | 1608 | 0.93 | 2 | 5 | 0 | 1277 | 0 |
Complainant Zip | 1904 | 0.92 | 2 | 5 | 0 | 1816 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Restitution | 1 | 1.00 | 417.09 | 2742.08 | 0.00 | 0.00 | 0.00 | 50.00 | 132616.00 | ▇▁▁▁▁ |
Longitude | 10241 | 0.57 | -73.92 | 0.08 | -74.25 | -73.98 | -73.93 | -73.87 | -73.70 | ▁▁▇▆▂ |
Latitude | 10241 | 0.57 | 40.73 | 0.08 | 40.50 | 40.67 | 40.73 | 40.77 | 40.91 | ▁▅▇▇▃ |
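Because the summary above shows the mediation dates stored as 10-character text, answering the question about complaints over time would first require parsing them. The sketch below does this on a made-up tibble. The month/day/year format and the industry labels are assumptions for illustration, so the format should be confirmed against the raw file.

```r
library(tidyverse)
library(lubridate)

# Toy rows; the month/day/year text format is an assumption about the raw file.
complaints_toy <- tibble(
  `Mediation Start Date` = c("01/15/2019", "03/02/2019", "07/30/2020"),
  Industry = c("Industry A", "Industry B", "Industry A")  # hypothetical labels
)

# Parse the text dates, then count complaints per calendar year.
complaints_by_year <- complaints_toy %>%
  mutate(start_date = mdy(`Mediation Start Date`)) %>%
  count(year = year(start_date), name = "n_complaints")
```

Adding `Industry` to the `count()` call would give the per-industry breakdown that the hypothesis refers to.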
Data 3
Introduction and data
Identify the source of the data.
https://www.gutenberg.org/ebooks/search/?sort_order=downloads
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Project Gutenberg is a digital library that offers over 60,000 free e-books, mostly in the public domain, which means that their copyrights have expired. The project was launched in 1971 by Michael S. Hart and it aims to encourage the creation and distribution of e-books in different formats for a variety of devices.
Write a brief description of the observations.
Since its launch in 1971, Project Gutenberg has kept track of the number of downloads for each e-book, which indicates the book’s popularity. The dataset includes: book title, link, author, author age, number of downloads, date published, and book popularity.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
What is the relationship between book popularity and number of downloads? Do the author’s age when publishing the book and the number of books published per author influence the popularity of books?
A description of the research topic along with a concise statement of your hypotheses on this topic.
It is hypothesized that there is a positive relationship between book popularity and the number of downloads. The more popular a book is, the higher the number of downloads it will receive. Furthermore, it is possible that books published by older authors with a more significant number of books published may be more popular than those published by younger authors or authors with fewer published works.
Identify the types of variables in your research question. Categorical? Quantitative?
Categorical and quantitative
Glimpse of data
library_data <- read_csv("data/project-gutenberg-data.csv")
Rows: 100 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Title, Title_link, Thumbnail, subtitle, extra, Headline, Field1, F...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(library_data)
Name | library_data |
Number of rows | 100 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 10 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Title | 9 | 0.91 | 4 | 88 | 0 | 91 | 0 |
Title_link | 9 | 0.91 | 35 | 64 | 0 | 91 | 0 |
Thumbnail | 11 | 0.89 | 60 | 66 | 0 | 89 | 0 |
subtitle | 12 | 0.88 | 5 | 33 | 0 | 65 | 0 |
extra | 11 | 0.89 | 14 | 16 | 0 | 89 | 0 |
Headline | 11 | 0.89 | 18 | 83 | 0 | 89 | 0 |
Field1 | 11 | 0.89 | 4 | 166 | 0 | 87 | 0 |
Field2 | 39 | 0.61 | 2 | 64 | 0 | 39 | 0 |
Field3 | 60 | 0.40 | 3 | 55 | 0 | 29 | 0 |
Field4 | 11 | 0.89 | 9 | 55 | 0 | 67 | 0 |
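All ten columns above were read as character, so any download counts would need to be converted to numbers before they can be analyzed. The sketch below shows one way to do that on made-up rows. That the counts are stored as text such as "68421 downloads" in the `extra` column is purely an assumption here, since the summary does not show the cell contents.

```r
library(tidyverse)

# Toy rows; the "<number> downloads" text in `extra` is an assumed format.
books_toy <- tibble(
  Title = c("Book A", "Book B", NA),
  extra = c("68421 downloads", "1530 downloads", NA)
)

# Drop rows without a title, then extract the numeric download count.
books_clean <- books_toy %>%
  filter(!is.na(Title)) %>%
  mutate(downloads = parse_number(extra))
```

If the counts live in a different column, only the column name passed to `parse_number()` would need to change.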
Data 4
Introduction and data
Identify the source of the data.
https://think.cs.vt.edu/corgis/csv/coffee/
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data comes from https://think.cs.vt.edu/corgis/csv/coffee/. It was created on 10/28/2022. Sam Donald compiled it from the Coffee Quality Database, courtesy of BuzzFeed data scientist James LeDoux. It is now hosted on the website CORGIS: The Collection of Really Great, Interesting, Situated Datasets.
Write a brief description of the observations.
There are 989 observations, each representing an individual coffee bean. Each coffee bean is of a specific type, from a specific region, and processed with a specific method. The dataset records information such as location, year, and species. It also contains scores for each coffee bean that evaluate its flavor, sweetness, moisture, and other attributes.
Research Question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
What combinations of location and processing method make a good coffee?
How do Arabica coffee bean scores differ across regions?
How does the coffee bean’s location and altitude influence its score?
A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic is factors influencing coffee bean scores. Coffee is a necessity in our life, especially for college students. It boosts our energy level. Seeing different types of coffee in the coffee shop, we usually face difficulty in selecting the “best” coffee according to our preferences. The dataset consists of a variety of categories to measure the score of the coffee bean, which acts as a helpful indicator in our daily life. It also records information such as types, locations, and processing methods of coffee beans. So we would like to know how these factors influence the coffee bean score. We hypothesize that location is one of the most important factors influencing the coffee bean flavor and its score.
Identify the types of variables in your research question. Categorical? Quantitative?
The independent variables include both categorical data (country, species, processing method, etc.) and quantitative data (location altitude, production amount, year, etc.). The dependent variables are scores: total scores, sweetness scores, flavor scores, etc.
Glimpse of data
coffee_data <- read_csv("data/coffee.csv")
Rows: 989 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Location.Country, Location.Region, Data.Owner, Data.Type.Species, ...
dbl (16): Location.Altitude.Min, Location.Altitude.Max, Location.Altitude.Av...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(coffee_data)
Name | coffee_data |
Number of rows | 989 |
Number of columns | 23 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 16 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Location.Country | 0 | 1 | 4 | 28 | 0 | 32 | 0 |
Location.Region | 0 | 1 | 3 | 76 | 0 | 278 | 0 |
Data.Owner | 0 | 1 | 3 | 50 | 0 | 263 | 0 |
Data.Type.Species | 0 | 1 | 7 | 7 | 0 | 2 | 0 |
Data.Type.Variety | 0 | 1 | 3 | 21 | 0 | 28 | 0 |
Data.Type.Processing method | 0 | 1 | 3 | 25 | 0 | 6 | 0 |
Data.Color | 0 | 1 | 4 | 12 | 0 | 5 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Location.Altitude.Min | 0 | 1 | 1640.08 | 9192.52 | 0 | 905.00 | 1300.00 | 1550.00 | 190164.00 | ▇▁▁▁▁ |
Location.Altitude.Max | 0 | 1 | 1675.93 | 9191.96 | 0 | 950.00 | 1310.00 | 1600.00 | 190164.00 | ▇▁▁▁▁ |
Location.Altitude.Average | 0 | 1 | 1658.00 | 9192.06 | 0 | 950.00 | 1300.00 | 1600.00 | 190164.00 | ▇▁▁▁▁ |
Year | 0 | 1 | 2013.55 | 1.66 | 2010 | 2012.00 | 2013.00 | 2015.00 | 2018.00 | ▁▇▃▃▁ |
Data.Production.Number of bags | 0 | 1 | 151.76 | 125.67 | 1 | 15.00 | 170.00 | 275.00 | 600.00 | ▇▁▇▁▁ |
Data.Production.Bag weight | 0 | 1 | 210.49 | 1666.71 | 0 | 1.00 | 60.00 | 69.00 | 19200.00 | ▇▁▁▁▁ |
Data.Scores.Aroma | 0 | 1 | 7.57 | 0.40 | 0 | 7.42 | 7.58 | 7.75 | 8.75 | ▁▁▁▁▇ |
Data.Scores.Flavor | 0 | 1 | 7.52 | 0.42 | 0 | 7.33 | 7.50 | 7.75 | 8.83 | ▁▁▁▁▇ |
Data.Scores.Aftertaste | 0 | 1 | 7.39 | 0.43 | 0 | 7.25 | 7.42 | 7.58 | 8.67 | ▁▁▁▁▇ |
Data.Scores.Acidity | 0 | 1 | 7.54 | 0.40 | 0 | 7.33 | 7.58 | 7.75 | 8.75 | ▁▁▁▁▇ |
Data.Scores.Body | 0 | 1 | 7.51 | 0.39 | 0 | 7.33 | 7.50 | 7.67 | 8.50 | ▁▁▁▁▇ |
Data.Scores.Balance | 0 | 1 | 7.50 | 0.43 | 0 | 7.33 | 7.50 | 7.75 | 8.58 | ▁▁▁▁▇ |
Data.Scores.Uniformity | 0 | 1 | 9.82 | 0.59 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
Data.Scores.Sweetness | 0 | 1 | 9.83 | 0.69 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
Data.Scores.Moisture | 0 | 1 | 0.09 | 0.04 | 0 | 0.10 | 0.11 | 0.12 | 0.28 | ▃▇▆▁▁ |
Data.Scores.Total | 0 | 1 | 81.97 | 3.86 | 0 | 81.08 | 82.50 | 83.58 | 90.58 | ▁▁▁▁▇ |
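The summary above shows a maximum altitude of 190164 and minimum scores of 0, which look like data-entry problems that could distort any comparison of scores by location. The sketch below filters such rows on a made-up tibble before averaging the total score by country. The cutoff `max_plausible_altitude` and the assumption that altitude is recorded in metres are hypothetical choices for illustration.

```r
library(tidyverse)

# Toy rows with the same column names as coffee.csv (values are made up).
coffee_toy <- tibble(
  Location.Country = c("Ethiopia", "Ethiopia", "Brazil", "Brazil"),
  Location.Altitude.Average = c(1900, 190164, 1100, 950),
  Data.Scores.Total = c(88.5, 84.0, 0, 82.0)
)

# Hypothetical cutoff, assuming altitude is recorded in metres.
max_plausible_altitude <- 6000

# Drop rows with an implausible altitude or a zero total score.
coffee_clean <- coffee_toy %>%
  filter(
    Location.Altitude.Average <= max_plausible_altitude,
    Data.Scores.Total > 0
  )

# Mean total score and number of beans per country.
score_by_country <- coffee_clean %>%
  group_by(Location.Country) %>%
  summarise(mean_total = mean(Data.Scores.Total), n = n(), .groups = "drop")
```

Grouping by both country and processing method in the same way would address the first research question.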