library(tidyverse)
library(skimr)
NFL Injury Analytics and NBA Project Proposal
Proposal
Data 1 - Injury Record
Introduction and data
Identify the source of the data.
data/InjuryRecord.csv
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
01/20/2020 - The National Football League (NFL)
Write a brief description of the observations.
This dataset records the body part that was injured, how long each player was off the field because of that injury, and the type of surface the injury occurred on. Each record identifies the player by a five-digit number and the game in which the injury occurred.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
What surface is correlated with the injuries that keep players out of games the longest?
A description of the research topic along with a concise statement of your hypotheses on this topic.
Statement on why this question is important: This question is important because it can help the NFL design better fields that keep its players safer from injury.
The dataset has a variable for the type of surface the game was played on. Several other variables (those starting with “DM”) describe how long the injured player was kept out of games. The research question will compare the number of long-term injuries that occurred on each type of surface to help determine which surface is associated with more missed days. We hypothesize that synthetic fields will have the greater number of long-term injuries.
Identify the types of variables in your research question. Categorical? Quantitative?
The surface type is categorical, and the number of days missed is quantitative.
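The comparison described above could be sketched as follows. This is a minimal sketch on made-up rows, not the real data. The column names `Surface`, `DM_M28`, and `DM_M42` come from the skim output; the toy data frame, the values `"Natural"` and `"Synthetic"`, and the reading of each `DM_` column as a 0/1 flag for missing at least that many days are assumptions.

```r
# Toy stand-in for data/InjuryRecord.csv (all values invented for illustration).
injury_toy <- data.frame(
  Surface = c("Natural", "Natural", "Natural",
              "Synthetic", "Synthetic", "Synthetic"),
  DM_M28  = c(0, 1, 0, 1, 1, 0),  # assumed: 1 = missed 28 or more days
  DM_M42  = c(0, 1, 0, 1, 0, 0)   # assumed: 1 = missed 42 or more days
)

# The mean of a 0/1 flag is the share of rows where the flag is 1,
# so this gives the share of long-term injuries on each surface.
long_term_share <- aggregate(
  cbind(DM_M28, DM_M42) ~ Surface,
  data = injury_toy,
  FUN = mean
)

print(long_term_share)
```

Comparing shares rather than raw counts keeps the comparison fair if one surface simply hosts more games than the other.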
Glimpse of data
#add code here
injury <- read.csv("data/InjuryRecord.csv")
skim(injury)
Name | injury |
Number of rows | 105 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
GameID | 0 | 1 | 7 | 8 | 0 | 104 | 0 |
PlayKey | 0 | 1 | 0 | 11 | 28 | 77 | 0 |
BodyPart | 0 | 1 | 4 | 5 | 0 | 5 | 0 |
Surface | 0 | 1 | 7 | 9 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
PlayerKey | 0 | 1 | 42283.61 | 4163.51 | 31070 | 39656 | 43518 | 45966 | 47813 | ▁▂▃▅▇ |
DM_M1 | 0 | 1 | 1.00 | 0.00 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
DM_M7 | 0 | 1 | 0.72 | 0.45 | 0 | 0 | 1 | 1 | 1 | ▃▁▁▁▇ |
DM_M28 | 0 | 1 | 0.35 | 0.48 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▅ |
DM_M42 | 0 | 1 | 0.28 | 0.45 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▃ |
Data 2 - Play List
Introduction and data
Identify the source of the data.
data/PlayList.csv
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
01/20/2020 - The National Football League (NFL)
Write a brief description of the observations.
This dataset includes the conditions of the field when the injury occurred. It includes the stadium and field type along with the temperature and weather conditions. It identifies what play was being run when the injury happened, along with that player’s position. Each record identifies the player by a five-digit number and the game in which the injury occurred.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Is there a correlation between the type of NFL playing surface, an injured player’s position, and the type of injury a player suffers?
- Do the type of playing surface and a player’s position affect the length of recovery from a player’s injury?
A description of the research topic along with a concise statement of your hypotheses on this topic.
Statement on why this question is important: These questions examine which positions are most susceptible to injury and whether the different types of playing surfaces the NFL uses affect the type of injuries players suffer. There have been discussions among both NFL players and fans about whether the NFL’s use of synthetic turf fields leads to player injuries. Also, because NFL positions require different body types and play styles, it is important to research whether certain positions injure certain body parts more frequently than other positions do.
Because of the ongoing discourse on the safety of synthetic turf fields and the role they play in injuries, our hypothesis for the first research question is that players suffer more knee injuries on synthetic playing surfaces than on natural surfaces, and that these injuries take longer to recover from. Our hypothesis for the second research question is that linemen suffer more injuries than players at other positions.
Identify the types of variables in your research question. Categorical? Quantitative?
A player’s position, injured body part, time missed, and the field type are categorical variables.
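Answering these questions means combining the injury records with the play list, since position lives in one file and body part in the other. A minimal sketch of that join, on invented rows: the column names `PlayerKey`, `GameID`, `BodyPart`, `Surface`, and `RosterPosition` come from the skim output, while the toy values and the assumption that a player's roster position is constant within a game are ours.

```r
# Toy stand-ins for the two files (all values invented for illustration).
injury_toy <- data.frame(
  PlayerKey = c(1, 2, 3),
  GameID    = c("1-1", "2-4", "3-2"),
  BodyPart  = c("Knee", "Ankle", "Knee"),
  Surface   = c("Synthetic", "Natural", "Synthetic")
)
playlist_toy <- data.frame(
  PlayerKey      = c(1, 1, 2, 3, 3),
  GameID         = c("1-1", "1-1", "2-4", "3-2", "3-2"),
  RosterPosition = c("Linebacker", "Linebacker", "Safety",
                     "Offensive Lineman", "Offensive Lineman")
)

# The play list has one row per play, so reduce it to one row
# per player and game before joining (assumes position is constant in a game).
positions <- unique(playlist_toy[, c("PlayerKey", "GameID", "RosterPosition")])

# Attach each injured player's roster position to the injury record.
injury_pos <- merge(injury_toy, positions, by = c("PlayerKey", "GameID"))

# Cross-tabulate body part against surface, and position against body part.
print(table(injury_pos$BodyPart, injury_pos$Surface))
print(table(injury_pos$RosterPosition, injury_pos$BodyPart))
```

Deduplicating before the merge matters: joining directly against the per-play rows would repeat each injury once per play and inflate the counts.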
Glimpse of data
#add code here
playlist <- read.csv("data/PlayList.csv")
skim(playlist)
Name | playlist |
Number of rows | 267005 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 9 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
GameID | 0 | 1 | 7 | 8 | 0 | 5712 | 0 |
PlayKey | 0 | 1 | 9 | 12 | 0 | 267005 | 0 |
RosterPosition | 0 | 1 | 6 | 17 | 0 | 10 | 0 |
StadiumType | 0 | 1 | 0 | 22 | 16910 | 30 | 0 |
FieldType | 0 | 1 | 7 | 9 | 0 | 2 | 0 |
Weather | 0 | 1 | 0 | 80 | 18691 | 64 | 0 |
PlayType | 0 | 1 | 0 | 20 | 367 | 12 | 0 |
Position | 0 | 1 | 1 | 12 | 0 | 23 | 0 |
PositionGroup | 0 | 1 | 2 | 12 | 0 | 10 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
PlayerKey | 0 | 1 | 41515.38 | 4125.86 | 26624 | 39653 | 42432 | 44480 | 47888 | ▁▁▃▇▇ |
PlayerDay | 0 | 1 | 210.45 | 183.64 | -62 | 43 | 102 | 400 | 480 | ▆▆▁▂▇ |
PlayerGame | 0 | 1 | 13.80 | 8.34 | 1 | 7 | 13 | 20 | 32 | ▇▇▆▅▃ |
Temperature | 0 | 1 | -35.03 | 304.58 | -999 | 44 | 61 | 72 | 97 | ▁▁▁▁▇ |
PlayerGamePlay | 0 | 1 | 29.06 | 19.63 | 1 | 13 | 26 | 43 | 102 | ▇▆▃▁▁ |
Data 3 - NBA Player Data
Introduction and data
Identify the source of the data.
data/nba2021stats.csv
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
This data was collected by basketball-reference.com in late 2021 from SportRadar, the official statistics provider of the National Basketball Association.
Write a brief description of the observations.
This dataset contains information on every player who appeared in an NBA game in the 2020-2021 NBA season. It covers 540 distinct players who each played at least one game; the skim output below shows 706 rows, so some players appear in more than one row. The observations contain different statistics on the players, such as points per game, assists per game, games played and started, steals per game, field goal percentage, team, position, and more.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Which NBA position has the best players on average, based on points, rebounds, field goal percentage and assists?
A description of the research topic along with a concise statement of your hypotheses on this topic.
Statement on why this question is important: For the entire history of the NBA, there has been vigorous debate among players, fans, and pundits about which players are the best in the league and who is the best at their position. However, there has been less discussion of which position is the best or most important in the league, which adds a new wrinkle to NBA fans’ favorite pastime. This proposal will use multiple methods, such as cumulative per-game stats for each category, to see which position has the best players on average. To make the analysis fairer, biased defensive stats such as blocks per game (which favors centers over every other position and is not a good metric for measuring defensive impact) will be excluded from the analysis. This question focuses on offensive statistics, with the exception of rebounding, which serves both an offensive and a defensive purpose.
Hypothesis: Power forwards will have the highest cumulative average stats amongst all positions for the 2020-2021 NBA season.
Point guards will have the highest assist per game average among all positions.
Shooting guards will average the most points per game among all positions, and centers will have the highest rebounding averages.
Identify the types of variables in your research question. Categorical? Quantitative?
Categorical: Player, position
Quantitative: points per game, assists per game, rebounds per game, field goal percentage, and effective field goal percentage (which weighs three-point field goals more heavily than two-point field goals).
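A first pass at comparing positions could average each statistic within position. The sketch below does this on invented rows; the column names `Pos`, `PTS`, `TRB`, `AST`, and `FG.` come from the skim output, and the toy values are assumptions. How the per-category averages are combined into a single ranking is left open here.

```r
# Toy stand-in for data/nba2021stats.csv (all values invented for illustration).
nba_toy <- data.frame(
  Pos = c("PG", "PG", "SG", "C", "C"),
  PTS = c(20, 10, 18, 12, 8),
  TRB = c(4, 2, 3, 10, 8),
  AST = c(8, 6, 3, 2, 1),
  FG. = c(0.45, 0.40, 0.44, 0.55, 0.60)
)

# Average each statistic within position.
# Note: the formula interface drops rows with a missing value in any listed
# column by default, which matters for the real data (FG. has missing values).
pos_means <- aggregate(
  cbind(PTS, TRB, AST, FG.) ~ Pos,
  data = nba_toy,
  FUN = mean
)

print(pos_means)
```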
Glimpse of data
# add code here
nba <- read.csv("data/nba2021stats.csv")
skim(nba)
Name | nba |
Number of rows | 706 |
Number of columns | 30 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 26 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Player | 0 | 1 | 0 | 24 | 1 | 541 | 0 |
Pos | 0 | 1 | 0 | 5 | 1 | 13 | 0 |
Tm | 0 | 1 | 0 | 3 | 1 | 32 | 0 |
Player.additional | 0 | 1 | 5 | 9 | 0 | 541 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Age | 1 | 1.00 | 25.87 | 4.09 | 19.0 | 23.00 | 25.00 | 28.00 | 40.0 | ▇▇▅▂▁ |
G | 1 | 1.00 | 37.37 | 21.27 | 1.0 | 19.00 | 37.00 | 57.00 | 72.0 | ▆▆▆▆▇ |
GS | 1 | 1.00 | 16.94 | 21.60 | 0.0 | 0.00 | 5.00 | 29.00 | 72.0 | ▇▁▁▁▁ |
MP | 1 | 1.00 | 19.44 | 9.16 | 1.8 | 12.50 | 19.30 | 26.90 | 37.6 | ▅▆▇▆▅ |
FG | 1 | 1.00 | 3.17 | 2.28 | 0.0 | 1.40 | 2.60 | 4.30 | 11.2 | ▇▇▃▁▁ |
FGA | 1 | 1.00 | 6.94 | 4.72 | 0.0 | 3.50 | 5.90 | 9.30 | 23.0 | ▇▇▃▂▁ |
FG. | 3 | 1.00 | 0.44 | 0.11 | 0.0 | 0.40 | 0.44 | 0.50 | 1.0 | ▁▃▇▁▁ |
X3P | 1 | 1.00 | 0.96 | 0.88 | 0.0 | 0.20 | 0.70 | 1.50 | 5.3 | ▇▃▁▁▁ |
X3PA | 1 | 1.00 | 2.71 | 2.23 | 0.0 | 0.90 | 2.20 | 4.10 | 12.7 | ▇▃▂▁▁ |
X3P. | 36 | 0.95 | 0.31 | 0.13 | 0.0 | 0.27 | 0.34 | 0.39 | 1.0 | ▂▇▂▁▁ |
X2P | 1 | 1.00 | 2.21 | 1.81 | 0.0 | 0.90 | 1.70 | 3.10 | 10.2 | ▇▃▂▁▁ |
X2PA | 1 | 1.00 | 4.24 | 3.32 | 0.0 | 1.70 | 3.40 | 5.80 | 16.8 | ▇▅▂▁▁ |
X2P. | 7 | 0.99 | 0.51 | 0.13 | 0.0 | 0.46 | 0.51 | 0.57 | 1.0 | ▁▁▇▂▁ |
eFG. | 3 | 1.00 | 0.51 | 0.11 | 0.0 | 0.48 | 0.52 | 0.56 | 1.0 | ▁▁▇▁▁ |
FT | 1 | 1.00 | 1.33 | 1.29 | 0.0 | 0.50 | 0.90 | 1.80 | 9.2 | ▇▂▁▁▁ |
FTA | 1 | 1.00 | 1.72 | 1.58 | 0.0 | 0.60 | 1.30 | 2.20 | 10.7 | ▇▂▁▁▁ |
FT. | 30 | 0.96 | 0.75 | 0.15 | 0.0 | 0.68 | 0.78 | 0.84 | 1.0 | ▁▁▂▇▇ |
ORB | 1 | 1.00 | 0.81 | 0.73 | 0.0 | 0.30 | 0.60 | 1.00 | 4.7 | ▇▂▁▁▁ |
DRB | 1 | 1.00 | 2.77 | 1.82 | 0.0 | 1.50 | 2.50 | 3.70 | 10.1 | ▇▇▃▁▁ |
TRB | 1 | 1.00 | 3.58 | 2.38 | 0.0 | 1.90 | 3.10 | 4.80 | 14.3 | ▇▇▂▁▁ |
AST | 1 | 1.00 | 1.93 | 1.81 | 0.0 | 0.70 | 1.40 | 2.50 | 11.7 | ▇▂▁▁▁ |
STL | 1 | 1.00 | 0.61 | 0.39 | 0.0 | 0.30 | 0.60 | 0.90 | 2.1 | ▇▇▅▁▁ |
BLK | 1 | 1.00 | 0.42 | 0.41 | 0.0 | 0.10 | 0.30 | 0.60 | 3.4 | ▇▂▁▁▁ |
TOV | 1 | 1.00 | 1.07 | 0.81 | 0.0 | 0.50 | 0.90 | 1.40 | 5.0 | ▇▃▁▁▁ |
PF | 1 | 1.00 | 1.62 | 0.76 | 0.0 | 1.10 | 1.60 | 2.10 | 4.0 | ▃▇▇▃▁ |
PTS | 1 | 1.00 | 8.62 | 6.27 | 0.0 | 4.00 | 7.20 | 11.70 | 32.0 | ▇▆▂▁▁ |