Where does it pay to attend college?

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    The data were downloaded from FBref, which collected and organized them.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data were compiled during the 2022 FIFA World Cup in Qatar by FBref, a website devoted to tracking statistics for football teams and players from around the world.

  • Write a brief description of the observations.

    Each observation is one player who attended the World Cup. The statistics cover goalkeepers, defensive players, and attacking players, and they measure performance through metrics such as number of shots, tackles, passes, and crosses. The data are spread across several tables (defense, passing, possession, playing time, and others), each holding a wide range of statistics for every player.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    How does a player’s age and the league they play in relate to their performance in the FIFA World Cup?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    We will evaluate players based on their age at the time of the 2022 World Cup and the soccer league in which they play. These two variables will be paired with a performance variable built from several metrics (such as minutes played, expected goals, expected assisted goals, and number of progressive actions, among others). Our hypothesis is that a player performs better if he plays in one of the top five European leagues (England, Germany, Spain, Italy, or France) and is in his late twenties. This question matters because the answer could inform squad selection and recruiting decisions for each country's team.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    The variables in our research question are age (quantitative), the league the player plays in (categorical), and performance (quantitative).
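
The composite performance variable described above could be built in several ways. The following is only a sketch of one option, assuming equal weights and a z-score standardization that the proposal itself does not specify. The toy values and the helper `z` are made up for illustration; the column names mirror `minutes`, `xg`, and `xg_assist` from the player tables.

```r
library(tidyverse)

# Toy data standing in for a few columns of the player tables (values are made up).
toy <- tibble(
  player    = c("A", "B", "C", "D"),
  minutes   = c(360, 90, 270, 180),
  xg        = c(1.2, 0.1, 0.4, 0.9),
  xg_assist = c(0.5, 0.0, 0.8, 0.3)
)

# Hypothetical helper: put a metric on a common scale (mean 0, sd 1).
z <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Standardize each metric, then average the z-scores with equal weights.
scored <- toy %>%
  mutate(
    z_minutes   = z(minutes),
    z_xg        = z(xg),
    z_xg_assist = z(xg_assist),
    performance = rowMeans(cbind(z_minutes, z_xg, z_xg_assist), na.rm = TRUE)
  )

scored %>% select(player, performance) %>% arrange(desc(performance))
```

Standardizing first keeps a large-valued metric such as minutes from dominating the average. Different weights, or separate indices by position, would be equally defensible choices.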

Glimpse of data

player_defense <- read_csv("data/player_defense.csv")
Rows: 680 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): player, position, team, age
dbl (18): birth_year, minutes_90s, tackles, tackles_won, tackles_def_3rd, ta...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(player_defense)
# A tibble: 6 × 22
  player         position team  age   birth_year minutes_90s tackles tackles_won
  <chr>          <chr>    <chr> <chr>      <dbl>       <dbl>   <dbl>       <dbl>
1 Aaron Mooy     MF       Aust… 32-0…       1990         4         9           6
2 Aaron Ramsey   MF       Wales 31-3…       1990         3         2           0
3 Abdelhamid Sa… MF       Moro… 26-0…       1996         2         3           1
4 Abdelkarim Ha… DF       Qatar 29-1…       1993         3         7           3
5 Abderrazak Ha… FW       Moro… 32-0…       1990         0.8       0           0
6 Abdessamad Ez… FW       Moro… 21-0…       2001         1         3           2
# ℹ 14 more variables: tackles_def_3rd <dbl>, tackles_mid_3rd <dbl>,
#   tackles_att_3rd <dbl>, dribble_tackles <dbl>, dribbles_vs <dbl>,
#   dribble_tackles_pct <dbl>, dribbled_past <dbl>, blocks <dbl>,
#   blocked_shots <dbl>, blocked_passes <dbl>, interceptions <dbl>,
#   tackles_interceptions <dbl>, clearances <dbl>, errors <dbl>
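
The column-specification message above repeats for every file. As a hedged sketch, the files could instead be read in one call with `show_col_types = FALSE`, as the message itself suggests. The temporary folder and the two small CSVs below are stand-ins written only so the example runs on its own; in the project the folder would be `data/`.

```r
library(tidyverse)

# Stand-in files in a temporary folder (hypothetical contents, not the real data).
tmp <- tempfile("toy_data_")
dir.create(tmp)
write_csv(tibble(player = c("P1", "P2"), tackles = c(9, 2)),
          file.path(tmp, "player_defense.csv"))
write_csv(tibble(player = c("P1", "P2"), sca = c(5, 3)),
          file.path(tmp, "player_gca.csv"))

# Collect the CSV paths and name each one after its file (without the extension).
files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
names(files) <- tools::file_path_sans_ext(basename(files))

# Read every file into a named list of tibbles, without the type message.
tables <- map(files, read_csv, show_col_types = FALSE)

names(tables)
```

Each table is then available as, for example, `tables$player_defense`.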
player_gca <- read_csv("data/player_gca.csv")
Rows: 680 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): player, position, team, age
dbl (18): birth_year, minutes_90s, sca, sca_per90, sca_passes_live, sca_pass...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(player_gca)
# A tibble: 6 × 22
  player             position team  age   birth_year minutes_90s   sca sca_per90
  <chr>              <chr>    <chr> <chr>      <dbl>       <dbl> <dbl>     <dbl>
1 Aaron Mooy         MF       Aust… 32-0…       1990         4       5      1.25
2 Aaron Ramsey       MF       Wales 31-3…       1990         3       3      1.02
3 Abdelhamid Sabiri  MF       Moro… 26-0…       1996         2       4      2.4 
4 Abdelkarim Hassan  DF       Qatar 29-1…       1993         3       4      1.33
5 Abderrazak Hamdal… FW       Moro… 32-0…       1990         0.8     0      0   
6 Abdessamad Ezzalz… FW       Moro… 21-0…       2001         1       4      5.63
# ℹ 14 more variables: sca_passes_live <dbl>, sca_passes_dead <dbl>,
#   sca_dribbles <dbl>, sca_shots <dbl>, sca_fouled <dbl>, sca_defense <dbl>,
#   gca <dbl>, gca_per90 <dbl>, gca_passes_live <dbl>, gca_passes_dead <dbl>,
#   gca_dribbles <dbl>, gca_shots <dbl>, gca_fouled <dbl>, gca_defense <dbl>
player_keepers <- read_csv("data/player_keepers.csv")
Rows: 41 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): player, position, team, age, club
dbl (20): birth_year, gk_games, gk_games_starts, gk_minutes, minutes_90s, gk...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(player_keepers)
# A tibble: 6 × 25
  player          position team  age   club  birth_year gk_games gk_games_starts
  <chr>           <chr>    <chr> <chr> <chr>      <dbl>    <dbl>           <dbl>
1 Aimen Dahmen    GK       Tuni… 25-3… CS S…       1997        3               3
2 Alireza Beiran… GK       IR I… 30-0… Pers…       1992        2               2
3 Alisson         GK       Braz… 30-0… Live…       1992        4               4
4 Andries Noppert GK       Neth… 28-2… Heer…       1994        5               5
5 André Onana     GK       Came… 26-2… Inter       1996        1               1
6 Danny Ward      GK       Wales 29-1… Leic…       1993        2               1
# ℹ 17 more variables: gk_minutes <dbl>, minutes_90s <dbl>,
#   gk_goals_against <dbl>, gk_goals_against_per90 <dbl>,
#   gk_shots_on_target_against <dbl>, gk_saves <dbl>, gk_save_pct <dbl>,
#   gk_wins <dbl>, gk_ties <dbl>, gk_losses <dbl>, gk_clean_sheets <dbl>,
#   gk_clean_sheets_pct <dbl>, gk_pens_att <dbl>, gk_pens_allowed <dbl>,
#   gk_pens_saved <dbl>, gk_pens_missed <dbl>, gk_pens_save_pct <dbl>
player_keepersadv <- read_csv("data/player_keepersadv.csv")
Rows: 41 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): player, position, team, age
dbl (27): birth_year, minutes_90s, gk_goals_against, gk_pens_allowed, gk_fre...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(player_keepersadv)
# A tibble: 6 × 31
  player            position team  age   birth_year minutes_90s gk_goals_against
  <chr>             <chr>    <chr> <chr>      <dbl>       <dbl>            <dbl>
1 Aimen Dahmen      GK       Tuni… 25-3…       1997         3                  1
2 Alireza Beiranva… GK       IR I… 30-0…       1992         1.2                1
3 Alisson           GK       Braz… 30-0…       1992         4.2                2
4 Andries Noppert   GK       Neth… 28-2…       1994         5.3                4
5 André Onana       GK       Came… 26-2…       1996         0.9                1
6 Danny Ward        GK       Wales 29-1…       1993         1                  5
# ℹ 24 more variables: gk_pens_allowed <dbl>, gk_free_kick_goals_against <dbl>,
#   gk_corner_kick_goals_against <dbl>, gk_own_goals_against <dbl>,
#   gk_psxg <dbl>, gk_psnpxg_per_shot_on_target_against <dbl>,
#   gk_psxg_net <dbl>, gk_psxg_net_per90 <dbl>,
#   gk_passes_completed_launched <dbl>, gk_passes_launched <dbl>,
#   gk_passes_pct_launched <dbl>, gk_passes <dbl>, gk_passes_throws <dbl>,
#   gk_pct_passes_launched <dbl>, gk_passes_length_avg <dbl>, …
player_misc <- read_csv("data/player_misc.csv")
Rows: 680 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): player, position, team, age
dbl (18): birth_year, minutes_90s, cards_yellow, cards_red, cards_yellow_red...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(player_misc)
# A tibble: 6 × 22
  player      position team  age   birth_year minutes_90s cards_yellow cards_red
  <chr>       <chr>    <chr> <chr>      <dbl>       <dbl>        <dbl>     <dbl>
1 Aaron Mooy  MF       Aust… 32-0…       1990         4              1         0
2 Aaron Rams… MF       Wales 31-3…       1990         3              1         0
3 Abdelhamid… MF       Moro… 26-0…       1996         2              1         0
4 Abdelkarim… DF       Qatar 29-1…       1993         3              0         0
5 Abderrazak… FW       Moro… 32-0…       1990         0.8            0         0
6 Abdessamad… FW       Moro… 21-0…       2001         1              0         0
# ℹ 14 more variables: cards_yellow_red <dbl>, fouls <dbl>, fouled <dbl>,
#   offsides <dbl>, crosses <dbl>, interceptions <dbl>, tackles_won <dbl>,
#   pens_won <dbl>, pens_conceded <dbl>, own_goals <dbl>,
#   ball_recoveries <dbl>, aerials_won <dbl>, aerials_lost <dbl>,
#   aerials_won_pct <dbl>
player_passing <- read_csv("data/player_passing.csv")
Rows: 680 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): player, position, team, age
dbl (25): birth_year, minutes_90s, passes_completed, passes, passes_pct, pas...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(player_passing)
# A tibble: 6 × 29
  player     position team  age   birth_year minutes_90s passes_completed passes
  <chr>      <chr>    <chr> <chr>      <dbl>       <dbl>            <dbl>  <dbl>
1 Aaron Mooy MF       Aust… 32-0…       1990         4                170    217
2 Aaron Ram… MF       Wales 31-3…       1990         3                 88    112
3 Abdelhami… MF       Moro… 26-0…       1996         2                 45     58
4 Abdelkari… DF       Qatar 29-1…       1993         3                122    161
5 Abderraza… FW       Moro… 32-0…       1990         0.8                8     15
6 Abdessama… FW       Moro… 21-0…       2001         1                 10     13
# ℹ 21 more variables: passes_pct <dbl>, passes_total_distance <dbl>,
#   passes_progressive_distance <dbl>, passes_completed_short <dbl>,
#   passes_short <dbl>, passes_pct_short <dbl>, passes_completed_medium <dbl>,
#   passes_medium <dbl>, passes_pct_medium <dbl>, passes_completed_long <dbl>,
#   passes_long <dbl>, passes_pct_long <dbl>, assists <dbl>, xg_assist <dbl>,
#   pass_xa <dbl>, xg_assist_net <dbl>, assisted_shots <dbl>,
#   passes_into_final_third <dbl>, passes_into_penalty_area <dbl>, …
player_passing_types <- read_csv("data/player_passing_types.csv")
Rows: 680 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): player, position, team, age
dbl (17): birth_year, minutes_90s, passes, passes_live, passes_dead, passes_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(player_passing_types)
# A tibble: 6 × 21
  player          position team  age   birth_year minutes_90s passes passes_live
  <chr>           <chr>    <chr> <chr>      <dbl>       <dbl>  <dbl>       <dbl>
1 Aaron Mooy      MF       Aust… 32-0…       1990         4      217         206
2 Aaron Ramsey    MF       Wales 31-3…       1990         3      112         101
3 Abdelhamid Sab… MF       Moro… 26-0…       1996         2       58          55
4 Abdelkarim Has… DF       Qatar 29-1…       1993         3      161         148
5 Abderrazak Ham… FW       Moro… 32-0…       1990         0.8     15          14
6 Abdessamad Ezz… FW       Moro… 21-0…       2001         1       13          12
# ℹ 13 more variables: passes_dead <dbl>, passes_free_kicks <dbl>,
#   through_balls <dbl>, passes_switches <dbl>, crosses <dbl>, throw_ins <dbl>,
#   corner_kicks <dbl>, corner_kicks_in <dbl>, corner_kicks_out <dbl>,
#   corner_kicks_straight <dbl>, passes_completed <dbl>, passes_offsides <dbl>,
#   passes_blocked <dbl>
player_playingtime <- read_csv("data/player_playingtime.csv")
Rows: 829 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): player, position, team, age
dbl (23): birth_year, games, minutes, minutes_per_game, minutes_pct, minutes...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(player_playingtime)
# A tibble: 6 × 27
  player          position team  age   birth_year games minutes minutes_per_game
  <chr>           <chr>    <chr> <chr>      <dbl> <dbl>   <dbl>            <dbl>
1 Aaron Long      DF       Unit… 30-0…       1992     0      NA               NA
2 Aaron Mooy      MF       Aust… 32-0…       1990     4     360               90
3 Aaron Ramsdale  GK       Engl… 24-2…       1998     0      NA               NA
4 Aaron Ramsey    MF       Wales 31-3…       1990     3     266               89
5 Abdelhamid Sab… MF       Moro… 26-0…       1996     5     181               36
6 Abdelkarim Has… DF       Qatar 29-1…       1993     3     270               90
# ℹ 19 more variables: minutes_pct <dbl>, minutes_90s <dbl>,
#   games_starts <dbl>, minutes_per_start <dbl>, games_complete <dbl>,
#   games_subs <dbl>, minutes_per_sub <dbl>, unused_subs <dbl>,
#   points_per_game <dbl>, on_goals_for <dbl>, on_goals_against <dbl>,
#   plus_minus <dbl>, plus_minus_per90 <dbl>, plus_minus_wowy <dbl>,
#   on_xg_for <dbl>, on_xg_against <dbl>, xg_plus_minus <dbl>,
#   xg_plus_minus_per90 <dbl>, xg_plus_minus_wowy <dbl>
player_possession <- read_csv("data/player_possession.csv")
Rows: 680 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): player, position, team, age
dbl (16): birth_year, minutes_90s, touches, touches_def_pen_area, touches_de...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(player_possession)
# A tibble: 6 × 20
  player                position team      age    birth_year minutes_90s touches
  <chr>                 <chr>    <chr>     <chr>       <dbl>       <dbl>   <dbl>
1 Aaron Mooy            MF       Australia 32-094       1990         4       255
2 Aaron Ramsey          MF       Wales     31-357       1990         3       147
3 Abdelhamid Sabiri     MF       Morocco   26-020       1996         2        86
4 Abdelkarim Hassan     DF       Qatar     29-112       1993         3       193
5 Abderrazak Hamdallah  FW       Morocco   32-001       1990         0.8      28
6 Abdessamad Ezzalzouli FW       Morocco   21-001       2001         1        40
# ℹ 13 more variables: touches_def_pen_area <dbl>, touches_def_3rd <dbl>,
#   touches_mid_3rd <dbl>, touches_att_3rd <dbl>, touches_att_pen_area <dbl>,
#   touches_live_ball <dbl>, dribbles_completed <dbl>, dribbles <dbl>,
#   dribbles_completed_pct <dbl>, miscontrols <dbl>, dispossessed <dbl>,
#   passes_received <dbl>, progressive_passes_received <dbl>
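
The `age` column is read as character, with values such as 32-094, so it has to be converted before it can be used as a quantitative variable. The sketch below assumes the format is completed years followed by days since the last birthday. That is our reading of the values, not something stated in the data files. The helper `age_to_years` and the `late_twenties` cutoffs (25 up to 30) are hypothetical.

```r
library(tidyverse)

# Hypothetical helper: convert "years-days" strings to decimal years.
age_to_years <- function(age) {
  parts <- str_split_fixed(age, "-", n = 2)  # character matrix: years, days
  as.numeric(parts[, 1]) + as.numeric(parts[, 2]) / 365
}

# A few age strings copied from the output above.
ages <- tibble(age = c("32-094", "26-020", "21-001")) %>%
  mutate(
    age_years     = age_to_years(age),
    late_twenties = age_years >= 25 & age_years < 30
  )

ages
```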
player_stats <- read_csv("data/player_stats.csv")
Rows: 680 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): player, position, team, age, club
dbl (26): birth_year, games, games_starts, minutes, minutes_90s, goals, assi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(player_stats)
# A tibble: 6 × 31
  player        position team  age   club  birth_year games games_starts minutes
  <chr>         <chr>    <chr> <chr> <chr>      <dbl> <dbl>        <dbl>   <dbl>
1 Aaron Mooy    MF       Aust… 32-0… Celt…       1990     4            4     360
2 Aaron Ramsey  MF       Wales 31-3… Nice        1990     3            3     266
3 Abdelhamid S… MF       Moro… 26-0… Samp…       1996     5            2     181
4 Abdelkarim H… DF       Qatar 29-1… Al S…       1993     3            3     270
5 Abderrazak H… FW       Moro… 32-0… Al-I…       1990     4            0      68
6 Abdessamad E… FW       Moro… 21-0… Osas…       2001     3            0      93
# ℹ 22 more variables: minutes_90s <dbl>, goals <dbl>, assists <dbl>,
#   goals_pens <dbl>, pens_made <dbl>, pens_att <dbl>, cards_yellow <dbl>,
#   cards_red <dbl>, goals_per90 <dbl>, assists_per90 <dbl>,
#   goals_assists_per90 <dbl>, goals_pens_per90 <dbl>,
#   goals_assists_pens_per90 <dbl>, xg <dbl>, npxg <dbl>, xg_assist <dbl>,
#   npxg_xg_assist <dbl>, xg_per90 <dbl>, xg_assist_per90 <dbl>,
#   xg_xg_assist_per90 <dbl>, npxg_per90 <dbl>, npxg_xg_assist_per90 <dbl>
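
Because the statistics are split across tables that share identifier columns such as `player` and `team`, they need to be combined before a performance variable can be computed. The sketch below shows one way, using small made-up tibbles. It assumes `player` together with `team` identifies a row, which should be checked against the real data.

```r
library(tidyverse)

# Made-up stand-ins for player_stats and player_defense.
stats_toy <- tibble(
  player      = c("P1", "P2", "P3"),
  team        = c("X", "X", "Y"),
  club        = c("c1", "c2", "c3"),
  minutes_90s = c(4, 3, 0.8),
  goals       = c(1, 0, 2)
)
defense_toy <- tibble(
  player      = c("P1", "P2"),
  team        = c("X", "X"),
  minutes_90s = c(4, 3),
  tackles     = c(9, 2)
)

# Drop columns the tables have in common (other than the keys) so the join
# does not create duplicated minutes_90s.x / minutes_90s.y columns.
combined <- stats_toy %>%
  left_join(select(defense_toy, -minutes_90s), by = c("player", "team"))

combined
```

A left join keeps every row of the first table and fills `NA` where the second table has no match, which matters here because the tables do not all have the same number of rows (`player_playingtime` has 829, most others 680).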
skim(player_defense)
Data summary
Name player_defense
Number of rows 680
Number of columns 22
_______________________
Column type frequency:
character 4
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
player 0 1 4 26 0 680 0
position 0 1 2 2 0 4 0
team 0 1 5 14 0 32 0
age 0 1 6 6 0 634 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
birth_year 0 1.00 1994.91 4.16 1983 1992.0 1995.0 1998.00 2004.0 ▁▅▇▇▃
minutes_90s 0 1.00 2.12 1.64 0 0.8 1.9 3.00 7.7 ▇▆▂▁▁
tackles 3 1.00 3.05 3.55 0 0.0 2.0 4.00 26.0 ▇▂▁▁▁
tackles_won 0 1.00 1.76 2.33 0 0.0 1.0 3.00 17.0 ▇▁▁▁▁
tackles_def_3rd 3 1.00 1.55 2.23 0 0.0 1.0 2.00 15.0 ▇▁▁▁▁
tackles_mid_3rd 3 1.00 1.17 1.57 0 0.0 1.0 2.00 11.0 ▇▁▁▁▁
tackles_att_3rd 3 1.00 0.34 0.72 0 0.0 0.0 0.00 5.0 ▇▁▁▁▁
dribble_tackles 3 1.00 1.31 1.88 0 0.0 1.0 2.00 17.0 ▇▁▁▁▁
dribbles_vs 3 1.00 2.40 2.89 0 0.0 2.0 3.00 28.0 ▇▁▁▁▁
dribble_tackles_pct 197 0.71 51.44 37.60 0 0.0 50.0 92.85 100.0 ▇▃▆▅▇
dribbled_past 3 1.00 1.09 1.48 0 0.0 1.0 2.00 11.0 ▇▁▁▁▁
blocks 3 1.00 2.12 2.37 0 0.0 1.0 3.00 13.0 ▇▃▁▁▁
blocked_shots 3 1.00 0.54 1.05 0 0.0 0.0 1.00 7.0 ▇▁▁▁▁
blocked_passes 3 1.00 1.58 1.93 0 0.0 1.0 2.00 12.0 ▇▂▁▁▁
interceptions 0 1.00 1.57 2.08 0 0.0 1.0 2.00 14.0 ▇▂▁▁▁
tackles_interceptions 3 1.00 4.63 5.08 0 1.0 3.0 7.00 35.0 ▇▂▁▁▁
clearances 3 1.00 3.55 5.02 0 0.0 2.0 5.00 37.0 ▇▁▁▁▁
errors 3 1.00 0.06 0.27 0 0.0 0.0 0.00 3.0 ▇▁▁▁▁
skim(player_gca)
Data summary
Name player_gca
Number of rows 680
Number of columns 22
_______________________
Column type frequency:
character 4
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
player 0 1 4 26 0 680 0
position 0 1 2 2 0 4 0
team 0 1 5 14 0 32 0
age 0 1 6 6 0 634 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
birth_year 0 1.00 1994.91 4.16 1983 1992.00 1995.00 1998.00 2004.0 ▁▅▇▇▃
minutes_90s 0 1.00 2.12 1.64 0 0.80 1.90 3.00 7.7 ▇▆▂▁▁
sca 3 1.00 3.76 4.89 0 1.00 2.00 5.00 46.0 ▇▁▁▁▁
sca_per90 5 0.99 2.09 3.09 0 0.33 1.46 2.81 45.0 ▇▁▁▁▁
sca_passes_live 3 1.00 2.79 3.53 0 0.00 2.00 4.00 28.0 ▇▁▁▁▁
sca_passes_dead 3 1.00 0.32 0.98 0 0.00 0.00 0.00 12.0 ▇▁▁▁▁
sca_dribbles 3 1.00 0.17 0.57 0 0.00 0.00 0.00 7.0 ▇▁▁▁▁
sca_shots 3 1.00 0.24 0.64 0 0.00 0.00 0.00 8.0 ▇▁▁▁▁
sca_fouled 3 1.00 0.19 0.52 0 0.00 0.00 0.00 5.0 ▇▁▁▁▁
sca_defense 3 1.00 0.05 0.26 0 0.00 0.00 0.00 3.0 ▇▁▁▁▁
gca 3 1.00 0.43 0.94 0 0.00 0.00 1.00 11.0 ▇▁▁▁▁
gca_per90 5 0.99 0.24 1.18 0 0.00 0.00 0.19 22.5 ▇▁▁▁▁
gca_passes_live 3 1.00 0.32 0.74 0 0.00 0.00 0.00 7.0 ▇▁▁▁▁
gca_passes_dead 3 1.00 0.02 0.13 0 0.00 0.00 0.00 1.0 ▇▁▁▁▁
gca_dribbles 3 1.00 0.02 0.16 0 0.00 0.00 0.00 2.0 ▇▁▁▁▁
gca_shots 3 1.00 0.03 0.22 0 0.00 0.00 0.00 4.0 ▇▁▁▁▁
gca_fouled 3 1.00 0.04 0.19 0 0.00 0.00 0.00 1.0 ▇▁▁▁▁
gca_defense 3 1.00 0.00 0.04 0 0.00 0.00 0.00 1.0 ▇▁▁▁▁
skim(player_keepers)
Data summary
Name player_keepers
Number of rows 41
Number of columns 25
_______________________
Column type frequency:
character 5
numeric 20
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
player 0 1 5 22 0 41 0
position 0 1 2 2 0 1 0
team 0 1 5 14 0 32 0
age 0 1 6 6 0 41 0
club 0 1 4 15 0 40 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
birth_year 0 1.00 1991.46 3.94 1985.0 1988.0 1992.00 1994 1999.00 ▆▃▇▃▃
gk_games 0 1.00 3.20 1.60 1.0 2.0 3.00 4 7.00 ▇▇▅▂▂
gk_games_starts 0 1.00 3.12 1.69 0.0 2.0 3.00 4 7.00 ▃▂▇▁▂
gk_minutes 0 1.00 288.02 163.46 11.0 175.0 270.00 360 690.00 ▅▇▃▂▂
minutes_90s 0 1.00 3.20 1.82 0.1 1.9 3.00 4 7.70 ▅▇▃▂▂
gk_goals_against 0 1.00 4.20 2.64 0.0 2.0 4.00 6 11.00 ▇▆▆▆▁
gk_goals_against_per90 0 1.00 1.42 0.98 0.0 0.8 1.04 2 4.79 ▇▇▃▁▁
gk_shots_on_target_against 0 1.00 12.24 7.58 0.0 6.0 11.00 18 31.00 ▇▇▅▅▁
gk_saves 0 1.00 7.98 5.71 0.0 4.0 7.00 11 24.00 ▇▇▃▂▁
gk_save_pct 1 0.98 67.72 14.20 40.0 54.5 66.70 80 100.00 ▃▃▇▆▁
gk_wins 0 1.00 1.17 1.16 0.0 0.0 1.00 2 5.00 ▇▂▁▁▁
gk_ties 0 1.00 0.73 0.87 0.0 0.0 1.00 1 4.00 ▇▇▂▁▁
gk_losses 0 1.00 1.15 0.79 0.0 1.0 1.00 2 3.00 ▃▇▁▆▁
gk_clean_sheets 0 1.00 0.98 0.99 0.0 0.0 1.00 2 3.00 ▇▅▁▅▂
gk_clean_sheets_pct 1 0.98 28.75 28.87 0.0 0.0 30.95 50 100.00 ▇▃▃▁▁
gk_pens_att 0 1.00 0.56 0.84 0.0 0.0 0.00 1 4.00 ▇▅▁▁▁
gk_pens_allowed 0 1.00 0.41 0.67 0.0 0.0 0.00 1 3.00 ▇▃▁▁▁
gk_pens_saved 0 1.00 0.12 0.40 0.0 0.0 0.00 0 2.00 ▇▁▁▁▁
gk_pens_missed 0 1.00 0.02 0.16 0.0 0.0 0.00 0 1.00 ▇▁▁▁▁
gk_pens_save_pct 24 0.41 20.59 39.76 0.0 0.0 0.00 0 100.00 ▇▁▁▁▂
skim(player_keepersadv)
Data summary
Name player_keepersadv
Number of rows 41
Number of columns 31
_______________________
Column type frequency:
character 4
numeric 27
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
player 0 1 5 22 0 41 0
position 0 1 2 2 0 1 0
team 0 1 5 14 0 32 0
age 0 1 6 6 0 41 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
birth_year 0 1.00 1991.46 3.94 1985.00 1988.00 1992.00 1994.00 1999.00 ▆▃▇▃▃
minutes_90s 0 1.00 3.20 1.82 0.10 1.90 3.00 4.00 7.70 ▅▇▃▂▂
gk_goals_against 0 1.00 4.20 2.64 0.00 2.00 4.00 6.00 11.00 ▇▆▆▆▁
gk_pens_allowed 0 1.00 0.41 0.67 0.00 0.00 0.00 1.00 3.00 ▇▃▁▁▁
gk_free_kick_goals_against 0 1.00 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
gk_corner_kick_goals_against 0 1.00 0.32 0.52 0.00 0.00 0.00 1.00 2.00 ▇▁▃▁▁
gk_own_goals_against 0 1.00 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
gk_psxg 0 1.00 4.21 2.61 0.00 2.10 3.70 6.00 10.50 ▇▇▅▅▂
gk_psnpxg_per_shot_on_target_against 1 0.98 0.30 0.08 0.10 0.25 0.31 0.37 0.44 ▂▂▆▇▅
gk_psxg_net 0 1.00 0.06 1.35 -2.30 -0.60 0.00 0.80 3.50 ▅▇▆▃▂
gk_psxg_net_per90 0 1.00 -0.04 0.50 -1.49 -0.31 -0.01 0.29 0.70 ▂▁▆▇▆
gk_passes_completed_launched 0 1.00 13.24 9.97 1.00 6.00 11.00 16.00 40.00 ▇▇▂▁▂
gk_passes_launched 0 1.00 37.15 25.26 3.00 16.00 30.00 52.00 100.00 ▇▇▆▂▂
gk_passes_pct_launched 0 1.00 37.19 13.02 8.30 30.20 36.60 42.30 75.00 ▂▇▇▂▂
gk_passes 0 1.00 78.32 44.41 3.00 47.00 74.00 109.00 178.00 ▆▇▇▅▃
gk_passes_throws 0 1.00 13.93 9.17 0.00 7.00 13.00 17.00 46.00 ▆▇▂▁▁
gk_pct_passes_launched 0 1.00 35.55 14.91 7.20 25.70 36.00 47.90 66.70 ▅▆▇▆▂
gk_passes_length_avg 0 1.00 33.44 6.09 22.00 29.20 33.20 38.30 48.30 ▆▇▇▇▂
gk_goal_kicks 0 1.00 23.61 13.37 2.00 14.00 21.00 33.00 56.00 ▆▇▆▃▂
gk_pct_goal_kicks_launched 0 1.00 46.55 27.43 0.00 26.90 47.20 66.70 100.00 ▅▇▅▆▂
gk_goal_kick_length_avg 0 1.00 40.89 12.67 16.90 30.10 41.00 50.00 67.50 ▃▇▅▇▂
gk_crosses 0 1.00 40.44 23.14 0.00 22.00 38.00 55.00 99.00 ▆▇▆▃▂
gk_crosses_stopped 0 1.00 2.37 2.67 0.00 0.00 2.00 3.00 12.00 ▇▃▁▁▁
gk_crosses_stopped_pct 1 0.98 5.11 4.86 0.00 0.00 4.75 7.15 18.20 ▇▇▃▁▂
gk_def_actions_outside_pen_area 0 1.00 2.98 3.06 0.00 0.00 2.00 5.00 14.00 ▇▃▂▁▁
gk_def_actions_outside_pen_area_per90 0 1.00 0.87 0.91 0.00 0.00 0.67 1.25 4.67 ▇▅▁▁▁
gk_avg_distance_def_actions 1 0.98 13.11 3.75 4.70 11.23 12.95 15.27 22.00 ▂▅▇▅▂
skim(player_misc)
Data summary
Name player_misc
Number of rows 680
Number of columns 22
_______________________
Column type frequency:
character 4
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
player 0 1 4 26 0 680 0
position 0 1 2 2 0 4 0
team 0 1 5 14 0 32 0
age 0 1 6 6 0 634 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
birth_year 0 1.00 1994.91 4.16 1983 1992.0 1995.0 1998.0 2004.0 ▁▅▇▇▃
minutes_90s 0 1.00 2.12 1.64 0 0.8 1.9 3.0 7.7 ▇▆▂▁▁
cards_yellow 0 1.00 0.33 0.57 0 0.0 0.0 1.0 3.0 ▇▃▁▁▁
cards_red 0 1.00 0.01 0.08 0 0.0 0.0 0.0 1.0 ▇▁▁▁▁
cards_yellow_red 0 1.00 0.00 0.07 0 0.0 0.0 0.0 1.0 ▇▁▁▁▁
fouls 0 1.00 2.35 2.61 0 0.0 2.0 3.0 17.0 ▇▂▁▁▁
fouled 0 1.00 2.24 2.89 0 0.0 1.0 3.0 22.0 ▇▁▁▁▁
offsides 0 1.00 0.37 0.86 0 0.0 0.0 0.0 7.0 ▇▁▁▁▁
crosses 0 1.00 3.20 5.56 0 0.0 1.0 4.0 40.0 ▇▁▁▁▁
interceptions 0 1.00 1.57 2.08 0 0.0 1.0 2.0 14.0 ▇▂▁▁▁
tackles_won 0 1.00 1.76 2.33 0 0.0 1.0 3.0 17.0 ▇▁▁▁▁
pens_won 3 1.00 0.03 0.17 0 0.0 0.0 0.0 1.0 ▇▁▁▁▁
pens_conceded 3 1.00 0.03 0.18 0 0.0 0.0 0.0 1.0 ▇▁▁▁▁
own_goals 0 1.00 0.00 0.05 0 0.0 0.0 0.0 1.0 ▇▁▁▁▁
ball_recoveries 3 1.00 9.58 9.21 0 3.0 7.0 14.0 57.0 ▇▃▁▁▁
aerials_won 3 1.00 2.55 3.34 0 0.0 1.0 4.0 21.0 ▇▂▁▁▁
aerials_lost 3 1.00 2.55 2.90 0 0.0 2.0 4.0 17.0 ▇▂▁▁▁
aerials_won_pct 99 0.85 47.62 33.03 0 20.0 50.0 66.7 100.0 ▇▅▇▅▆
skim(player_passing)
Data summary
Name player_passing
Number of rows 680
Number of columns 29
_______________________
Column type frequency:
character 4
numeric 25
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
player 0 1 4 26 0 680 0
position 0 1 2 2 0 4 0
team 0 1 5 14 0 32 0
age 0 1 6 6 0 634 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
birth_year 0 1.00 1994.91 4.16 1983.0 1992.00 1995.00 1998.00 2004.0 ▁▅▇▇▃
minutes_90s 0 1.00 2.12 1.64 0.0 0.80 1.90 3.00 7.7 ▇▆▂▁▁
passes_completed 3 1.00 82.47 90.05 0.0 20.00 52.00 110.00 642.0 ▇▂▁▁▁
passes 3 1.00 101.63 102.48 0.0 28.00 69.00 141.00 689.0 ▇▂▁▁▁
passes_pct 6 0.99 76.86 12.69 0.0 69.20 78.55 86.00 100.0 ▁▁▂▇▇
passes_total_distance 3 1.00 1444.41 1625.59 0.0 301.00 841.00 2017.00 12636.0 ▇▂▁▁▁
passes_progressive_distance 3 1.00 484.30 597.90 0.0 70.00 255.00 676.00 3349.0 ▇▂▁▁▁
passes_completed_short 3 1.00 37.77 41.55 0.0 9.00 24.00 51.00 281.0 ▇▂▁▁▁
passes_short 3 1.00 42.24 44.89 0.0 11.00 28.00 57.00 306.0 ▇▂▁▁▁
passes_pct_short 15 0.98 86.95 12.26 0.0 81.80 89.10 95.30 100.0 ▁▁▁▂▇
passes_completed_medium 3 1.00 34.68 43.76 0.0 7.00 19.00 47.00 416.0 ▇▁▁▁▁
passes_medium 3 1.00 39.75 46.84 0.0 9.00 24.00 56.00 435.0 ▇▁▁▁▁
passes_pct_medium 16 0.98 81.54 18.58 0.0 73.88 85.65 94.70 100.0 ▁▁▁▃▇
passes_completed_long 3 1.00 7.99 10.05 0.0 1.00 4.00 11.00 55.0 ▇▂▁▁▁
passes_long 3 1.00 14.29 17.29 0.0 2.00 8.00 20.00 115.0 ▇▂▁▁▁
passes_pct_long 82 0.88 55.55 27.04 0.0 40.08 55.60 71.92 100.0 ▃▃▇▇▅
assists 0 1.00 0.18 0.49 0.0 0.00 0.00 0.00 3.0 ▇▁▁▁▁
xg_assist 3 1.00 0.17 0.32 0.0 0.00 0.00 0.20 3.1 ▇▁▁▁▁
pass_xa 3 1.00 0.15 0.29 0.0 0.00 0.10 0.20 3.6 ▇▁▁▁▁
xg_assist_net 3 1.00 0.01 0.36 -1.3 -0.10 0.00 0.00 2.1 ▁▇▁▁▁
assisted_shots 3 1.00 1.59 2.39 0.0 0.00 1.00 2.00 21.0 ▇▁▁▁▁
passes_into_final_third 3 1.00 5.93 8.25 0.0 1.00 3.00 8.00 71.0 ▇▁▁▁▁
passes_into_penalty_area 3 1.00 1.33 2.09 0.0 0.00 1.00 2.00 18.0 ▇▁▁▁▁
crosses_into_penalty_area 3 1.00 0.37 0.83 0.0 0.00 0.00 0.00 6.0 ▇▁▁▁▁
progressive_passes 3 1.00 5.14 6.51 0.0 1.00 3.00 7.00 61.0 ▇▁▁▁▁
skim(player_passing_types)
Data summary
Name player_passing_types
Number of rows 680
Number of columns 21
_______________________
Column type frequency:
character 4
numeric 17
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
player 0 1 4 26 0 680 0
position 0 1 2 2 0 4 0
team 0 1 5 14 0 32 0
age 0 1 6 6 0 634 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
birth_year 0 1 1994.91 4.16 1983 1992.0 1995.0 1998 2004.0 ▁▅▇▇▃
minutes_90s 0 1 2.12 1.64 0 0.8 1.9 3 7.7 ▇▆▂▁▁
passes 3 1 101.63 102.48 0 28.0 69.0 141 689.0 ▇▂▁▁▁
passes_live 3 1 92.04 96.25 0 24.0 61.0 121 678.0 ▇▂▁▁▁
passes_dead 3 1 9.22 12.65 0 1.0 4.0 12 73.0 ▇▁▁▁▁
passes_free_kicks 3 1 2.60 3.95 0 0.0 1.0 4 33.0 ▇▁▁▁▁
through_balls 3 1 0.22 0.59 0 0.0 0.0 0 4.0 ▇▁▁▁▁
passes_switches 3 1 0.88 1.70 0 0.0 0.0 1 17.0 ▇▁▁▁▁
crosses 0 1 3.20 5.56 0 0.0 1.0 4 40.0 ▇▁▁▁▁
throw_ins 3 1 3.84 8.86 0 0.0 0.0 2 67.0 ▇▁▁▁▁
corner_kicks 3 1 0.84 2.85 0 0.0 0.0 0 28.0 ▇▁▁▁▁
corner_kicks_in 3 1 0.35 1.31 0 0.0 0.0 0 13.0 ▇▁▁▁▁
corner_kicks_out 3 1 0.30 1.23 0 0.0 0.0 0 11.0 ▇▁▁▁▁
corner_kicks_straight 3 1 0.02 0.32 0 0.0 0.0 0 8.0 ▇▁▁▁▁
passes_completed 3 1 82.47 90.05 0 20.0 52.0 110 642.0 ▇▂▁▁▁
passes_offsides 3 1 0.38 0.74 0 0.0 0.0 1 5.0 ▇▁▁▁▁
passes_blocked 3 1 1.79 2.23 0 0.0 1.0 3 16.0 ▇▁▁▁▁
skim(player_playingtime)
Data summary
Name player_playingtime
Number of rows 829
Number of columns 27
_______________________
Column type frequency:
character 4
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
player 0 1 4 26 0 829 0
position 0 1 2 2 0 4 0
team 0 1 5 14 0 32 0
age 0 1 6 6 0 762 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
birth_year 0 1.00 1994.99 4.27 1982.00 1992.00 1995.00 1998.00 2004.00 ▁▃▇▇▃
games 0 1.00 2.41 1.78 0.00 1.00 2.00 3.00 7.00 ▇▃▇▁▁
minutes 149 0.82 191.19 147.77 1.00 68.00 173.00 270.00 690.00 ▇▆▂▁▁
minutes_per_game 149 0.82 59.62 28.91 1.00 35.00 65.00 89.00 120.00 ▃▅▃▇▁
minutes_pct 148 0.82 51.68 34.47 0.00 18.80 47.40 86.40 100.00 ▇▅▃▃▇
minutes_90s 148 0.82 2.12 1.64 0.00 0.80 1.90 3.00 7.70 ▇▆▂▁▁
games_starts 0 1.00 1.70 1.78 0.00 0.00 1.00 3.00 7.00 ▇▂▅▁▁
minutes_per_start 309 0.63 80.05 14.29 12.00 72.00 86.00 90.00 120.00 ▁▁▃▇▁
games_complete 0 1.00 0.99 1.48 0.00 0.00 0.00 2.00 7.00 ▇▁▂▁▁
games_subs 0 1.00 0.71 1.06 0.00 0.00 0.00 1.00 6.00 ▇▁▁▁▁
minutes_per_sub 486 0.41 21.96 13.19 1.00 12.00 21.00 30.00 78.00 ▇▇▃▁▁
unused_subs 0 1.00 1.49 1.65 0.00 0.00 1.00 3.00 7.00 ▇▂▃▁▁
points_per_game 148 0.82 1.24 0.86 0.00 0.50 1.33 1.80 3.00 ▇▃▇▃▂
on_goals_for 148 0.82 2.79 3.19 0.00 0.00 2.00 4.00 16.00 ▇▂▁▁▁
on_goals_against 148 0.82 2.79 2.41 0.00 1.00 2.00 4.00 11.00 ▇▃▂▂▁
plus_minus 148 0.82 0.00 2.92 -8.00 -2.00 0.00 1.00 11.00 ▁▇▇▂▁
plus_minus_per90 149 0.82 -0.40 2.88 -45.00 -1.00 0.00 0.72 22.50 ▁▁▁▇▁
plus_minus_wowy 240 0.71 -0.07 4.07 -44.25 -1.38 0.00 1.34 43.98 ▁▁▇▁▁
on_xg_for 152 0.82 2.76 2.67 0.00 0.80 2.30 3.80 15.20 ▇▃▁▁▁
on_xg_against 152 0.82 2.75 2.30 0.00 1.00 2.30 3.90 11.00 ▇▅▂▁▁
xg_plus_minus 152 0.82 0.00 2.49 -8.70 -1.10 -0.10 0.80 10.60 ▁▃▇▁▁
xg_plus_minus_per90 154 0.81 -0.11 3.39 -46.61 -0.76 -0.14 0.48 46.61 ▁▁▇▁▁
xg_plus_minus_wowy 233 0.72 -0.05 4.57 -44.55 -0.90 -0.05 0.83 44.55 ▁▁▇▁▁
skim(player_possession)
Data summary
Name player_possession
Number of rows 680
Number of columns 20
_______________________
Column type frequency:
character 4
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
player 0 1 4 26 0 680 0
position 0 1 2 2 0 4 0
team 0 1 5 14 0 32 0
age 0 1 6 6 0 634 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
birth_year 0 1.00 1994.91 4.16 1983 1992.0 1995.0 1998.00 2004.0 ▁▅▇▇▃
minutes_90s 0 1.00 2.12 1.64 0 0.8 1.9 3.00 7.7 ▇▆▂▁▁
touches 3 1.00 121.41 114.23 0 34.0 88.0 171.00 710.0 ▇▃▁▁▁
touches_def_pen_area 3 1.00 11.63 25.95 0 1.0 3.0 10.00 224.0 ▇▁▁▁▁
touches_def_3rd 3 1.00 37.68 47.73 0 5.0 17.0 53.00 297.0 ▇▂▁▁▁
touches_mid_3rd 3 1.00 58.28 67.50 0 12.0 38.0 79.00 518.0 ▇▁▁▁▁
touches_att_3rd 3 1.00 26.54 31.80 0 5.0 15.0 37.00 239.0 ▇▁▁▁▁
touches_att_pen_area 3 1.00 3.56 5.25 0 0.0 2.0 5.00 61.0 ▇▁▁▁▁
touches_live_ball 3 1.00 121.38 114.20 0 34.0 88.0 171.00 710.0 ▇▃▁▁▁
dribbles_completed 3 1.00 1.09 2.20 0 0.0 0.0 1.00 25.0 ▇▁▁▁▁
dribbles 3 1.00 2.90 4.87 0 0.0 1.0 4.00 52.0 ▇▁▁▁▁
dribbles_completed_pct 245 0.64 38.57 35.22 0 0.0 33.3 56.35 100.0 ▇▃▅▂▃
miscontrols 3 1.00 2.92 3.38 0 0.0 2.0 4.00 21.0 ▇▂▁▁▁
dispossessed 3 1.00 1.75 2.44 0 0.0 1.0 3.00 24.0 ▇▁▁▁▁
passes_received 3 1.00 81.34 84.41 0 22.0 58.0 111.00 598.0 ▇▂▁▁▁
progressive_passes_received 3 1.00 4.99 6.71 0 0.0 2.0 7.00 58.0 ▇▁▁▁▁
skim(player_stats)
Data summary
Name player_stats
Number of rows 680
Number of columns 31
_______________________
Column type frequency:
character 5
numeric 26
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
player 0 1 4 26 0 680 0
position 0 1 2 2 0 4 0
team 0 1 5 14 0 32 0
age 0 1 6 6 0 634 0
club 1 1 3 33 0 254 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
birth_year 0 1.00 1994.91 4.16 1983 1992.0 1995.00 1998.00 2004.00 ▁▅▇▇▃
games 0 1.00 2.93 1.52 1 2.0 3.00 4.00 7.00 ▇▆▃▁▂
games_starts 0 1.00 2.07 1.75 0 1.0 2.00 3.00 7.00 ▇▂▆▁▁
minutes 0 1.00 191.19 147.77 1 68.0 173.00 270.00 690.00 ▇▆▂▁▁
minutes_90s 0 1.00 2.12 1.64 0 0.8 1.90 3.00 7.70 ▇▆▂▁▁
goals 0 1.00 0.25 0.70 0 0.0 0.00 0.00 8.00 ▇▁▁▁▁
assists 0 1.00 0.18 0.49 0 0.0 0.00 0.00 3.00 ▇▁▁▁▁
goals_pens 0 1.00 0.22 0.60 0 0.0 0.00 0.00 6.00 ▇▁▁▁▁
pens_made 0 1.00 0.03 0.21 0 0.0 0.00 0.00 4.00 ▇▁▁▁▁
pens_att 0 1.00 0.03 0.27 0 0.0 0.00 0.00 5.00 ▇▁▁▁▁
cards_yellow 0 1.00 0.33 0.57 0 0.0 0.00 1.00 3.00 ▇▃▁▁▁
cards_red 0 1.00 0.01 0.08 0 0.0 0.00 0.00 1.00 ▇▁▁▁▁
goals_per90 0 1.00 0.11 0.34 0 0.0 0.00 0.00 3.46 ▇▁▁▁▁
assists_per90 0 1.00 0.10 0.88 0 0.0 0.00 0.00 22.50 ▇▁▁▁▁
goals_assists_per90 0 1.00 0.21 0.95 0 0.0 0.00 0.14 22.50 ▇▁▁▁▁
goals_pens_per90 0 1.00 0.10 0.33 0 0.0 0.00 0.00 3.46 ▇▁▁▁▁
goals_assists_pens_per90 0 1.00 0.20 0.95 0 0.0 0.00 0.00 22.50 ▇▁▁▁▁
xg 3 1.00 0.26 0.54 0 0.0 0.10 0.30 6.60 ▇▁▁▁▁
npxg 3 1.00 0.23 0.42 0 0.0 0.10 0.30 3.60 ▇▁▁▁▁
xg_assist 3 1.00 0.17 0.32 0 0.0 0.00 0.20 3.10 ▇▁▁▁▁
npxg_xg_assist 3 1.00 0.40 0.61 0 0.0 0.20 0.50 5.10 ▇▁▁▁▁
xg_per90 5 0.99 0.13 0.28 0 0.0 0.04 0.13 2.86 ▇▁▁▁▁
xg_assist_per90 5 0.99 0.11 0.59 0 0.0 0.01 0.10 14.37 ▇▁▁▁▁
xg_xg_assist_per90 5 0.99 0.24 0.66 0 0.0 0.09 0.27 14.37 ▇▁▁▁▁
npxg_per90 5 0.99 0.12 0.27 0 0.0 0.04 0.13 2.86 ▇▁▁▁▁
npxg_xg_assist_per90 5 0.99 0.23 0.65 0 0.0 0.09 0.26 14.37 ▇▁▁▁▁
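In all three player tables, age is read as a character column whose values are always six characters long, so it cannot be used directly as a numeric predictor. The sketch below assumes the values follow a "years-days" pattern such as "24-123"; that format is an assumption, since the raw values are not shown here, and the demo table is hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for player_stats; the "years-days" format is assumed.
players_demo <- tibble(
  player = c("Player A", "Player B", "Player C"),
  age    = c("24-123", "31-005", "19-300")
)

players_age <- players_demo |>
  # Split "24-123" into whole years and additional days, converted to integers.
  separate(age, into = c("age_years", "age_days"), sep = "-", convert = TRUE) |>
  # Approximate age in decimal years.
  mutate(age_decimal = age_years + age_days / 365.25)

players_age
```

A numeric age like this could then be binned (for example, early 20s, late 20s, 30 and over) to compare against the hypothesis about players in the second half of their 20s.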

Data 2

Introduction and data

  • Identify the source of the data.

The data come from the UNdata website, which publishes official statistics compiled from the UN data system.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data were first compiled in 2010 from official records that each country’s national statistical office submits in response to annual questionnaires.

  • Write a brief description of the observations.

The observations include the country along with its population growth rate, fertility rate, mortality rate, and life expectancy at birth.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

What social indicators are most closely correlated to a country’s decreasing population growth rate?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

We will examine indicators commonly associated with a country’s development and how they relate to its population growth rate: for example, GDP, female education rates, labor market indicators, and crime rates. We predict that increases in crime rates and in female education rates will be associated with a decrease in population growth rate. This is important to study because it could provide insight into how to bolster or hinder a country’s growth rate, which can be valuable for governments and NGOs.

  • Identify the types of variables in your research question. Categorical? Quantitative?

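The growth rate is quantitative. The indicators named above (GDP, female education rates, labor market indicators, crime rates) are also quantitative; country is the categorical identifier.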
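As a rough illustration of how the research question could be approached, the sketch below ranks indicators by the strength of their correlation with growth rate. Everything in it is simulated: the column names female_education, crime_rate, and growth_rate are hypothetical, and the relationship between them is built into the simulation purely for demonstration, so the output says nothing about real countries.

```r
set.seed(1)

# Simulated country-level table; none of these numbers are real.
n <- 50
indicators_demo <- data.frame(
  female_education = runif(n, 20, 100),
  crime_rate       = runif(n, 0, 10)
)
# Growth rate is constructed to fall as female education rises (demo only).
indicators_demo$growth_rate <-
  3 - 0.02 * indicators_demo$female_education + rnorm(n, sd = 0.3)

# Pearson correlation of each indicator with growth rate.
cors <- sapply(
  indicators_demo[c("female_education", "crime_rate")],
  function(x) cor(x, indicators_demo$growth_rate, use = "complete.obs")
)

# Rank indicators by absolute correlation, strongest first.
sort(abs(cors), decreasing = TRUE)
```

Correlation only captures linear association, so a scatterplot of each indicator against growth rate would be a sensible companion check.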

Glimpse of data

pop_growth <- read.csv("data/UNpopulation_growth.csv")
labor_market <- read.csv("data/UNLabor_Market.csv")

skim(pop_growth)
Data summary
Name pop_growth
Number of rows 6655
Number of columns 7
_______________________
Column type frequency:
character 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
T04 0 1 1 19 0 265 0
Population.growth.and.indicators.of.fertility.and.mortality 0 1 0 29 1 265 0
X 0 1 4 4 0 5 0
X.1 0 1 6 56 0 8 0
X.2 0 1 1 5 0 1067 0
X.3 0 1 0 364 4110 83 0
X.4 0 1 6 297 0 5 0
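The column names in this summary (T04, X, X.1, and so on) and the fact that all seven columns are read as character suggest that the first line of the CSV is a table title rather than the real header. A minimal sketch of one fix, assuming the true header sits on the second line: pass skip = 1 to read.csv. The demo file, its column names, and its values are hypothetical and only mimic that assumed layout.

```r
# Write a small file that mimics the assumed layout:
# a title row first, then the real header, then the data.
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "T04,Population growth and indicators of fertility and mortality,,,,,",
  "code,country,year,series,value,footnotes,source",
  "1,Country A,2010,Series one,1.5,,Simulated",
  "1,Country A,2010,Series two,2.5,,Simulated",
  "2,Country B,2010,Series one,0.7,,Simulated"
), demo_path)

# skip = 1 drops the title row so the second line is used as the header.
pop_demo <- read.csv(demo_path, skip = 1)

str(pop_demo)
```

With the title row skipped, the value column in this demo is read as numeric. In the real file, values containing thousands separators or footnote markers might still need cleaning before they convert.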
skim(labor_market)
Data summary
Name labor_market
Number of rows 2194
Number of columns 20
_______________________
Column type frequency:
character 15
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Country..code. 5 1 0 3 20 174 0
Country 0 1 0 42 1 194 0
Year 0 1 0 106 2 78 0
Sex 0 1 0 2 20 2 0
Youth.unemployed…000. 0 1 0 10 30 1836 0
Youth.labour.force…000. 0 1 0 11 354 1783 0
Adult.unemployed…000. 0 1 0 10 44 1939 0
Adult.labour.force…000. 0 1 0 12 357 1821 0
Total.unemployed…000. 0 1 0 10 42 2014 0
Youth.population…000. 0 1 0 12 485 1673 0
Repository..code. 0 1 0 9 20 8 0
Type.of.source..code. 0 1 0 4 20 9 0
Coverage..code. 0 1 0 1 20 3 0
Age 0 1 0 26 20 21 0
Notes 0 1 0 298 702 168 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Youth.unemployment.rate…. 345 0.84 18.05 10.97 0.7 9.8 15.9 23.70 70.9 ▇▇▂▁▁
Ratio.of.youth.unemployment.rate.to.adult.unemployment.rate 363 0.83 2.96 1.28 0.6 2.3 2.7 3.30 16.4 ▇▁▁▁▁
Share.of.youth.unemployed.in.total.unemployed…. 42 0.98 38.37 13.47 7.0 27.7 37.5 47.73 83.4 ▂▇▇▃▁
Share.of.youth.unemployed.in.youth.population…. 486 0.78 8.19 4.60 0.5 4.8 7.3 10.50 30.8 ▇▇▂▁▁
Adult.unemployment.rate…. 352 0.84 6.80 4.90 0.3 3.6 5.7 8.60 37.8 ▇▃▁▁▁

Data 3

Introduction and data

  • Identify the source of the data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

  • Write a brief description of the observations.

The first group of data was compiled by the Wall Street Journal from Payscale, Inc. data and was last updated in 2017. It contains three datasets. The first has one observation per degree, with salary information for that degree. The other two give salary information per college, one grouped by school type and one by region. (https://www.kaggle.com/datasets/wsj/college-salaries?resource=download)

The second group of data is from the Department of Education Statistics and covers 1970-2011. Each observation corresponds to one year and gives the percentage of bachelor’s degrees awarded to women in each major for that year.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

How has the student gender ratio within different fields of study/college majors changed over the years, and is this related to the median earnings for each field of study?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

The demographics of college students are a good indication of the interests of young people, and variables such as gender ratio and median earnings within different fields of study can give us insight into the young workforce. I hypothesize that, in recent years, fields of study traditionally associated with men or with women have become less skewed in gender proportion.

  • Identify the types of variables in your research question. Categorical? Quantitative?

Field of study - categorical

Gender ratio - quantitative

Median earnings - quantitative

Glimpse of data

degrees <- read.csv("data/college/degrees-that-pay-back.csv")
degrees |> glimpse()
Rows: 50
Columns: 8
$ Undergraduate.Major                               <chr> "Accounting", "Aeros…
$ Starting.Median.Salary                            <chr> "$46,000.00", "$57,7…
$ Mid.Career.Median.Salary                          <chr> "$77,100.00", "$101,…
$ Percent.change.from.Starting.to.Mid.Career.Salary <dbl> 67.6, 75.0, 68.8, 67…
$ Mid.Career.10th.Percentile.Salary                 <chr> "$42,200.00", "$64,3…
$ Mid.Career.25th.Percentile.Salary                 <chr> "$56,100.00", "$82,1…
$ Mid.Career.75th.Percentile.Salary                 <chr> "$108,000.00", "$127…
$ Mid.Career.90th.Percentile.Salary                 <chr> "$152,000.00", "$161…
salaries_type <- read.csv("data/college/salaries-by-college-type.csv")
salaries_type |> glimpse()
Rows: 269
Columns: 8
$ School.Name                       <chr> "Massachusetts Institute of Technolo…
$ School.Type                       <chr> "Engineering", "Engineering", "Engin…
$ Starting.Median.Salary            <chr> "$72,200.00", "$75,500.00", "$71,800…
$ Mid.Career.Median.Salary          <chr> "$126,000.00", "$123,000.00", "$122,…
$ Mid.Career.10th.Percentile.Salary <chr> "$76,800.00", "N/A", "N/A", "$66,800…
$ Mid.Career.25th.Percentile.Salary <chr> "$99,200.00", "$104,000.00", "$96,00…
$ Mid.Career.75th.Percentile.Salary <chr> "$168,000.00", "$161,000.00", "$180,…
$ Mid.Career.90th.Percentile.Salary <chr> "$220,000.00", "N/A", "N/A", "$190,0…
salaries_region <- read.csv("data/college/salaries-by-region.csv")
salaries_region |> glimpse()
Rows: 320
Columns: 8
$ School.Name                       <chr> "Stanford University", "California I…
$ Region                            <chr> "California", "California", "Califor…
$ Starting.Median.Salary            <chr> "$70,400.00", "$75,500.00", "$71,800…
$ Mid.Career.Median.Salary          <chr> "$129,000.00", "$123,000.00", "$122,…
$ Mid.Career.10th.Percentile.Salary <chr> "$68,400.00", "N/A", "N/A", "$59,500…
$ Mid.Career.25th.Percentile.Salary <chr> "$93,100.00", "$104,000.00", "$96,00…
$ Mid.Career.75th.Percentile.Salary <chr> "$184,000.00", "$161,000.00", "$180,…
$ Mid.Career.90th.Percentile.Salary <chr> "$257,000.00", "N/A", "N/A", "$201,0…
college_women <- read.csv("data/college/percent-bachelors-degrees-women-usa.csv")
college_women |> glimpse()
Rows: 42
Columns: 18
$ Year                          <int> 1970, 1971, 1972, 1973, 1974, 1975, 1976…
$ Agriculture                   <dbl> 4.229798, 5.452797, 7.420710, 9.653602, …
$ Architecture                  <dbl> 11.92101, 12.00311, 13.21459, 14.79161, …
$ Art.and.Performance           <dbl> 59.7, 59.9, 60.4, 60.2, 61.9, 60.9, 61.3…
$ Biology                       <dbl> 29.08836, 29.39440, 29.81022, 31.14791, …
$ Business                      <dbl> 9.064439, 9.503187, 10.558962, 12.804602…
$ Communications.and.Journalism <dbl> 35.3, 35.5, 36.6, 38.4, 40.5, 41.5, 44.3…
$ Computer.Science              <dbl> 13.6, 13.6, 14.9, 16.4, 18.9, 19.8, 23.9…
$ Education                     <dbl> 74.53533, 74.14920, 73.55452, 73.50181, …
$ Engineering                   <dbl> 0.8, 1.0, 1.2, 1.6, 2.2, 3.2, 4.5, 6.8, …
$ English                       <dbl> 65.57092, 64.55649, 63.66426, 62.94150, …
$ Foreign.Languages             <dbl> 73.8, 73.9, 74.6, 74.9, 75.3, 75.0, 74.4…
$ Health.Professions            <dbl> 77.1, 75.5, 76.9, 77.4, 77.9, 78.9, 79.2…
$ Math.and.Statistics           <dbl> 38.0, 39.0, 40.2, 40.9, 41.8, 40.7, 41.5…
$ Physical.Sciences             <dbl> 13.8, 14.9, 14.8, 16.5, 18.2, 19.1, 20.0…
$ Psychology                    <dbl> 44.4, 46.2, 47.6, 50.4, 52.6, 54.5, 56.9…
$ Public.Administration         <dbl> 68.4, 65.5, 62.6, 64.3, 66.1, 63.0, 65.6…
$ Social.Sciences.and.History   <dbl> 36.8, 36.2, 36.1, 36.4, 37.3, 37.7, 39.2…
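The glimpses above show the salary columns stored as character strings such as "$72,200.00", with some entries recorded as "N/A", so they need converting before any medians can be compared. The sketch below uses readr::parse_number() on every column whose name ends in "Salary"; the demo table is a hypothetical two-row stand-in that reuses the string format shown above.

```r
library(dplyr)
library(readr)

# Hypothetical rows in the same format as the salary columns shown above.
salaries_demo <- tibble(
  School.Name                       = c("School A", "School B"),
  Starting.Median.Salary            = c("$72,200.00", "$75,500.00"),
  Mid.Career.10th.Percentile.Salary = c("$76,800.00", "N/A")
)

salaries_clean <- salaries_demo |>
  # Strip "$" and "," and convert to numeric; treat "N/A" as missing.
  mutate(across(
    ends_with("Salary"),
    ~ parse_number(.x, na = c("", "NA", "N/A"))
  ))

salaries_clean
```

The same mutate() call should apply to degrees, salaries_type, and salaries_region, since all three share the "...Salary" naming pattern.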
skim(degrees)
Data summary
Name degrees
Number of rows 50
Number of columns 8
_______________________
Column type frequency:
character 7
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Undergraduate.Major 0 1 4 36 0 50 0
Starting.Median.Salary 0 1 10 10 0 43 0
Mid.Career.Median.Salary 0 1 10 11 0 49 0
Mid.Career.10th.Percentile.Salary 0 1 10 10 0 45 0
Mid.Career.25th.Percentile.Salary 0 1 10 10 0 48 0
Mid.Career.75th.Percentile.Salary 0 1 10 11 0 44 0
Mid.Career.90th.Percentile.Salary 0 1 10 11 0 43 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Percent.change.from.Starting.to.Mid.Career.Salary 0 1 69.27 17.91 23.4 59.12 67.8 82.42 103.5 ▁▂▇▃▂
skim(salaries_type)
Data summary
Name salaries_type
Number of rows 269
Number of columns 8
_______________________
Column type frequency:
character 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
School.Name 0 1 12 67 0 249 0
School.Type 0 1 5 12 0 5 0
Starting.Median.Salary 0 1 10 10 0 145 0
Mid.Career.Median.Salary 0 1 10 11 0 168 0
Mid.Career.10th.Percentile.Salary 0 1 3 10 0 142 0
Mid.Career.25th.Percentile.Salary 0 1 10 11 0 178 0
Mid.Career.75th.Percentile.Salary 0 1 10 11 0 110 0
Mid.Career.90th.Percentile.Salary 0 1 3 11 0 99 0
skim(salaries_region)
Data summary
Name salaries_region
Number of rows 320
Number of columns 8
_______________________
Column type frequency:
character 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
School.Name 0 1 12 67 0 320 0
Region 0 1 7 12 0 5 0
Starting.Median.Salary 0 1 10 10 0 168 0
Mid.Career.Median.Salary 0 1 10 11 0 204 0
Mid.Career.10th.Percentile.Salary 0 1 3 10 0 167 0
Mid.Career.25th.Percentile.Salary 0 1 10 11 0 217 0
Mid.Career.75th.Percentile.Salary 0 1 10 11 0 130 0
Mid.Career.90th.Percentile.Salary 0 1 3 11 0 116 0
skim(college_women)
Data summary
Name college_women
Number of rows 42
Number of columns 18
_______________________
Column type frequency:
numeric 18
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 1990.50 12.27 1970.00 1980.25 1990.50 2000.75 2011.00 ▇▇▇▇▇
Agriculture 0 1 33.85 12.55 4.23 30.84 33.32 45.66 50.04 ▂▂▇▅▇
Architecture 0 1 33.69 9.57 11.92 28.52 35.99 40.79 44.50 ▂▂▂▆▇
Art.and.Performance 0 1 61.10 1.31 58.60 60.20 61.30 62.00 63.40 ▅▃▇▇▃
Biology 0 1 49.43 10.09 29.09 44.31 50.97 58.68 62.17 ▃▂▃▆▇
Business 0 1 40.65 13.12 9.06 37.39 47.21 48.88 50.55 ▂▁▁▁▇
Communications.and.Journalism 0 1 56.22 8.70 35.30 55.13 59.85 62.13 64.60 ▂▁▁▂▇
Computer.Science 0 1 25.81 6.69 13.60 19.12 27.30 29.77 37.10 ▆▃▇▇▅
Education 0 1 76.36 2.21 72.17 74.99 75.94 78.62 79.62 ▅▃▆▃▇
Engineering 0 1 12.89 5.67 0.80 10.62 14.10 16.95 19.00 ▂▁▂▅▇
English 0 1 66.19 1.95 61.65 65.58 66.11 67.86 68.89 ▃▁▇▅▇
Foreign.Languages 0 1 71.72 1.93 69.00 70.12 71.15 73.88 75.30 ▇▇▃▂▆
Health.Professions 0 1 82.98 2.91 75.50 81.82 83.70 85.18 86.50 ▂▁▃▃▇
Math.and.Statistics 0 1 44.48 2.65 38.00 42.87 44.90 46.50 48.30 ▁▃▆▅▇
Physical.Sciences 0 1 31.30 9.00 13.80 24.88 32.10 40.20 42.20 ▃▂▃▃▇
Psychology 0 1 68.78 9.71 44.40 65.55 72.75 76.92 77.80 ▂▁▁▃▇
Public.Administration 0 1 76.09 5.88 62.60 74.62 77.45 81.10 82.10 ▂▁▁▆▇
Social.Sciences.and.History 0 1 45.41 4.76 36.10 43.82 45.30 49.38 51.80 ▃▁▆▂▇
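college_women stores one column per major, which is awkward for comparing trends across majors. One possible approach, sketched below, is to reshape it to one row per year and major and then measure how far each major sits from an even 50/50 split. The demo table reuses the first three years of two majors from the glimpse output; the skew measure (absolute distance from 50) is an illustrative choice, not a method stated in this proposal.

```r
library(dplyr)
library(tidyr)

# First three years of two majors, as shown in the glimpse output above.
women_demo <- tibble(
  Year        = c(1970, 1971, 1972),
  Engineering = c(0.8, 1.0, 1.2),
  Psychology  = c(44.4, 46.2, 47.6)
)

women_long <- women_demo |>
  # One row per (Year, major) pair.
  pivot_longer(-Year, names_to = "major", values_to = "pct_women") |>
  # Distance from an even split; smaller means less skewed.
  mutate(skew = abs(pct_women - 50)) |>
  arrange(major, Year)

# Change in skew between the first and last year for each major.
skew_change <- women_long |>
  group_by(major) |>
  summarize(change_in_skew = last(skew) - first(skew), .groups = "drop")

skew_change
```

A negative change_in_skew means a major moved closer to an even split over the period, which is the pattern the hypothesis predicts.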