Assessing Top Soccer Player Statistics

Exploratory data analysis

Research question(s)

Research question(s). State your research question(s) clearly.

Is there a correlation between a player's height and/or weight and specific game statistics (including goals scored, assists, penalties, passes, etc.)?

  • Do teams with taller players (player_height_cm) tend to get booked more frequently (mutate –> total_bookings = cards_yellow + cards_red + cards_yellowred)?
  • How closely are player ratings correlated with goal accuracy (mutate –> goal_acc = goals_total / shots_total)?
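
The two derived variables named in the bullets above could be computed as in the following minimal sketch. It uses base R on a small invented data frame (the `toy` rows are illustrative, not taken from the dataset), and it treats a player with zero recorded shots as having undefined accuracy, which is an assumption rather than something the project specifies.

```r
# Toy rows standing in for soccer_clean; the values are invented for illustration.
toy <- data.frame(
  cards_yellow    = c(2, 5, 0),
  cards_yellowred = c(0, 1, 0),
  cards_red       = c(0, 0, 1),
  goals_total     = c(23, 14, 3),
  shots_total     = c(69, 56, 0)
)

# Total bookings: sum of the three card columns.
toy$total_bookings <- toy$cards_yellow + toy$cards_yellowred + toy$cards_red

# Goal accuracy: goals per shot; left as NA when no shots are recorded
# (assumption: zero shots means accuracy is undefined, not zero).
toy$goal_acc <- ifelse(toy$shots_total > 0,
                       toy$goals_total / toy$shots_total,
                       NA_real_)

print(toy[, c("total_bookings", "goal_acc")])
```

Keeping the zero-shot case as NA avoids infinite or NaN values reaching the plots later on.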

Are some leagues better than others?

  • Are some leagues (league_name) producing more goals than others?

  • Among the teams (team_name) with the greatest goal output (goals_total) in each league, which has the most bookings (i.e. mutate –> total_bookings = cards_yellow + cards_red + cards_yellowred)?
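
One possible way to approach the team-level question is sketched below in base R. The league names, team names, and numbers are invented, and the team totals here only sum over the players that appear in the data (the listed top scorers), so they are not full-squad totals.

```r
# Toy player-level rows; leagues, teams, and numbers are invented for illustration.
toy <- data.frame(
  league_name     = c("A", "A", "A", "B", "B", "B"),
  team_name       = c("a1", "a1", "a2", "b1", "b2", "b2"),
  goals_total     = c(10, 8, 15, 7, 6, 5),
  cards_yellow    = c(2, 3, 1, 4, 2, 2),
  cards_yellowred = c(0, 1, 0, 0, 0, 0),
  cards_red       = c(0, 0, 0, 1, 0, 1)
)
toy$total_bookings <- toy$cards_yellow + toy$cards_yellowred + toy$cards_red

# Sum goals and bookings over the listed scorers of each team.
team_totals <- aggregate(
  cbind(goals_total, total_bookings) ~ league_name + team_name,
  data = toy, FUN = sum
)

# Keep the team with the highest goal total in each league
# (ties keep every tied team).
best <- ave(team_totals$goals_total, team_totals$league_name, FUN = max)
top_teams <- team_totals[team_totals$goals_total == best, ]

# Rank those league leaders by bookings, most booked first.
top_teams <- top_teams[order(-top_teams$total_bookings), ]
print(top_teams)
```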

Data collection and cleaning

Have an initial draft of your data cleaning appendix. Document every step that takes your raw data file(s) and turns it into the analysis-ready data set that you would submit with your final project. Include text narrative describing your data collection (downloading, scraping, surveys, etc) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc). Include code for data curation/cleaning, but not collection.

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(skimr)
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
soccer <- read_csv("data/top_goalscorers.csv")
New names:
• `player.id` -> `player.id...1`
• `player.id` -> `player.id...48`
Rows: 560 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): team.name, team.logo, league.name, league.country, league.logo, l...
dbl  (36): player.id...1, team.id, league.id, league.season, games.appearenc...
lgl   (8): games.number, games.captain, goals.saves, dribbles.past, penalty....
date  (1): player.birth.date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
soccer <- clean_names(soccer)

# Strip the unit suffixes so height and weight become numeric,
# then drop columns that are unused, duplicated, or not relevant to goalscorers.
soccer_clean <- soccer |>
  mutate(player_height_cm = as.numeric(gsub(" cm", "", player_height)),
         player_weight_kg = as.numeric(gsub(" kg", "", player_weight))) |>
  select(-c(games_number, goals_saves, dribbles_past, penalty_won, 
            penalty_commited, player_height, player_weight, penalty_saved, 
            player_id_1, player_id_48, player_firstname, league_season))

soccer_clean
# A tibble: 560 × 51
   team_id team_name  team_logo league_id league_name league_country league_logo
     <dbl> <chr>      <chr>         <dbl> <chr>       <chr>          <chr>      
 1      47 Tottenham  https://…        39 Premier Le… England        https://me…
 2      40 Liverpool  https://…        39 Premier Le… England        https://me…
 3      33 Mancheste… https://…        39 Premier Le… England        https://me…
 4      47 Tottenham  https://…        39 Premier Le… England        https://me…
 5      40 Liverpool  https://…        39 Premier Le… England        https://me…
 6      50 Mancheste… https://…        39 Premier Le… England        https://me…
 7      40 Liverpool  https://…        39 Premier Le… England        https://me…
 8      46 Leicester  https://…        39 Premier Le… England        https://me…
 9      52 Crystal P… https://…        39 Premier Le… England        https://me…
10      50 Mancheste… https://…        39 Premier Le… England        https://me…
# ℹ 550 more rows
# ℹ 44 more variables: league_flag <chr>, games_appearences <dbl>,
#   games_lineups <dbl>, games_minutes <dbl>, games_position <chr>,
#   games_rating <dbl>, games_captain <lgl>, substitutes_in <dbl>,
#   substitutes_out <dbl>, substitutes_bench <dbl>, shots_total <dbl>,
#   shots_on <dbl>, goals_total <dbl>, goals_conceded <dbl>,
#   goals_assists <dbl>, passes_total <dbl>, passes_key <dbl>, …
glimpse(soccer_clean)
Rows: 560
Columns: 51
$ team_id               <dbl> 47, 40, 33, 47, 40, 50, 40, 46, 52, 50, 48, 46, …
$ team_name             <chr> "Tottenham", "Liverpool", "Manchester United", "…
$ team_logo             <chr> "https://media-2.api-sports.io/football/teams/47…
$ league_id             <dbl> 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, …
$ league_name           <chr> "Premier League", "Premier League", "Premier Lea…
$ league_country        <chr> "England", "England", "England", "England", "Eng…
$ league_logo           <chr> "https://media-2.api-sports.io/football/leagues/…
$ league_flag           <chr> "https://media-2.api-sports.io/flags/gb.svg", "h…
$ games_appearences     <dbl> 35, 35, 30, 37, 34, 30, 35, 25, 33, 30, 36, 35, …
$ games_lineups         <dbl> 35, 30, 27, 36, 32, 25, 27, 20, 31, 23, 34, 28, …
$ games_minutes         <dbl> 3021, 2762, 2459, 3232, 2825, 2205, 2373, 1806, …
$ games_position        <chr> "Attacker", "Attacker", "Attacker", "Attacker", …
$ games_rating          <dbl> 7.505714, 7.391428, 7.260000, 7.378378, 7.202941…
$ games_captain         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ substitutes_in        <dbl> 0, 5, 3, 1, 2, 5, 8, 5, 2, 7, 2, 7, 1, 5, 2, 2, …
$ substitutes_out       <dbl> 15, 5, 6, 1, 7, 9, 17, 5, 2, 6, 12, 14, 0, 7, 5,…
$ substitutes_bench     <dbl> 0, 5, 3, 1, 2, 8, 9, 7, 2, 14, 2, 7, 1, 8, 2, 2,…
$ shots_total           <dbl> 69, 101, 80, 97, 66, 56, 71, 47, 51, 46, 56, 54,…
$ shots_on              <dbl> 49, 60, 43, 55, 39, 31, 33, 26, 32, 28, 34, 34, …
$ goals_total           <dbl> 23, 23, 18, 17, 16, 15, 15, 15, 14, 13, 12, 12, …
$ goals_conceded        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ goals_assists         <dbl> 7, 13, 3, 9, 2, 8, 4, 2, 1, 5, 10, 8, 5, 10, 2, …
$ passes_total          <dbl> 1015, 1079, 879, 874, 1036, 1358, 744, 251, 909,…
$ passes_key            <dbl> 73, 64, 26, 53, 43, 87, 41, 18, 47, 42, 42, 49, …
$ passes_accuracy       <dbl> 25, 24, 24, 16, 23, 37, 15, 7, 22, 23, 15, 22, 1…
$ tackles_total         <dbl> 15, 17, 8, 15, 33, 32, 39, 6, 40, 23, 36, 40, 33…
$ tackles_blocks        <dbl> 6, 1, NA, 8, 1, 2, 2, NA, NA, NA, 2, 3, 2, 1, 3,…
$ tackles_interceptions <dbl> 13, 6, 2, 3, 8, 6, 6, 1, 5, 12, 21, 11, 14, 13, …
$ duels_total           <dbl> 270, 309, 217, 437, 361, 209, 384, 141, 506, 278…
$ duels_won             <dbl> 118, 100, 88, 198, 160, 101, 143, 51, 228, 117, …
$ dribbles_attempts     <dbl> 92, 124, 32, 100, 77, 55, 69, 22, 152, 112, 88, …
$ dribbles_success      <dbl> 51, 53, 20, 54, 47, 31, 27, 9, 75, 53, 50, 31, 2…
$ fouls_drawn           <dbl> 36, 23, 18, 54, 50, 27, 40, 12, 101, 37, 43, 64,…
$ fouls_committed       <dbl> 14, 12, 16, 42, 50, 21, 41, 13, 37, 19, 25, 22, …
$ cards_yellow          <dbl> 2, 1, 8, 5, 5, 2, 3, 2, 5, 1, 6, 3, 8, 4, 7, 6, …
$ cards_yellowred       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ cards_red             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ penalty_scored        <dbl> 0, 5, 3, 4, 0, 0, 0, 0, 5, 2, 0, 0, 5, 1, 1, 2, …
$ penalty_missed        <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, …
$ player_name           <chr> "Son Heung-Min", "Mohamed Salah", "Cristiano Ron…
$ player_lastname       <chr> "Son", "Salah Hamed Mahrous Ghaly", "dos Santos …
$ player_age            <dbl> 31, 31, 38, 30, 31, 32, 27, 36, 31, 29, 27, 27, …
$ player_nationality    <chr> "Korea Republic", "Egypt", "Portugal", "England"…
$ player_injured        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ player_photo          <chr> "https://media-1.api-sports.io/football/players/…
$ player_birth_date     <date> 1992-07-08, 1992-06-15, 1985-02-05, 1993-07-28,…
$ player_birth_place    <chr> "Chuncheon", "Muḥāfaẓat al Gharbiyya", "Funchal"…
$ player_birth_country  <chr> "Korea Republic", "Egypt", "Portugal", "England"…
$ id                    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ player_height_cm      <dbl> 184, 175, 187, 188, 175, 181, 178, 179, 180, 172…
$ player_weight_kg      <dbl> 77, 71, 83, 86, 69, 68, 68, 74, 66, 69, 70, 73, …

The data were first collected by making API calls to API-Football (https://www.api-football.com/documentation-v3) for various soccer leagues. We did this by making a GET request to the top scorers endpoint, supplying query string parameters for each of 27 top leagues for the 2021 season. Each request returned the 20 top goal scorers for the specified league. We then used the jsonlite library to flatten the JSON API responses into a tibble and exported the tibble to a CSV file.
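
Given 20 scorers from each of 27 leagues, one would expect 27 × 20 = 540 rows, whereas the loaded file reports 560. A per-league row count is one way to see where the difference comes from. The sketch below illustrates such a check in base R on invented league labels; `expected_per_league` is a hypothetical name, and in the report the input would be `soccer_clean$league_name`.

```r
# Invented league labels standing in for soccer_clean$league_name.
league_name <- c(rep("League A", 20), rep("League B", 20), rep("League C", 22))

expected_per_league <- 20  # hypothetical name; value taken from the query described above

# Count rows per league and flag any league that deviates from the expectation.
counts <- table(league_name)
off <- counts[counts != expected_per_league]

print(length(counts))  # number of distinct leagues
print(off)             # leagues whose row count is not 20
```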

Once we had our data stored in a CSV file, we wrote code to load it in R and then began the cleaning process. We stored the data in a CSV file rather than making API calls dynamically mainly because of the rate limits and fees associated with using the API. For the main cleaning process, we used the clean_names function from the janitor library to make sure our variable names conform to R naming conventions. After this, we dropped the columns we would not be using by applying select to our data frame. Originally, our data set had columns that were relevant to goalkeepers, defenders, and other players who are not primarily goalscorers. We also noticed that our height and weight columns were stored as character strings with unit suffixes rather than as numbers, so we stripped the units, converted the values to numeric, added the units to the variable names, and dropped the old columns. Lastly, we removed columns that appeared to be duplicates, such as player_id_48, player_id_1, and player_firstname.
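
The unit-stripping step described above can be made a little stricter so that unexpected values surface as NA instead of being converted silently. The following is a minimal base-R sketch on invented strings; the pattern assumes heights are whole numbers followed by " cm", which matches the values shown in the output but has not been checked against every row.

```r
# Example raw values in the "<number> cm" form; the last two are deliberately
# malformed or missing to show how they are handled.
raw_height <- c("184 cm", "175 cm", "n/a", NA)

# Accept only strings that are digits followed by " cm"; everything else stays NA.
ok <- !is.na(raw_height) & grepl("^[0-9]+ cm$", raw_height)
height_cm <- rep(NA_real_, length(raw_height))
height_cm[ok] <- as.numeric(sub(" cm$", "", raw_height[ok]))

print(height_cm)
print(sum(is.na(height_cm)))  # how many heights are unavailable after parsing
```

The same approach would apply to weight with " kg" in place of " cm".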

Data description

Have an initial draft of your data description section. Your data description should be about your analysis-ready data.

  • What are the observations (rows) and the attributes (columns)?

    • There are 560 observations (rows) and 51 attributes (columns). Each row is one player. The dataset contains player statistics (e.g., number of appearances, number of minutes played, number of yellow or red cards received) for the top 20 scorers from the top 27 football leagues for the 2021 league season.
  • Why was this dataset created?

    • This dataset was created to rank and compare the top-scoring football players across top leagues. This could be helpful for football fans, sports bettors, or data scientists!
  • Who funded the creation of the dataset?

    • N/A (most stats are recorded automatically, but were probably scraped from several summary sites and aggregated into one site)
  • What processes might have influenced what data was observed and recorded and what was not?

    • Referee rulings may have influenced what was recorded, as some goals (for example, ones that were disallowed) may not have counted in the final game statistics.
  • What preprocessing was done, and how did the data come to be in the form that you are using?

    • (see above in Data collection and cleaning)
  • If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?

    • The people involved are football players, who are aware of the data collection, as it occurs during and after all of their games.

    • They probably expected the data to be used for:

      • general comparison (past/future versions of themselves, comparing their team to another, or between leagues)

      • sports reporting (to football fans)

      • MVP decisions

Data limitations

  • Data is only from the 2021 season. Therefore, we cannot make comparisons within groups (player, team, or league) over time.

  • There are no goalkeepers in the dataset (it covers the best scorers, i.e. offensive players). Therefore, we are unable to assess a team’s defense, which is important when considering how good a team is!

Exploratory data analysis

Perform an (initial) exploratory data analysis.

# height vs bookings  
soccer_clean |>
  mutate(
    total_bookings = cards_yellow + cards_red + cards_yellowred
  ) |>
ggplot(mapping = aes(x = player_height_cm, y = total_bookings)) +
  geom_jitter() +
  geom_smooth() +
  labs(
    x = "Player Height (in cm)",
    y = "Bookings",
    title = "Player Height vs. Bookings"
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 34 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 34 rows containing missing values (`geom_point()`).
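
The warnings above indicate that 34 rows have no usable height and are dropped from the plot. To put a number on the relationship, one option is a rank correlation computed on the complete rows only. The sketch below uses invented values in base R; choosing Spearman's method is an assumption made here because booking counts are small integers, not a choice stated in the report.

```r
# Invented heights (cm) and booking counts; NA mimics players with no recorded height.
height_cm      <- c(184, 175, NA, 188, 170, 181, NA, 192)
total_bookings <- c(2, 1, 4, 6, 0, 3, 2, 7)

# Rows where both values are present.
complete <- !is.na(height_cm) & !is.na(total_bookings)
print(sum(!complete))  # rows that would be dropped

# Spearman rank correlation on the complete rows only.
rho <- cor(height_cm[complete], total_bookings[complete], method = "spearman")
print(rho)
```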

# player rating vs goal accuracy
# would need to transform, and/or implement axis limits
soccer_clean |>
  mutate(
    goal_acc = goals_total/shots_total
  ) |>
ggplot(mapping = aes(x = games_rating, y = goal_acc)) +
  geom_jitter() +
  geom_smooth() +
  labs(
    x = "Rating",
    y = "Goal Accuracy",
    title = "Rating vs. Goal Accuracy"
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 124 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 124 rows containing missing values (`geom_point()`).
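
The 124 removed rows are ones where either the rating or goal_acc is not a finite number. For goal_acc, R produces NA when the shot count is missing, Inf when a positive goal count is divided by zero shots, and NaN for 0/0. Because geom_point() reports the same count under "missing values", the likely cause is missing input rather than zero shots, though that is an inference from the warnings and not something verified here. The base-R sketch below, on invented numbers, shows how each case could be counted.

```r
# Invented goal and shot counts covering the problem cases.
goals_total <- c(23, 14, 3, 0, 5)
shots_total <- c(69, 56, 0, 0, NA)

goal_acc <- goals_total / shots_total

# Classify each row: finite, NaN (0/0), infinite (n/0), or plain NA (missing input).
status <- ifelse(is.finite(goal_acc), "finite",
          ifelse(is.nan(goal_acc), "NaN",
          ifelse(is.infinite(goal_acc), "Inf", "NA")))
print(table(status))
```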

# median number of goals by different football leagues
# (no group_by() needed: the boxplot groups by the y aesthetic)
soccer_clean |>
  ggplot(mapping = aes(x = goals_total, y = league_name)) +
  geom_boxplot() +
  labs(
    x = "Number of Goals",
    y = "League",
    title = "Median Goals by League"
  )
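
With many leagues on one axis, the boxplot is easier to read when leagues are ordered by their median. The sketch below shows one way to compute the medians and reorder the league factor in base R; the data are invented and the three league labels are placeholders.

```r
# Invented goal totals for three placeholder leagues.
toy <- data.frame(
  league_name = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  goals_total = c(10, 12, 20, 5, 6, 7, 15, 9, 30)
)

# Median goals per league, sorted from highest to lowest.
med <- aggregate(goals_total ~ league_name, data = toy, FUN = median)
med <- med[order(-med$goals_total), ]
print(med)

# Reorder the league factor by median so a boxplot would list leagues in that order.
toy$league_name <- reorder(factor(toy$league_name), toy$goals_total, FUN = median)
print(levels(toy$league_name))
```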

# goals vs bookings
soccer_clean |>
  mutate(
    total_bookings = cards_yellow + cards_red + cards_yellowred
  ) |>
ggplot(mapping = aes(x = goals_total, y = total_bookings)) +
  geom_jitter() +
  geom_smooth() +
  labs(
    x = "Number of Goals",
    y = "Bookings",
    title = "Goals vs. Bookings"
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# player height vs goal accuracy
# may need axis limits, or use goals_total instead of goal_acc if that gives a better model
soccer_clean |>
  mutate(
        goal_acc = goals_total/shots_total
  ) |>
ggplot(mapping = aes(x = player_height_cm, y = goal_acc)) +
  geom_jitter() +
  geom_smooth() +
  labs(
    x = "Player Height (in cm)",
    y = "Goal Accuracy",
    title = "Player Height vs. Goal accuracy"
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 131 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 131 rows containing missing values (`geom_point()`).

Questions for reviewers

  • How many research questions should we have? If only one or two, which ones do you suggest?

  • The data we are using are already fairly tidy (tidied when converting from API –> JSON –> CSV), so there are minimal cleaning steps in R. Can you confirm that this is acceptable, given that the proposal directions indicated that using an API was OK?