Assessing Top Soccer Player Statistics

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    • The World Bank
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data were collected in various years from several survey programs, chiefly DHS (Demographic and Health Surveys), I2D2, MICS (UNICEF), and IHSDHS (World Bank). The data in the .xlsx file is a compilation of all of these sources.
  • Write a brief description of the observations.

    • Each observation is a country in a particular survey year, so a country's trend in each statistic can be followed across the years in which data were gathered.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • Which countries experienced the most drastic shift in educational attainment among females and the lower classes? Which countries did not experience much of a shift?

    • Which countries currently have the greatest educational disparity by gender, place of residence, or economic status?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • Research statement: Understanding differences in education levels among different socioeconomic sectors and years in countries.

    • Hypothesis: Countries considered “first world,” such as the US, Canada, and the UK, will have the highest education levels across all socioeconomic statuses, with little change over the years. Countries considered “third world,” such as India, may show an increase in education levels over the years, but not up to the level of the first world countries.

  • Identify the types of variables in your research question. Categorical? Quantitative?
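    • The education measures (for example aAll_1, aUrban_1, and aRural_1) are quantitative. Country is categorical, and year can be treated as either.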
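
One way to measure the "shift" asked about in the first research question is to compare each country's earliest and latest survey year. The following is a minimal sketch on made-up toy rows, not the real data. It assumes that `aFemale_1` (a column name taken from the column specification) holds an attainment rate for females; the values and the countries "A" and "B" are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows standing in for the education data; values are made up
edu_toy <- tribble(
  ~country, ~year, ~aFemale_1,
  "A",      2000,  0.40,
  "A",      2015,  0.75,
  "B",      2001,  0.90,
  "B",      2016,  0.93
)

# Change between each country's earliest and latest survey year
edu_shift <- edu_toy |>
  group_by(country) |>
  arrange(year, .by_group = TRUE) |>
  summarise(
    first_year = first(year),
    last_year  = last(year),
    shift      = last(aFemale_1) - first(aFemale_1),
    .groups    = "drop"
  ) |>
  arrange(desc(shift))

edu_shift
```

Sorting by `shift` puts the countries with the largest change first. The same pattern could be repeated for other columns, such as `aRural_1`, to compare groups.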

Glimpse of data

NOTE: There are too many columns (4816) for skim() to run quickly and efficiently on the full data, so only three columns are skimmed here to show that we used the function.

education <- read_csv("data/edu.csv")
Rows: 626 Columns: 4816
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (19): country, stype, ccode, datasrc, cname, ccode3, countryname, coun...
dbl (4797): year, aAll_1, aUrban_1, aRural_1, aMale_1, aFemale_1, aMalUrb_1,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(education, "aAll_1", "aUrban_1", "aRural_1")
Data summary
Name education
Number of rows 626
Number of columns 4816
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
aAll_1 0 1.00 0.87 0.16 0.25 0.83 0.95 0.98 1 ▁▁▁▂▇
aUrban_1 10 0.98 0.93 0.09 0.54 0.92 0.97 0.99 1 ▁▁▁▂▇
aRural_1 10 0.98 0.84 0.20 0.08 0.78 0.93 0.98 1 ▁▁▁▂▇
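
As an alternative to naming columns one by one, a wide data set can be narrowed by a name pattern before it is summarised. The sketch below uses a made-up toy tibble. The `_1` suffix pattern comes from the column specification above, while `aAll_2` is a hypothetical column added only to show that it gets dropped.

```r
library(dplyr)
library(tibble)

# Hypothetical toy version of the wide education data; values are made up
wide_toy <- tibble(
  country  = c("A", "B"),
  year     = c(2000, 2001),
  aAll_1   = c(0.80, 0.90),
  aUrban_1 = c(0.90, 0.95),
  aRural_1 = c(0.70, 0.85),
  aAll_2   = c(0.50, 0.60)  # hypothetical column that the pattern excludes
)

# Keep the identifier columns plus every column whose name ends in "_1"
level1 <- wide_toy |>
  select(country, year, ends_with("_1"))

names(level1)
```

The reduced tibble could then be passed to `skim()` in place of the full data.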

Data 2

Introduction and data

  • Identify the source of the data.

    • The World Health Organization Global Health Expenditure Database
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • Member states who report the data to WHO. WHO can make estimations to fill any gaps. Data is updated yearly with a two-year lag.
  • Write a brief description of the observations.

    • Every observation is a country/year combination, with variables describing expenditures on disease categories such as infectious/parasitic diseases, NCDs, reproductive health, and nutritional deficiencies.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • What are the health priorities of different countries and regions, based on the percentage of health expenditure that goes to each category of disease?

    • Do high and low income countries have the same health priorities or different ones?

    • How do health priorities compare by region?

    • By joining with another dataset, we may be able to answer whether increased spending has a relationship with less disease.

  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • The research topic is how health expenditures vary by country and year, and how they relate to a variety of diseases. Our hypothesis, framed as a question: how does the distribution of a country's health expenditures, and the resulting out-of-pocket expenses, correlate with various diseases and conditions?
  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Country, region, income, and disease category are categorical. Year can be either, depending on how it is used. All other variables are quantitative.
    • Some other quantitative variables we could use are listed below. Most of them are expenditures per capita in USD.
      • che (current health expenditure in millions)

      • che_gdp (current health expenditure as a percent of the nation’s GDP)

      • che_pc_usd (current health expenditure per capita)

      • gghed_pc_usd (domestic general government health expenditure per capita)

      • pvtd_pc_usd (domestic private health expenditure per capita)

      • oop_pc_usd (out of pocket expenditure per capita)
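
To compare high and low income countries with the variables listed above, one option is to summarise per capita spending and the out-of-pocket share by income group. This is a minimal sketch on made-up toy rows. The column names `income`, `che_pc_usd`, and `oop_pc_usd` come from the data, but the values and the labels "Low" and "High" are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows; values and income labels are made up
health_toy <- tribble(
  ~country, ~income, ~che_pc_usd, ~oop_pc_usd,
  "A",      "Low",     50,          30,
  "B",      "Low",     70,          35,
  "C",      "High",  4000,         600,
  "D",      "High",  5000,         500
)

# Share of current health spending paid out of pocket, summarised by income group
oop_by_income <- health_toy |>
  mutate(oop_share = oop_pc_usd / che_pc_usd) |>
  group_by(income) |>
  summarise(
    median_che_pc  = median(che_pc_usd, na.rm = TRUE),
    mean_oop_share = mean(oop_share, na.rm = TRUE),
    .groups        = "drop"
  )

oop_by_income
```

Swapping `income` for `region` in `group_by()` would give the regional comparison instead.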

Glimpse of data

library("readxl")
health <- read_excel("data/GHED_data.XLSX", sheet = "Data")

skim(health, "country", "region", "income", "che", "che_gdp", "che_pc_usd", "gghed_pc_usd", "pvtd_pc_usd", "oop_pc_usd")
Data summary
Name health
Number of rows 4224
Number of columns 3220
_______________________
Column type frequency:
character 3
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 4 34 0 192 0
region 0 1 3 4 0 6 0
income 0 1 3 12 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
che 243 0.94 6597770.22 66623331.23 0.06 1656.00 26441.40 236615.31 2.102209e+09 ▇▁▁▁▁
che_gdp 244 0.94 6.24 2.98 1.26 4.15 5.67 7.93 5.018000e+01 ▇▁▁▁▁
che_pc_usd 243 0.94 917.10 1593.17 4.45 63.14 259.03 867.06 1.170241e+04 ▇▁▁▁▁
gghed_pc_usd 252 0.94 615.95 1152.05 0.15 21.97 130.38 545.15 7.857190e+03 ▇▁▁▁▁
pvtd_pc_usd 254 0.94 278.20 564.90 0.91 23.95 87.47 294.01 6.788900e+03 ▇▁▁▁▁
oop_pc_usd 254 0.94 182.30 271.12 0.09 19.56 69.39 225.47 2.761980e+03 ▇▁▁▁▁
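
The join with a second data set mentioned in the research questions could look roughly like the sketch below. Both tibbles are made-up toy data. The `year` column name, the second data set, and its `deaths_per_100k` column are all assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical expenditure rows (a country/year combination per row)
spend_toy <- tribble(
  ~country, ~year, ~che_pc_usd,
  "A",      2019,  100,
  "A",      2020,  120,
  "B",      2019,  900
)

# Hypothetical disease-burden data set; the column name is an assumption
burden_toy <- tribble(
  ~country, ~year, ~deaths_per_100k,
  "A",      2019,  800,
  "B",      2019,  400
)

# Keep every expenditure row and attach burden values where a match exists
joined <- left_join(spend_toy, burden_toy, by = c("country", "year"))

# Rows with no match are worth inspecting before any analysis
unmatched <- anti_join(spend_toy, burden_toy, by = c("country", "year"))

joined
unmatched
```

Checking `unmatched` first helps catch differences in country spelling or year coverage between the two sources.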

Data 3

Introduction and data

  • Identify the source of the data.

Source: https://www.api-football.com/documentation-v3

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data was collected through the API documented at https://www.api-football.com/documentation-v3. We made GET requests to its top scorers endpoint for 28 leagues in the 2021 season. As the code snippet below shows, we make one API call per league to fetch that league's top scorers. The API returns the top 20 goal scorers for the league specified in the query string parameters.

  • Write a brief description of the observations.

Each observation represents one of the top 20 scorers in their league for the 2021 season. For each player we have background information such as height, weight, and nationality. In addition, we have the player's statistics for the season (passes, goals scored, assists, penalties, etc.).

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Is there a correlation between height and/or weight and specific game statistics (including goals scored, assists, penalties, passes, etc.)?

  • As a sub-question: Do taller players complete a larger percentage of passes?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

  • Identify the types of variables in your research question. Categorical? Quantitative?
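
Because `player.height` is read in as a character column, it has to be converted to a number before any correlation can be computed. The following is a minimal sketch on made-up toy rows. It assumes heights are stored as strings like "180 cm"; the values are hypothetical and chosen to be perfectly linear, so the result says nothing about real players.

```r
library(dplyr)
library(readr)
library(tibble)

# Hypothetical toy rows; the "180 cm" string format is an assumption
scorers_toy <- tribble(
  ~player.height, ~goals.total,
  "170 cm",       10,
  "180 cm",       12,
  "190 cm",       14,
  NA,              9
)

# Extract the numeric part of the height string
scorers_num <- scorers_toy |>
  mutate(height_cm = parse_number(player.height))

# Pearson correlation, ignoring rows with a missing height
height_goal_cor <- cor(
  scorers_num$height_cm,
  scorers_num$goals.total,
  use = "complete.obs"
)

height_goal_cor
```

The same steps would apply to `player.weight` and to other statistics such as `passes.accuracy`.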

Glimpse of data

library(httr)
library(jsonlite)

Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':

    flatten
top_scorers <- read_csv("./data/top_goalscorers.csv")
New names:
• `player.id` -> `player.id...1`
• `player.id` -> `player.id...48`
Rows: 560 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): team.name, team.logo, league.name, league.country, league.logo, l...
dbl  (36): player.id...1, team.id, league.id, league.season, games.appearenc...
lgl   (8): games.number, games.captain, goals.saves, dribbles.past, penalty....
date  (1): player.birth.date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(top_scorers)
Data summary
Name top_scorers
Number of rows 560
Number of columns 61
_______________________
Column type frequency:
character 16
Date 1
logical 8
numeric 36
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
team.name 0 1.00 3 27 0 355 0
team.logo 0 1.00 51 54 0 455 0
league.name 0 1.00 7 26 0 26 0
league.country 0 1.00 3 12 0 13 0
league.logo 0 1.00 53 54 0 71 0
league.flag 0 1.00 42 42 0 36 0
games.position 0 1.00 8 10 0 3 0
player.name 0 1.00 3 32 0 544 0
player.firstname 0 1.00 3 30 0 504 0
player.lastname 0 1.00 2 27 0 535 0
player.nationality 0 1.00 3 22 0 68 0
player.height 34 0.94 6 6 0 35 0
player.weight 57 0.90 5 5 0 40 0
player.photo 0 1.00 53 57 0 555 0
player.birth.place 42 0.92 1 26 0 405 1
player.birth.country 0 1.00 3 22 0 63 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
player.birth.date 0 1 1981-04-06 2003-06-13 1994-07-26 522

Variable type: logical

skim_variable n_missing complete_rate mean count
games.number 560 0 NaN :
games.captain 0 1 0 FAL: 560
goals.saves 560 0 NaN :
dribbles.past 560 0 NaN :
penalty.won 560 0 NaN :
penalty.commited 560 0 NaN :
penalty.saved 560 0 NaN :
player.injured 0 1 0 FAL: 560

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
player.id...1 0 1.00 280.50 161.80 1.0 140.75 280.50 420.25 560.0 ▇▇▇▇▇
team.id 0 1.00 1377.56 2288.07 33.0 178.75 504.00 1361.75 15130.0 ▇▁▁▁▁
league.id 0 1.00 114.21 64.97 39.0 72.75 91.50 137.75 307.0 ▇▆▁▁▁
league.season 0 1.00 2021.00 0.00 2021.0 2021.00 2021.00 2021.00 2021.0 ▁▁▇▁▁
games.appearences 0 1.00 29.53 9.60 1.0 27.00 32.00 36.00 49.0 ▁▁▃▇▁
games.lineups 0 1.00 25.28 9.75 1.0 19.00 27.00 32.00 49.0 ▂▅▇▇▁
games.minutes 0 1.00 2205.56 834.30 59.0 1732.75 2339.50 2783.00 4335.0 ▂▃▇▆▁
games.rating 122 0.78 7.11 0.27 6.2 6.93 7.07 7.26 8.3 ▁▇▇▂▁
substitutes.in 0 1.00 4.24 4.40 0.0 1.00 3.00 6.00 26.0 ▇▂▁▁▁
substitutes.out 0 1.00 11.59 6.69 0.0 6.00 12.00 16.00 32.0 ▆▆▇▂▁
substitutes.bench 0 1.00 4.69 5.25 0.0 1.00 3.00 7.00 31.0 ▇▂▁▁▁
shots.total 124 0.78 52.82 25.61 1.0 38.00 52.00 67.00 162.0 ▂▇▃▁▁
shots.on 126 0.78 28.92 14.17 1.0 21.00 28.00 36.00 92.0 ▃▇▃▁▁
goals.total 0 1.00 11.65 5.57 2.0 8.00 11.00 14.00 43.0 ▇▇▂▁▁
goals.conceded 122 0.78 0.00 0.00 0.0 0.00 0.00 0.00 0.0 ▁▁▇▁▁
goals.assists 177 0.68 4.64 3.20 1.0 2.00 4.00 6.00 19.0 ▇▅▁▁▁
passes.total 122 0.78 639.12 356.61 3.0 409.75 624.50 856.75 2301.0 ▅▇▃▁▁
passes.key 130 0.77 34.81 21.90 1.0 19.00 31.00 46.00 142.0 ▇▆▂▁▁
passes.accuracy 122 0.78 16.07 8.51 2.0 10.00 14.00 20.00 68.0 ▇▅▁▁▁
tackles.total 131 0.77 20.79 15.47 1.0 10.00 17.00 29.00 101.0 ▇▃▁▁▁
tackles.blocks 219 0.61 2.87 1.97 1.0 1.00 2.00 4.00 10.0 ▇▃▂▁▁
tackles.interceptions 147 0.74 9.30 7.41 1.0 4.00 7.00 13.00 50.0 ▇▃▁▁▁
duels.total 122 0.78 280.39 148.71 1.0 191.00 275.50 364.00 897.0 ▃▇▃▁▁
duels.won 123 0.78 124.60 69.39 1.0 81.00 121.00 163.00 467.0 ▅▇▂▁▁
dribbles.attempts 127 0.77 51.97 37.55 1.0 24.00 43.00 72.00 209.0 ▇▆▂▁▁
dribbles.success 137 0.76 26.70 19.84 1.0 12.00 22.00 37.00 111.0 ▇▅▂▁▁
fouls.drawn 129 0.77 35.47 21.17 1.0 21.00 34.00 47.00 119.0 ▆▇▃▁▁
fouls.committed 127 0.77 30.28 18.00 1.0 17.00 29.00 41.00 104.0 ▇▇▃▁▁
cards.yellow 0 1.00 3.78 2.71 0.0 2.00 3.00 5.00 14.0 ▇▇▃▁▁
cards.yellowred 0 1.00 0.06 0.23 0.0 0.00 0.00 0.00 1.0 ▇▁▁▁▁
cards.red 0 1.00 0.09 0.31 0.0 0.00 0.00 0.00 3.0 ▇▁▁▁▁
penalty.scored 122 0.78 1.58 1.88 0.0 0.00 1.00 3.00 9.0 ▇▃▂▁▁
penalty.missed 122 0.78 0.37 0.67 0.0 0.00 0.00 1.00 4.0 ▇▂▁▁▁
player.id...48 0 1.00 39532.19 52257.87 25.0 8892.50 25118.50 45493.50 323800.0 ▇▁▁▁▁
player.age 0 1.00 29.26 4.32 20.0 26.00 29.00 33.00 42.0 ▃▇▇▃▁
id 0 1.00 10.48 5.76 1.0 5.75 10.00 15.00 20.0 ▇▇▇▇▇
top_scorers
# A tibble: 560 × 61
   player.id...1 team.id team.name         team.logo       league.id league.name
           <dbl>   <dbl> <chr>             <chr>               <dbl> <chr>      
 1             1      47 Tottenham         https://media-…        39 Premier Le…
 2             2      40 Liverpool         https://media-…        39 Premier Le…
 3             3      33 Manchester United https://media-…        39 Premier Le…
 4             4      47 Tottenham         https://media-…        39 Premier Le…
 5             5      40 Liverpool         https://media-…        39 Premier Le…
 6             6      50 Manchester City   https://media-…        39 Premier Le…
 7             7      40 Liverpool         https://media-…        39 Premier Le…
 8             8      46 Leicester         https://media-…        39 Premier Le…
 9             9      52 Crystal Palace    https://media-…        39 Premier Le…
10            10      50 Manchester City   https://media-…        39 Premier Le…
# ℹ 550 more rows
# ℹ 55 more variables: league.country <chr>, league.logo <chr>,
#   league.flag <chr>, league.season <dbl>, games.appearences <dbl>,
#   games.lineups <dbl>, games.minutes <dbl>, games.number <lgl>,
#   games.position <chr>, games.rating <dbl>, games.captain <lgl>,
#   substitutes.in <dbl>, substitutes.out <dbl>, substitutes.bench <dbl>,
#   shots.total <dbl>, shots.on <dbl>, goals.total <dbl>, …
# url <- "https://api-football-v1.p.rapidapi.com/v3/players/topscorers"

#  == LEAGUE IDs BELOW ==
# England League 39, 40, 41
# France League 1 61 2, 63
# Saudi Arabia Pro League 307
# Brazil Serie A 71, B 72, C 73, D 76
# Italy Serie A 135, B 136, C 137, D [426,427,428,429]
# Portugal 1 94, 2 95
# Argentina A 128, B 131
# Germany 78, 79, 80
# Spain 140, 141, 142
# USA 253
# Belgium 144, 145
# Netherlands 88, 89

# == DATA FETCH & SANITIZATION BELOW == 
# leagues <- c("39", "40", "41", "307", "61", "63", "71", "72", "73", "76", "135", "136", "137", "94", "95", "128", "131", "78", "79", "80", "140", "141", "253", "262", "144", "145", "88", "89")
# 
# queryString <- list(
#   league = "39",
#   season = "2021"
# )
# 
# all_leagues <- NULL
# 
# for (x in leagues) {
#   queryString <- list(
#     league = x,
#     season = "2021"
#   )
# 
#   # Verb snippet below provided by API docs
#   # NOTE: API key will incur rate-limits, so may not work at the time [FREE Limitations]
#   # The key is read from an environment variable instead of being hard-coded (RAPIDAPI_KEY is an assumed name)
#   response <- VERB("GET", url, add_headers("X-RapidAPI-Key" = Sys.getenv("RAPIDAPI_KEY"), "X-RapidAPI-Host" = "api-football-v1.p.rapidapi.com"), query = queryString, content_type("application/json"))
# 
#   text_res <- content(response, "text")
# 
#   player_info <- fromJSON(text_res, flatten = TRUE)
# 
#   player_tibble <- as_tibble(player_info$response)
# 
#   player_tibble <- player_tibble |>
#     mutate(id = row_number()) |>
#     unnest(statistics)
# 
#   print(x)
#   all_leagues <- bind_rows(all_leagues, player_tibble)
# }
# 
# # write_csv() writes comma-separated output, which read_csv() can read back
# write_csv(all_leagues, file = "./data/top_goalscorers.csv")