Assessing Top Soccer Player Statistics

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.
- The World Bank
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- Various sources from various years, especially DHS (Demographic and Health Surveys), I2D2, MICS (UNICEF), and IHSDHS (World Bank). The data in the .xlsx file is a combination of all of these sources.
Write a brief description of the observations.
- Each observation is a country in a given survey year, so trends in a country's education statistics can be followed across the years in which data was gathered.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Which countries experienced the most drastic shift in educational attainment among females and the lower classes? Which countries did not experience much of a shift?
- Which countries currently have the greatest educational disparity by gender, place of residence, or wealth?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- Research statement: Understanding differences in education levels among different socioeconomic sectors and years in countries.
- Hypothesis: Countries considered “first world”, such as the US, Canada, and the UK, will have the highest education levels across all socioeconomic statuses, with little change over the years. Countries considered “third world”, such as India, may show an increase in education levels over the years, but not to the level of the first world countries.
Identify the types of variables in your research question. Categorical? Quantitative?
- These are quantitative variables.
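
One possible way to approach the first research question is to compare each country's earliest and latest survey year. The sketch below is a minimal illustration on made-up values, not an analysis of the real file. It assumes the `country`, `year`, and `aFemale_1` columns shown in the column specification, and it assumes `aFemale_1` is a proportion-valued indicator for females; the toy countries and numbers are hypothetical.

```r
library(dplyr)

# Hypothetical toy rows standing in for the real `education` data;
# only the column names (country, year, aFemale_1) come from the file.
education_toy <- tibble(
  country   = c("A", "A", "A", "B", "B"),
  year      = c(2000, 2005, 2010, 2001, 2011),
  aFemale_1 = c(0.40, 0.55, 0.70, 0.90, 0.92)
)

# Change in the indicator between each country's first and last survey year,
# sorted so the largest absolute shifts come first
female_shift <- education_toy |>
  filter(!is.na(aFemale_1)) |>
  arrange(country, year) |>
  group_by(country) |>
  summarise(
    first_year = first(year),
    last_year  = last(year),
    change     = last(aFemale_1) - first(aFemale_1),
    .groups    = "drop"
  ) |>
  arrange(desc(abs(change)))

female_shift
```

Sorting by the absolute change puts the countries with the most drastic shift at the top and those with little change at the bottom, which maps onto both halves of the question.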

Glimpse of data

Note: there are too many columns (4816) for skim to work quickly and efficiently, so only three columns are used to show that we used the skim function.

education <- read_csv("data/edu.csv")

Rows: 626 Columns: 4816
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (19): country, stype, ccode, datasrc, cname, ccode3, countryname, coun...
dbl (4797): year, aAll_1, aUrban_1, aRural_1, aMale_1, aFemale_1, aMalUrb_1,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(education, "aAll_1", "aUrban_1", "aRural_1")

Data summary
Name	education
Number of rows	626
Number of columns	4816
_______________________
Column type frequency:
numeric	3
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
aAll_1	0	1.00	0.87	0.16	0.25	0.83	0.95	0.98	1	▁▁▁▂▇
aUrban_1	10	0.98	0.93	0.09	0.54	0.92	0.97	0.99	1	▁▁▁▂▇
aRural_1	10	0.98	0.84	0.20	0.08	0.78	0.93	0.98	1	▁▁▁▂▇

Data 2

Introduction and data

Identify the source of the data.
- The World Health Organization Global Health Expenditure Database
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- Member states who report the data to WHO. WHO can make estimations to fill any gaps. Data is updated yearly with a two-year lag.
Write a brief description of the observations.
- Every observation is a country/year combination, with corresponding variables on expenditures by disease category, such as infectious/parasitic diseases, NCDs, reproductive health, or nutritional deficiencies.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- What are the priorities of different countries/different regions of countries, based on their percentage of health expenditures of various categories of disease?
- Do high and low income countries have the same health priorities or different ones?
- How do health priorities compare by region?
- By joining with another dataset, we may be able to answer whether increased spending has a relationship with less disease.
A description of the research topic along with a concise statement of your hypotheses on this topic.
- The research topic is based on various expenditures by country and year, and how they relate to a variety of diseases. Our hypothesis is framed as a question: how does the distribution of a country's health expenditures, and the resulting out-of-pocket expenses, correlate with various diseases and conditions?
Identify the types of variables in your research question. Categorical? Quantitative?
- Country, region, income, and disease category are categorical. Year can be either, depending on how it is used. All other variables are quantitative.
- Some other quantitative variables we could use are listed below; most represent expenditures per capita in USD.
  - che (current health expenditure in millions)
  - che_gdp (current health expenditure as a percent of the nation’s GDP)
  - che_pc_usd (current health expenditure per capita)
  - gghed_pc_usd (domestic general government health expenditure per capita)
  - pvtd_pc_usd (domestic private health expenditure per capita)
  - oop_pc_usd (out of pocket expenditure per capita)
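
As a rough illustration of how the income-group comparison could be set up, the sketch below computes the out-of-pocket share of current health expenditure per capita and summarises it by income group. It runs on a few made-up rows; the column names `income`, `che_pc_usd`, and `oop_pc_usd` come from the dataset, while the ratio `oop_share`, the income labels, and all values are assumptions for the example.

```r
library(dplyr)

# Hypothetical toy rows; only the column names mirror the GHED data
health_toy <- tibble(
  country    = c("A", "B", "C", "D"),
  income     = c("Low", "Low", "High", "High"),
  che_pc_usd = c(50, 100, 4000, 5000),
  oop_pc_usd = c(25, 40, 400, 1000)
)

# Share of per-capita health spending paid out of pocket,
# summarised by income group
oop_by_income <- health_toy |>
  mutate(oop_share = oop_pc_usd / che_pc_usd) |>
  group_by(income) |>
  summarise(
    n_obs            = n(),
    median_oop_share = median(oop_share, na.rm = TRUE),
    .groups          = "drop"
  )

oop_by_income
```

The same pattern would work with `region` as the grouping variable for the regional comparison.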

Glimpse of data

library("readxl")
health <- read_excel("data/GHED_data.XLSX", sheet = "Data")

skim(health, "country", "region", "income", "che", "che_gdp", "che_pc_usd", "gghed_pc_usd", "pvtd_pc_usd", "oop_pc_usd")

Data summary
Name	health
Number of rows	4224
Number of columns	3220
_______________________
Column type frequency:
character	3
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
country	1	4	34	192
region	1	3	4	6
income	1	3	12	4

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
che	243	0.94	6597770.22	66623331.23	0.06	1656.00	26441.40	236615.31	2.102209e+09	▇▁▁▁▁
che_gdp	244	0.94	6.24	2.98	1.26	4.15	5.67	7.93	5.018000e+01	▇▁▁▁▁
che_pc_usd	243	0.94	917.10	1593.17	4.45	63.14	259.03	867.06	1.170241e+04	▇▁▁▁▁
gghed_pc_usd	252	0.94	615.95	1152.05	0.15	21.97	130.38	545.15	7.857190e+03	▇▁▁▁▁
pvtd_pc_usd	254	0.94	278.20	564.90	0.91	23.95	87.47	294.01	6.788900e+03	▇▁▁▁▁
oop_pc_usd	254	0.94	182.30	271.12	0.09	19.56	69.39	225.47	2.761980e+03	▇▁▁▁▁

Data 3

Introduction and data

Identify the source of the data.

Source: https://www.api-football.com/documentation-v3

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data comes from the API documented at https://www.api-football.com/documentation-v3. We collected it by making GET requests to the top scorers endpoint, focusing on 27 of the top leagues in 2021. As the code snippet below shows, we make one API call per league to fetch that league's top scorers. The API returns the top 20 goal scorers for the league and season specified in the query string parameters.

Write a brief description of the observations.

Each observation represents one of the top 20 goal scorers in their league for the 2021 season. For each player we can see background information such as height, weight, and nationality. We also have many statistics for the player in that season, such as passes, goals scored, assists, and penalties.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Is there a correlation between height and/or weight and specific game statistics (including goals scored, assists, penalties, passes, etc.)?

As a sub question: Do taller players complete a larger percentage of passes?
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?
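
A first pass at the height question could look like the sketch below. In the data, `player.height` and `player.weight` are character columns; the example assumes they are formatted like "180 cm" and "75 kg", which is a guess from their fixed string lengths and should be checked against the file. The rows are made up for illustration, and the new columns `height_cm` and `weight_kg` are hypothetical names.

```r
library(dplyr)

# Hypothetical toy rows; the "cm"/"kg" suffix format is an assumption
scorers_toy <- tibble(
  player.height = c("170 cm", "180 cm", "185 cm", "190 cm", NA),
  player.weight = c("65 kg", "75 kg", "80 kg", "88 kg", "70 kg"),
  goals.total   = c(8, 10, 12, 15, 9)
)

# Strip the unit suffix and convert to numeric; missing values stay NA
scorers_num <- scorers_toy |>
  mutate(
    height_cm = as.numeric(sub(" cm$", "", player.height)),
    weight_kg = as.numeric(sub(" kg$", "", player.weight))
  )

# Pearson correlation between height and goals, dropping incomplete rows
height_goals_cor <- cor(
  scorers_num$height_cm,
  scorers_num$goals.total,
  use = "complete.obs"
)

height_goals_cor
```

Swapping `goals.total` for another statistic, such as `passes.total` or `goals.assists`, gives the corresponding correlation for that measure.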

Glimpse of data

library(httr)
library(jsonlite)


Attaching package: 'jsonlite'

The following object is masked from 'package:purrr':

    flatten

top_scorers <- read_csv("./data/top_goalscorers.csv")

New names:
• `player.id` -> `player.id...1`
• `player.id` -> `player.id...48`

Rows: 560 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): team.name, team.logo, league.name, league.country, league.logo, l...
dbl  (36): player.id...1, team.id, league.id, league.season, games.appearenc...
lgl   (8): games.number, games.captain, goals.saves, dribbles.past, penalty....
date  (1): player.birth.date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(top_scorers)

Data summary
Name	top_scorers
Number of rows	560
Number of columns	61
_______________________
Column type frequency:
character	16
Date	1
logical	8
numeric	36
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique	whitespace
team.name	0	1.00	3	27	355	0
team.logo	0	1.00	51	54	455	0
league.name	0	1.00	7	26	26	0
league.country	0	1.00	3	12	13	0
league.logo	0	1.00	53	54	71	0
league.flag	0	1.00	42	42	36	0
games.position	0	1.00	8	10	3	0
player.name	0	1.00	3	32	544	0
player.firstname	0	1.00	3	30	504	0
player.lastname	0	1.00	2	27	535	0
player.nationality	0	1.00	3	22	68	0
player.height	34	0.94	6	6	35	0
player.weight	57	0.90	5	5	40	0
player.photo	0	1.00	53	57	555	0
player.birth.place	42	0.92	1	26	405	1
player.birth.country	0	1.00	3	22	63	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
player.birth.date	0	1	1981-04-06	2003-06-13	1994-07-26	522

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
games.number	560	0	NaN	:
games.captain	0	1	0	FAL: 560
goals.saves	560	0	NaN	:
dribbles.past	560	0	NaN	:
penalty.won	560	0	NaN	:
penalty.commited	560	0	NaN	:
penalty.saved	560	0	NaN	:
player.injured	0	1	0	FAL: 560

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
player.id...1	0	1.00	280.50	161.80	1.0	140.75	280.50	420.25	560.0	▇▇▇▇▇
team.id	0	1.00	1377.56	2288.07	33.0	178.75	504.00	1361.75	15130.0	▇▁▁▁▁
league.id	0	1.00	114.21	64.97	39.0	72.75	91.50	137.75	307.0	▇▆▁▁▁
league.season	0	1.00	2021.00	0.00	2021.0	2021.00	2021.00	2021.00	2021.0	▁▁▇▁▁
games.appearences	0	1.00	29.53	9.60	1.0	27.00	32.00	36.00	49.0	▁▁▃▇▁
games.lineups	0	1.00	25.28	9.75	1.0	19.00	27.00	32.00	49.0	▂▅▇▇▁
games.minutes	0	1.00	2205.56	834.30	59.0	1732.75	2339.50	2783.00	4335.0	▂▃▇▆▁
games.rating	122	0.78	7.11	0.27	6.2	6.93	7.07	7.26	8.3	▁▇▇▂▁
substitutes.in	0	1.00	4.24	4.40	0.0	1.00	3.00	6.00	26.0	▇▂▁▁▁
substitutes.out	0	1.00	11.59	6.69	0.0	6.00	12.00	16.00	32.0	▆▆▇▂▁
substitutes.bench	0	1.00	4.69	5.25	0.0	1.00	3.00	7.00	31.0	▇▂▁▁▁
shots.total	124	0.78	52.82	25.61	1.0	38.00	52.00	67.00	162.0	▂▇▃▁▁
shots.on	126	0.78	28.92	14.17	1.0	21.00	28.00	36.00	92.0	▃▇▃▁▁
goals.total	0	1.00	11.65	5.57	2.0	8.00	11.00	14.00	43.0	▇▇▂▁▁
goals.conceded	122	0.78	0.00	0.00	0.0	0.00	0.00	0.00	0.0	▁▁▇▁▁
goals.assists	177	0.68	4.64	3.20	1.0	2.00	4.00	6.00	19.0	▇▅▁▁▁
passes.total	122	0.78	639.12	356.61	3.0	409.75	624.50	856.75	2301.0	▅▇▃▁▁
passes.key	130	0.77	34.81	21.90	1.0	19.00	31.00	46.00	142.0	▇▆▂▁▁
passes.accuracy	122	0.78	16.07	8.51	2.0	10.00	14.00	20.00	68.0	▇▅▁▁▁
tackles.total	131	0.77	20.79	15.47	1.0	10.00	17.00	29.00	101.0	▇▃▁▁▁
tackles.blocks	219	0.61	2.87	1.97	1.0	1.00	2.00	4.00	10.0	▇▃▂▁▁
tackles.interceptions	147	0.74	9.30	7.41	1.0	4.00	7.00	13.00	50.0	▇▃▁▁▁
duels.total	122	0.78	280.39	148.71	1.0	191.00	275.50	364.00	897.0	▃▇▃▁▁
duels.won	123	0.78	124.60	69.39	1.0	81.00	121.00	163.00	467.0	▅▇▂▁▁
dribbles.attempts	127	0.77	51.97	37.55	1.0	24.00	43.00	72.00	209.0	▇▆▂▁▁
dribbles.success	137	0.76	26.70	19.84	1.0	12.00	22.00	37.00	111.0	▇▅▂▁▁
fouls.drawn	129	0.77	35.47	21.17	1.0	21.00	34.00	47.00	119.0	▆▇▃▁▁
fouls.committed	127	0.77	30.28	18.00	1.0	17.00	29.00	41.00	104.0	▇▇▃▁▁
cards.yellow	0	1.00	3.78	2.71	0.0	2.00	3.00	5.00	14.0	▇▇▃▁▁
cards.yellowred	0	1.00	0.06	0.23	0.0	0.00	0.00	0.00	1.0	▇▁▁▁▁
cards.red	0	1.00	0.09	0.31	0.0	0.00	0.00	0.00	3.0	▇▁▁▁▁
penalty.scored	122	0.78	1.58	1.88	0.0	0.00	1.00	3.00	9.0	▇▃▂▁▁
penalty.missed	122	0.78	0.37	0.67	0.0	0.00	0.00	1.00	4.0	▇▂▁▁▁
player.id...48	0	1.00	39532.19	52257.87	25.0	8892.50	25118.50	45493.50	323800.0	▇▁▁▁▁
player.age	0	1.00	29.26	4.32	20.0	26.00	29.00	33.00	42.0	▃▇▇▃▁
id	0	1.00	10.48	5.76	1.0	5.75	10.00	15.00	20.0	▇▇▇▇▇

top_scorers

# A tibble: 560 × 61
   player.id...1 team.id team.name         team.logo       league.id league.name
           <dbl>   <dbl> <chr>             <chr>               <dbl> <chr>      
 1             1      47 Tottenham         https://media-…        39 Premier Le…
 2             2      40 Liverpool         https://media-…        39 Premier Le…
 3             3      33 Manchester United https://media-…        39 Premier Le…
 4             4      47 Tottenham         https://media-…        39 Premier Le…
 5             5      40 Liverpool         https://media-…        39 Premier Le…
 6             6      50 Manchester City   https://media-…        39 Premier Le…
 7             7      40 Liverpool         https://media-…        39 Premier Le…
 8             8      46 Leicester         https://media-…        39 Premier Le…
 9             9      52 Crystal Palace    https://media-…        39 Premier Le…
10            10      50 Manchester City   https://media-…        39 Premier Le…
# ℹ 550 more rows
# ℹ 55 more variables: league.country <chr>, league.logo <chr>,
#   league.flag <chr>, league.season <dbl>, games.appearences <dbl>,
#   games.lineups <dbl>, games.minutes <dbl>, games.number <lgl>,
#   games.position <chr>, games.rating <dbl>, games.captain <lgl>,
#   substitutes.in <dbl>, substitutes.out <dbl>, substitutes.bench <dbl>,
#   shots.total <dbl>, shots.on <dbl>, goals.total <dbl>, …

# url <- "https://api-football-v1.p.rapidapi.com/v3/players/topscorers"

#  == LEAGUE IDs BELOW ==
# England League 39, 40, 41
# France League 1 61, 2 63
# Saudi Arabia Pro League 307
# Brazil Serie A 71, B 72, C 73, D 76
# Italy Serie A 135, B 136, C 137, D [426,427,428,429]
# Portugal 1 94, 2 95
# Argentina A 128, B 131
# Germany 78, 79, 80
# Spain 140, 141, 142
# USA 254
# Belgium 144, 145
# Netherlands 88, 89

# == DATA FETCH & SANITIZATION BELOW == 
# leagues <- c("39", "40", "41", "307", "61", "63", "71", "72", "73", "76", "135", "136", "137", "94", "95", "128", "131", "78", "79", "80", "140", "141", "253", "262", "144", "145", "88", "89")
# 
# # Accumulates one tibble of top scorers per league
# all_leagues <- NULL
# 
# for (x in leagues) {
#   queryString <- list(
#     league = x,
#     season = "2021"
#   )
# 
#   # VERB snippet below adapted from the API docs
#   # NOTE: the free API tier is rate-limited, so requests may fail at times
#   # The key is read from an environment variable (the name RAPIDAPI_KEY is our choice) instead of being hard-coded
#   response <- VERB("GET", url, add_headers("X-RapidAPI-Key" = Sys.getenv("RAPIDAPI_KEY"), "X-RapidAPI-Host" = "api-football-v1.p.rapidapi.com"), query = queryString, content_type("application/json"))
# 
#   text_res <- content(response, "text")
# 
#   player_info <- fromJSON(text_res, flatten = TRUE)
# 
#   player_tibble <- as_tibble(player_info$response)
# 
#   player_tibble <- player_tibble |>
#     mutate(id = row_number()) |>
#     unnest(statistics)
# 
#   print(x)
#   all_leagues <- bind_rows(all_leagues, player_tibble)
# }
# 
# # write_csv() writes a comma-delimited file, matching the read_csv() call above
# write_csv(all_leagues, file = "./data/top_goalscorers.csv")