Project proposal

Author

Dank Facts

library(tidyverse)

olympics <- read_csv('data/olympics.csv')

Dataset

A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.

Make sure to load the data and use inline code for some of this information.

This dataset is from https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results/. This data set includes athlete data from Olympic games ranging from the year 1896 to 2016. According to the creator of this dataset, the dataset was consolidated using Olympic data from www.sports-reference.com.

olympics |>
  distinct(year) |>
  arrange(year)
# A tibble: 35 × 1
    year
   <dbl>
 1  1896
 2  1900
 3  1904
 4  1906
 5  1908
 6  1912
 7  1920
 8  1924
 9  1928
10  1932
# ℹ 25 more rows

The summer and winter Olympics were held in the same years until 1992 when they split into alternating every two years. After that, they staggered them such that the Winter Games occurred on a four-year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on.

The dimensions are 271,116 observations with 15 variables. The 15 variables accounted for include a unique ID for each athlete, the name of the athlete, the sex of the athlete, the age of the athlete, the height of the athlete in cm, and the weight of the athlete in kgs, the team they represented, the national Olympic committee, the games they competed in, the year they competed, the season they competed in, the city the games were held in, the sport the athlete played, the event they were in, and the medal they received. If the athlete did not medal, they received an NA value for the final column.

glimpse(olympics)
Rows: 271,116
Columns: 15
$ id     <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, …
$ name   <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
$ sex    <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M"…
$ age    <dbl> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 33,…
$ height <dbl> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188, 188, …
$ weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75, 75, 75,…
$ team   <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands", "…
$ noc    <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "NED", …
$ games  <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
$ year   <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 199…
$ season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
$ city   <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Calgar…
$ sport  <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating"…
$ event  <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweight", …
$ medal  <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

In this dataset of 15 columns, there are only 4 columns with NA values. Those are age, height, weight, and medal. Based on the questions we planned to answer, we will only be using one of the four variables, which is the medal. The medal column has four values: Gold, Silver, Bronze, or NA. We will filter out the NA rows and compare the rows with Gold, Silver, or Bronze only. Since we plan to filter out rows with NA values in this column, the missing data will not affect our analysis.

na_counts_base <- colSums(is.na(olympics))
print(na_counts_base)
    id   name    sex    age height weight   team    noc  games   year season 
     0      0      0   9474  60171  62875      0      0      0      0      0 
  city  sport  event  medal 
     0      0      0 231333 

To address countries with name changes like Soviet Union (URS) and Russia, we plan on standardizing country names by mapping old country names to modern ones like URS = Russia. We will do this by comparing all unique nocs to the most recent nocs (which on is 2016 in this dataset). Then replace old nocs with modern ones.

# step 1: get all noc
unique(olympics$noc)
  [1] "CHN" "DEN" "NED" "USA" "FIN" "NOR" "ROU" "EST" "FRA" "MAR" "ESP" "EGY"
 [13] "IRI" "BUL" "ITA" "CHA" "AZE" "SUD" "RUS" "ARG" "CUB" "BLR" "GRE" "CMR"
 [25] "TUR" "CHI" "MEX" "URS" "NCA" "HUN" "NGR" "ALG" "KUW" "BRN" "PAK" "IRQ"
 [37] "UAR" "LIB" "QAT" "MAS" "GER" "CAN" "IRL" "AUS" "RSA" "ERI" "TAN" "JOR"
 [49] "TUN" "LBA" "BEL" "DJI" "PLE" "COM" "KAZ" "BRU" "IND" "KSA" "SYR" "MDV"
 [61] "ETH" "UAE" "YAR" "INA" "PHI" "SGP" "UZB" "KGZ" "TJK" "EUN" "JPN" "CGO"
 [73] "SUI" "BRA" "FRG" "GDR" "MON" "ISR" "URU" "SWE" "ISV" "SRI" "ARM" "CIV"
 [85] "KEN" "BEN" "UKR" "GBR" "GHA" "SOM" "LAT" "NIG" "MLI" "AFG" "POL" "CRC"
 [97] "PAN" "GEO" "SLO" "CRO" "GUY" "NZL" "POR" "PAR" "ANG" "VEN" "COL" "BAN"
[109] "PER" "ESA" "PUR" "UGA" "HON" "ECU" "TKM" "MRI" "SEY" "TCH" "LUX" "MTN"
[121] "CZE" "SKN" "TTO" "DOM" "VIN" "JAM" "LBR" "SUR" "NEP" "MGL" "AUT" "PLW"
[133] "LTU" "TOG" "NAM" "AHO" "ISL" "ASA" "SAM" "RWA" "DMA" "HAI" "MLT" "CYP"
[145] "GUI" "BIZ" "YMD" "KOR" "THA" "BER" "ANZ" "SCG" "SLE" "PNG" "YEM" "IOA"
[157] "OMA" "FIJ" "VAN" "MDA" "YUG" "BAH" "GUA" "SRB" "IVB" "MOZ" "CAF" "MAD"
[169] "MAL" "BIH" "GUM" "CAY" "SVK" "BAR" "GBS" "TLS" "COD" "GAB" "SMR" "LAO"
[181] "BOT" "ROT" "CAM" "PRK" "SOL" "SEN" "CPV" "CRT" "GEQ" "BOL" "SAA" "AND"
[193] "ANT" "ZIM" "GRN" "HKG" "LCA" "FSM" "MYA" "MAW" "ZAM" "RHO" "TPE" "STP"
[205] "MKD" "BOH" "TGA" "LIE" "MNE" "GAM" "COK" "ALB" "WIF" "SWZ" "BUR" "NBO"
[217] "BDI" "ARU" "NRU" "VNM" "VIE" "BHU" "MHL" "KIR" "UNK" "TUV" "NFL" "KOS"
[229] "SSD" "LES"
# step 2: 2016 noc
noc_2016 <- olympics %>%
  filter(year == 2016) %>%
  select(noc) %>%
  distinct()
print(noc_2016)
# A tibble: 207 × 1
   noc  
   <chr>
 1 ROU  
 2 ESP  
 3 ITA  
 4 AZE  
 5 FRA  
 6 ALG  
 7 BRN  
 8 NED  
 9 IRQ  
10 IRL  
# ℹ 197 more rows
# step 3: compare
# ex: 
country_mapping <- c(
  "URS" = "Russia")

We chose this dataset because we are all interested in the Olympic games. In particular, we were interested in how the participation of women in the Olympic games changed throughout the years as the 2024 Olympics was the first Olympic where there was an equal number of men and women participants. We are also interested in seeing how trends in countries’ medals have changed over time as now more smaller countries are demonstrating medal growth. These changes could be due to progress in gender equality in sports and increased support and funding in sports in small nations.

Questions

The two questions you want to answer.

Question #1: How has the representation of male and female athletes changed over time?

Question #2: Which regions have shown the most improvement in medal counts over time?

Analysis plan

A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

To analyze changes in the gender representation of Olympic athletes over time, we will utilize the games (winter/summer competition), year, sex, and sport variables. We will group the data by each of these to create an athlete_count for each sport by competition (year/season combination) and gender. Our visualizations will depict the overall changes in gender representation over time as well as the changes for individual sports, with the goal of identifying competitions that have been particularly inclusive or non-inclusive.

For our second goal of showing performance changes by region, we will utilize the year, games, team, and medal variables. Because countries send varying numbers of athletes to each Olympics competition, we will create a medal_rate variable that shows the proportion of athletes from each country, within a region, that win medals overall, broken down by games. We will also create a medal_rate_typed that further breaks down performance by the type of medal (gold, silver, and bronze). Finally, we will create a medal_growth variable that shows the difference in earned medals by country from game to game. Visualizations for this question will depict medal_rate, medal_rate_typed, and medal_growth over time for each country in a given region. We will then use these visualizations to identify consistent overperformers and underperformers. We will also incorporate the sport variable to search for “dynasties” where countries performed particularly well for set periods of time in specific events.