Project proposal

Author

dank-cheugy

library(tidyverse)

Dataset

The dataset, “120 Years of Olympic history: athletes and results,” is publicly available on Kaggle, where it was posted by rgriffin. According to its Collection Methodology notes, the data were scraped from www.sports-reference.com in May 2018 and cover all Olympic Games from Athens 1896 to Rio 2016. For each athlete, the data record name, sex, age, height, weight, team, National Olympic Committee (NOC) 3-letter code, Games, year, season, host city, sport, event, and medal (gold, silver, or bronze). One important note: the Winter and Summer Games were held in the same year through 1992; after that they were staggered, with the Winter Games on a four-year cycle starting in 1994 and the Summer Games on a four-year cycle starting in 1996. We chose this dataset because it lends itself to questions about how the Olympics have evolved over time, including changes in participation and the inclusion of different genders, nations, and sports. The file contains 271,116 rows and 15 columns. Each row corresponds to one athlete, identified by the variable ID, competing in one event, so an athlete who enters several events appears in several rows. The specific columns are as follows:

ID: Unique number for each athlete
Name: Athlete’s name
Sex: M or F (Male or Female)
Age: Age of athlete (integer)
Height: In centimeters (cm)
Weight: In kilograms (kg)
Team: Team’s name
NOC: National Olympic Committee 3-letter code
Games: Year and season of the Games
Year: Year in which the athlete participated in the Olympics
Season: Summer or Winter Olympics
City: Host city
Sport: The sport played
Event: The specific event within the sport
Medal: Gold, Silver, Bronze, or NA

olympic_data <- read_csv("data/athlete_events.csv")
Rows: 271116 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
dbl  (5): ID, Age, Height, Weight, Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(olympic_data)
# A tibble: 6 × 15
     ID Name      Sex     Age Height Weight Team  NOC   Games  Year Season City 
  <dbl> <chr>     <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
1     1 A Dijiang M        24    180     80 China CHN   1992…  1992 Summer Barc…
2     2 A Lamusi  M        23    170     60 China CHN   2012…  2012 Summer Lond…
3     3 Gunnar N… M        24     NA     NA Denm… DEN   1920…  1920 Summer Antw…
4     4 Edgar Li… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris
5     5 Christin… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
6     5 Christin… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
# ℹ 3 more variables: Sport <chr>, Event <chr>, Medal <chr>

Questions

Question 1: How has gender representation changed over time across different sports, and how does it compare between Summer and Winter Olympics?

Question 2: Does an athlete’s size affect how often they win medals? How does the relationship between an athlete’s size and medal success vary across different sports?

Analysis plan

Question 1: Using a line plot, we will visualize the trend in female participation over time, comparing the Summer and Winter Olympics. The dataset includes year, sex, sport, and season, which lets us track changes in gender representation. Plotting the proportion of female athletes over time will reveal key shifts in inclusion. We can also use stacked bar charts to show the gender distribution in each Olympic year, and heat maps to highlight when specific sports introduced female participation, comparing the findings for each season.
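As a rough sketch of the line plot described for Question 1, the proportion of female athletes per Games could be computed as below. The helper name female_share and the toy rows are hypothetical and only mimic the columns of athlete_events.csv; counting each athlete once per Games (rather than once per event) is our assumption, not a settled part of the plan.

```r
library(tidyverse)

# Hypothetical helper: share of female athletes per Year and Season.
# Assumption: each athlete counts once per Games, so repeated rows for
# the same ID (one per event) are collapsed with distinct() first.
female_share <- function(data) {
  data %>%
    distinct(ID, Sex, Year, Season) %>%
    group_by(Year, Season) %>%
    summarise(prop_female = mean(Sex == "F"), .groups = "drop") %>%
    arrange(Year, Season)
}

# Made-up rows with the same column names as the real data
toy <- tibble(
  ID     = c(1, 1, 2, 3, 4, 5, 6),
  Sex    = c("F", "F", "M", "M", "F", "M", "F"),
  Year   = c(2000, 2000, 2000, 2000, 2002, 2002, 2004),
  Season = c("Summer", "Summer", "Summer", "Summer",
             "Winter", "Winter", "Summer")
)

share <- female_share(toy)

# One line per season, proportion of female athletes on the y-axis
p <- ggplot(share, aes(x = Year, y = prop_female, color = Season)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Proportion of female athletes", color = "Season")
```

With the real data, female_share(olympic_data) would replace the toy call; adding Sport to the distinct() and group_by() calls would give the per-sport breakdown.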

Question 2: The dataset includes height and weight variables, which can be analyzed separately or combined into Body Mass Index (BMI), defined as weight in kilograms divided by the square of height in meters. Using the medal variable, we will plot athletes’ size against their medal success in a scatter plot, with dot color representing medal type (gold, silver, bronze). A scatter plot suits this question because it lets us see whether size is associated with the number or type of medals won. We can also group athletes by gender and by sport, which shows whether size matters across all athletes or only within specific subgroups. We will measure medal success by the number of medals won, identify which athletes have earned the highest-ranked medals over time, and examine which body sizes are most associated with that success. Because the data go back to 1896, height and weight are missing for some athletes, especially in the early Games, which limits our analysis. To address this, we will exclude athletes with missing height or weight, so that our results rest only on recorded measurements.
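The BMI calculation and the missing-value exclusion for Question 2 might look like the sketch below. The helper name add_bmi, the won_medal indicator, and the toy rows are hypothetical; the only assumptions about the real data are that Height is in centimeters, Weight is in kilograms, and Medal is NA when no medal was won, as in the column list.

```r
library(tidyverse)

# Hypothetical helper: drop rows with missing Height or Weight, then add
# BMI (kg / m^2) and a logical flag for whether the row carries a medal.
add_bmi <- function(data) {
  data %>%
    filter(!is.na(Height), !is.na(Weight)) %>%
    mutate(
      BMI = Weight / (Height / 100)^2,  # Height converted from cm to m
      won_medal = !is.na(Medal)
    )
}

# Made-up rows with the same column names as the real data
toy <- tibble(
  Height = c(180, NA, 170),
  Weight = c(81, 70, 57.8),
  Medal  = c("Gold", NA, NA)
)

sized <- add_bmi(toy)
```

The second toy row is dropped because its Height is missing; the remaining rows get BMI values computed from their recorded measurements.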

Across both research questions, we plan to consider historical events that may have affected the data and caused abrupt changes in athlete numbers from one Games to the next. For instance, Germany and Japan were not invited to the 1948 London Games, the first Olympic Games after WWII. As a result, our dataset should contain no athletes from these countries for 1948. We will take other historical events into account as well.
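A quick way to check such historical gaps is to count athletes per NOC for a given year. The sketch below is illustrative only: the helper name athletes_by_noc and the toy rows are hypothetical, and the NOC codes "GER" and "JPN" are assumed to be the ones used for Germany and Japan in the dataset.

```r
library(tidyverse)

# Hypothetical helper: number of distinct athletes per NOC in one year.
# An NOC with no athletes that year simply does not appear in the result.
athletes_by_noc <- function(data, nocs, year) {
  data %>%
    filter(Year == year, NOC %in% nocs) %>%
    distinct(NOC, ID) %>%
    count(NOC, name = "n_athletes")
}

# Made-up rows with the same column names as the real data
toy <- tibble(
  ID   = c(1, 2, 3),
  NOC  = c("GER", "JPN", "USA"),
  Year = c(1936, 1936, 1948)
)

absent_1948  <- athletes_by_noc(toy, c("GER", "JPN"), 1948)
present_1936 <- athletes_by_noc(toy, c("GER", "JPN"), 1936)
```

Running athletes_by_noc(olympic_data, c("GER", "JPN"), 1948) on the real data would return zero rows if neither country has athletes recorded for that year.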