Project proposal

Author

dank-clapback (Arya Ramkumar, Dylan Retino, Jolene Ie, Rachel Liu)

library(tidyverse)
athlete_data <- read_csv("data/athlete_events.csv")

num_rows <- nrow(athlete_data)
num_cols <- ncol(athlete_data)
glimpse(athlete_data)
Rows: 271,116
Columns: 15
$ ID     <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, …
$ Name   <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
$ Sex    <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M"…
$ Age    <dbl> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 33,…
$ Height <dbl> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188, 188, …
$ Weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75, 75, 75,…
$ Team   <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands", "…
$ NOC    <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "NED", …
$ Games  <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
$ Year   <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 199…
$ Season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
$ City   <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Calgar…
$ Sport  <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating"…
$ Event  <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweight", …
$ Medal  <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Dataset

A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.

Make sure to load the data and use inline code for some of this information.

The dataset we chose was published on Kaggle by RGriffin and contains information about the past 120 years of Olympic history, from the 1896 Athens Games to the 2016 Rio Games. Each row records one athlete’s entry in one event, along with basic bio data such as the athlete’s name, sex, age, height, and weight. The dataset contains 271,116 rows and 15 columns, with numerical variables (e.g., Year, Age, Height, Weight; ID is also stored as a number but serves as an identifier) and categorical variables (e.g., City, Sport, Medal). Some rows are missing height or weight, likely because these measurements were not always recorded in earlier Olympic Games. The Medal column is also NA in many rows, which indicates that the athlete did not win a medal in that event rather than a gap in the records.
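As a rough illustration of how the missing values described above could be checked, the sketch below counts NA values per column and the share of missing heights by year. It runs on a small made-up tibble, `toy_athletes`, that only borrows the column names from the real data; the same calls should apply to `athlete_data`, but the numbers here are purely illustrative.

```r
library(dplyr)

# Toy stand-in for athlete_data: same column names, made-up values.
toy_athletes <- tibble(
  ID     = c(1, 2, 3, 4),
  Year   = c(1900, 1920, 1992, 2012),
  Height = c(NA, NA, 180, 170),
  Weight = c(NA, 70, 80, 60),
  Medal  = c("Gold", NA, NA, NA)
)

# Number of NA values in each column of interest.
na_counts <- toy_athletes %>%
  summarise(across(c(Height, Weight, Medal), ~ sum(is.na(.x))))

# Share of rows with a missing Height, by Games year.
height_na_by_year <- toy_athletes %>%
  group_by(Year) %>%
  summarise(prop_missing_height = mean(is.na(Height)), .groups = "drop")

print(na_counts)
print(height_na_by_year)
```

If the record-keeping explanation holds, `prop_missing_height` computed on the real data should be noticeably higher in the earliest years.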

We chose this dataset largely because of the excitement following the Super Bowl and the general buzz that comes with any major sporting event, especially the Olympics. As longtime fans of both the Summer and Winter Olympics, we felt it would be interesting to take a deeper dive into questions we’ve had about the athletes, their performance, and various facets of representation. It also helped that the dataset includes a wide range of information, such as athletes’ heights, weights, events, years, and medals. Its 120-year span was another contributing factor, since it lets us survey long-term trends and note differences between decades and time periods. Ultimately, we felt that the dataset not only connects well with our interests but is also varied enough to let us answer questions we would love to learn more about.

Questions

The two questions you want to answer.

Question 1: How has the representation of male and female athletes in various Olympic sports changed over time, and what factors and key milestones have influenced shifts in gender diversity across different sports and events?

Question 2: What is the relationship between an athlete’s country of origin and their likelihood of winning a medal in the Summer and Winter Olympics, considering factors such as historical dominance, investment in sports, and climate conditions?

Based on the feedback we received, we chose not to relate our questions too closely, so that they would not overlap too much in variables. After rereading the instructions, we want to keep our questions distinct so that we can tell two different stories through our visualizations.

Analysis plan

A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

The sports we will be analyzing for our project are: Athletics (Track & Field), Swimming, Weightlifting, Gymnastics, Wrestling, Soccer (Football), Basketball, Judo, Rowing, and Cycling.

We selected these sports because they represent a diverse mix of individual and team events, require varying levels of infrastructure and investment, and have shown strong regional dominance, allowing for meaningful comparisons of how different countries succeed in Olympic competition. These sports also offer insights into gender diversity trends, as they have experienced varying timelines of women’s inclusion, from early adoption to recent expansion. By analyzing them, we can examine how policies, cultural attitudes, and structural barriers have shaped gender representation over time. We chose to focus solely on Summer Olympic sports, as the Winter Olympics tend to be less globally diverse, with athletes from warmer climates and less developed nations facing significant barriers to participation due to limited access to training and equipment.
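A minimal sketch of how this restriction to Summer Games and the ten chosen sports might be applied is shown below. The spellings in `selected_sports` are assumptions about how the sports appear in the Sport column (for example, "Football" rather than "Soccer") and should be checked against `distinct(athlete_data, Sport)`; `toy_athletes` is a made-up stand-in for the real data.

```r
library(dplyr)

# Assumed spellings of the ten sports in the Sport column; verify before use.
selected_sports <- c(
  "Athletics", "Swimming", "Weightlifting", "Gymnastics", "Wrestling",
  "Football", "Basketball", "Judo", "Rowing", "Cycling"
)

# Toy stand-in for athlete_data: same column names, made-up rows.
toy_athletes <- tibble(
  Season = c("Summer", "Summer", "Winter", "Summer"),
  Sport  = c("Basketball", "Judo", "Speed Skating", "Tug-Of-War")
)

# Keep only Summer Games rows for the selected sports.
summer_selected <- toy_athletes %>%
  filter(Season == "Summer", Sport %in% selected_sports)

print(summer_selected)
```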

For Question 1, we plan to analyze trends in male and female participation in the Summer Olympics across our selected sports and all available years. We will use the Sex, Year, Sport, and Event variables and set aside variables we do not need for this question, such as Height, Weight, and City. We will also check the data for discrepancies and remove incomplete entries where necessary. To quantify gender diversity, we will calculate the percentage of male and female athletes for each Olympic Games and for each sport, which lets us analyze trends over time.

To identify the key factors behind shifts in gender diversity, we will track changes in these participation percentages and compare their timing with external milestones such as Olympic policy changes, the introduction of new women’s events, and country-specific social movements that had a wider impact. We will examine whether these milestones coincide with significant shifts in gender diversity and assess which appear to have had the greatest impact. These milestones are not recorded in the dataset, so they will mainly serve as supporting context in our discussion and analysis.

To visualize our answer, we are deciding between a line chart and a stacked bar chart, either of which would show the change in representation over the years.
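The percentage calculation described for Question 1 could be sketched roughly as follows. This is one possible approach, not a settled method: it assumes each athlete should count once per Games and sport (so rows are de-duplicated by ID before counting), and it uses a made-up `toy_athletes` tibble in place of the real data.

```r
library(dplyr)

# Toy stand-in for athlete_data: athlete 1 appears twice (two events).
toy_athletes <- tibble(
  ID    = c(1, 1, 2, 3, 4, 5, 6),
  Sex   = c("F", "F", "M", "M", "M", "F", "M"),
  Year  = c(1992, 1992, 1992, 1992, 1992, 2012, 2012),
  Sport = rep("Swimming", 7)
)

sex_share <- toy_athletes %>%
  # Count each athlete once per Games and sport, not once per event.
  distinct(ID, Sex, Year, Sport) %>%
  count(Year, Sport, Sex, name = "n_athletes") %>%
  group_by(Year, Sport) %>%
  # Percentage of each sex within a given Games and sport.
  mutate(pct = 100 * n_athletes / sum(n_athletes)) %>%
  ungroup()

print(sex_share)
```

A table in this shape (one row per Year, Sport, and Sex) can feed either a line chart of `pct` over `Year` or a stacked bar chart.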

For Question 2, we plan to examine the relationship between an athlete’s country and their likelihood of winning a medal. To do this, we will use the NOC (National Olympic Committee), Medal, and Season variables from the Olympic dataset. One variable we plan to create is a medal rate per country: the percentage of a country’s athletes who win a medal, which normalizes for differences in the number of athletes each country sends. Many rows have NA in the Medal column; a key point for our analysis is that this is not missing or messy data but simply how the dataset records an athlete who did not win a medal. Additionally, we will create a variable called success rate that identifies the sports in which a country consistently performs well.

To account for historical differences, we will segment the data by time period rather than aggregating all years together. This will help us control for changes in Olympic participation, competitiveness, and geopolitical shifts. We will also track each country’s first Olympic appearance and compare medal rates across different eras.

To present our findings, we plan to create bar charts comparing medal rates across countries to highlight the nations with the highest success rates, heatmaps showing country dominance in specific sports, line graphs tracking medal-winning performance over time, and stacked bar charts comparing medal rates across time periods to highlight shifts in dominance.
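A minimal sketch of the medal rate and era segmentation described for Question 2 is given below. Several choices here are assumptions made only for illustration: the 1950 cutoff between eras is arbitrary, and the rate is computed per row (one athlete in one event), so team events count every team member. `toy_athletes` is a made-up stand-in for the real data.

```r
library(dplyr)

# Toy stand-in for athlete_data: same column names, made-up rows.
toy_athletes <- tibble(
  NOC   = c("USA", "USA", "USA", "CHN", "CHN", "CHN", "CHN"),
  Year  = c(1936, 1936, 1996, 1992, 1992, 2012, 2012),
  Medal = c("Gold", NA, "Silver", NA, NA, "Bronze", "Gold")
)

medal_rates <- toy_athletes %>%
  mutate(
    # NA in Medal means no medal was won, not a missing record.
    won_medal = !is.na(Medal),
    # Hypothetical two-era split; the real analysis may use other cutoffs.
    era = if_else(Year < 1950, "pre-1950", "1950 onward")
  ) %>%
  group_by(NOC, era) %>%
  summarise(
    entries    = n(),
    medals     = sum(won_medal),
    medal_rate = mean(won_medal),
    .groups    = "drop"
  )

# First year each NOC appears in the data.
first_appearance <- toy_athletes %>%
  group_by(NOC) %>%
  summarise(first_year = min(Year), .groups = "drop")

print(medal_rates)
print(first_appearance)
```

Adding `Sport` to the `group_by()` call would give a per-sport rate, which is one way the success rate variable could be derived.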