── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
Loading required package: gridExtra
Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':
combine
Exploring Representation and Success in the Olympics
Rows: 271116 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): name, sex, team, noc, games, season, city, sport, event, medal
dbl (5): id, age, height, weight, year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 230 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): NOC, region, notes
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 196 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Continent, Country
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Introduction
This dataset contains data from 120 years of Olympic games. It contains results for each participants from the 1896 Olympics held in Athens, Greece to the 2016 Olympics held in Rio de Janeiro, Brazil. Within the dataset, there is an entry for each event an athlete competed in in their respective Olympic games. It contains data on the person them self, which includes their name, sex, age, height, and weight. Additionally, it includes data about the country they represent, the games in which they participated in that specific event, the year of the games, and the season of the game (summer or winter Olympics). Lastly, it holds information about the sport they play, the specific event the entry is for, and the medal they received for that specific event.
For our project, we are analyzing two trends throughout the Olympic Games. The first trend we are analyzing is how the representation of male and female athletes has changed throughout the games. Additionally, we are analyzing which countries showed the most growth in medal count throughout the history of the Olympic Games. For our analysis, the data columns we are interested in include the sex of the athlete, the team they played on, the year they competed, the games they competed in, and what medal they received in the event.
Question 1: How has the representation of male and female athletes changed over time?
Introduction
Our first question, “How has the representation of male and female athletes changed over time?”, explores the evolution of gender dynamics in the Olympic games from 1896 to 2016. The main goal of this question is to examine trends in gender representation and identity disparities, milestones, and progress towards gender equity in athletics. To answer this question, the parts of the dataset we will need to use are the games
(winter/summer competition), year
, sex
, and sport
variables. We will group the data by each of these to create an athlete_count
for each sport by competition (year/season combination) and gender.
Our visualizations will depict a general line plot to analyze the general trend in gender representation in the games and a ridge line plot faceted by sports to focus on gender representation in Track and Field, Swimming, and Gymnastics over time. We chose this question because we all watch the Olympics and are interested in the trends surrounding gender representation of athletes. We want to know how the participation of women in the Olympic games has changed throughout the years, especially knowing that the 2024 Paris Olympics was the first games with an equal number of male and female participants. By exploring this topic, we hope to better understand the gender dynamics and what progress has been made towards equality between men and women in the Olympics.
Approach
The first visualization we will make to address our question, “How has the representation of male and female athletes changed over time?”, is a line plot that shows the trend of male and female representation in percentages over time. The x-axis will represent Olympic years, and the y-axis will show the percentage of male and female athletes. There will be two lines (one for male athletes and one for female athletes) plotted with points at each Olympic year. This will allow us to easily see the general trend over time and identify when the representation of men and women either converges to equal percentages or diverges. We will also facet by Summer and Winter Games to see if there are any differences between the two types of Olympics as the Summer and Winter Olympics host different events that may be more geared one gender over another. We believe a line plot is the best choice for answering this question because it allows us to easily see the overall trend and any differences between the two types of Olympics, which will help provide a well-rounded view of how gender representation has changed in the Olympics.
Our second visualization for question one will be a ridgeline plot faceted by sport, focusing on gender representation in Athletics (Track & Field), Swimming, and Gymnastics over time. This plot is ideal for showing changes in gender participation trends because ridgeline plots effectively demonstrate the distribution and density of data over time, making it easy to see when participation increases or declines. By faceting the plot by sport, we can compare gender trends across different disciplines and identify sports where female participation increased at different rates. The density ridges will highlight shifts in representation, such as periods of rapid growth in female athletes, providing a clear visual narrative of how gender inclusion has evolved in the Olympics since 1896.
Analysis
# General cleaning of data
# Creates a table showing the gender count for each gender by sport for each Olympic Games
<- olympics |>
gender_all group_by(id, name, year, season) |>
summarize(sex = first(sex), sport = first(sport), .groups = "drop") |>
group_by(sex, sport, year, season) |>
summarize(gender_count = n())
`summarise()` has grouped output by 'sex', 'sport', 'year'. You can override
using the `.groups` argument.
gender_all
# A tibble: 1,395 × 5
# Groups: sex, sport, year [1,395]
sex sport year season gender_count
<chr> <chr> <dbl> <chr> <int>
1 F Alpine Skiing 1936 Winter 37
2 F Alpine Skiing 1948 Winter 40
3 F Alpine Skiing 1952 Winter 52
4 F Alpine Skiing 1956 Winter 59
5 F Alpine Skiing 1960 Winter 51
6 F Alpine Skiing 1964 Winter 55
7 F Alpine Skiing 1968 Winter 55
8 F Alpine Skiing 1972 Winter 50
9 F Alpine Skiing 1976 Winter 56
10 F Alpine Skiing 1980 Winter 66
# ℹ 1,385 more rows
# Visualization 1
# Creates a table calculating the male & female percentage for each Olympic games.
<- olympics |>
gender_all_general group_by(id, name, year, season) |>
summarize(sex = first(sex), .groups = "drop") |>
group_by(year, season, sex) |>
summarise(count = n(), .groups = "drop") |>
group_by(year, season) |>
mutate(percent = (count / sum(count)) * 100)
# Switching order for legend key
$sex <- factor(gender_all_general$sex, levels = c("M", "F"))
gender_all_general
# Creates a line plot showing the percentage of male and female participants at the Olympic games.
ggplot(gender_all_general, aes(x = year, y = percent, color = sex, group = sex)) +
geom_line() +
geom_point() +
scale_y_continuous(labels = scales::percent_format(scale = 1)) +
scale_x_continuous(breaks = seq(min(gender_all_general$year), max(gender_all_general$year), by = 4)) +
facet_wrap(~ season, ncol = 1) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
labs(title = "Gender Representation in the Olympics over Time",
subtitle = "By Summer and Winter from 1896 to 2016",
x = "(Summer) Olympic Year",
y = "Percentage of Female vs Male Athletes",
color = "Gender") +
scale_color_viridis_d(end = 0.8, direction = -1, labels = c("F" = "Female", "M" = "Male"))
# Visualization 2
# Creates new table getting each distinct athlete for each Olympics they participated in
# Used to account for athletes participating in multiple events in one Olympics
<- olympics |>
by_gender group_by(id, name, year, season) |>
summarize(sex = first(sex), sport = first(sport), .groups = "drop")
# Filters the previous table for the sports of interest
<- by_gender |>
gender_sport filter(sport %in% c("Athletics", "Swimming", "Gymnastics"))
# Creates a ridgeline plot depicting male and female percent participation among the 3 chosen sports
ggplot(gender_sport, aes(x = year, y = sex, fill = sex)) +
geom_density_ridges(scale = 5, alpha = 0.8) +
facet_grid(~ sport, scales = "fixed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_ridges() +
theme(
legend.position = "none",
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.title = element_text(size = 12),
+
) labs(
x = "Year",
y = "Sex",
title = "Gender Representation in the Olympics Overtime",
subtitle = "Participation in Athletics, Gymnastics, and Swimming Events"
+
) scale_fill_viridis_d(end=0.8)
Picking joint bandwidth of 3.26
Picking joint bandwidth of 5.69
Picking joint bandwidth of 4.21
Discussion
From the first visualization, we see a clear trend of increasing female representation in the Olympics over time. By 2016, the percentage of male and female athletes has nearly converged to a 50:50 ratio and by the 2024 Summer Olympics (not included in the dataset), it was at 50:50. From the plot, we see that there is no line for women until 1900, which makes sense since women were only first allowed to compete in the Olympics that year. Initially, their participation was limited to a few sports, including tennis, sailing, croquet, equestrianism, and golf. However, in 1928, women were allowed to compete in athletics (Track and Field) and gymnastics which is why we see a spike in women participation. In 1991, the IOC required that all new sports include women’s event which could account for the steady increase in women participation after 1991. Though both female and male representation seems to get very close to equal by 2016, the Summer Olympics more so than winter. This could be due to multiple factors, with one being that there is still one Winter Olympic sport (Nordic Combined) that is male only, and in the Summer Olympics there are no male-only events. Besides the general trend in our data, we see missing/extra data points on some years. This is because of historical events that has happened. In the plot, we see a extra point that does not match the 4 year sequence of Olympics in the year 1906, which is called th “Second International Olympic Games in Athens.” We have it in our data but it is important to note that the International Olympic Committee (IOC) does not recognize it as an official Olympic Game because it does not match the 4 year sequence. We see there are gaps in the data between 1940 and 1944 and this is because no Olympic games were held because of World War II. Lastly, the Summer and Winter Olympics originally occurred in the same year. After 1992, the Summer and Winter Olympics alternated every 2 years: Winter in 1994 and Summer in 1996. Therefore, the Olympic Years shown in the plot represents the current Summer Olympic Years.
For the second visualization, the analysis demonstrates distinct trends in gender representation across Athletics, Gymnastics, and Swimming in the Olympics. Female athletes did not emerge in any of these sports until around 1920, with Swimming being the earliest to include women in the 1910s. Athletics saw an initial rise in female participation after 1920, but this was followed by a decline over the next two decades. However, female representation in Athletics has steadily increased since then. Male participation in Athletics experienced a steep decline in the 1940s, due to World War II, before rebounding and then gradually declining again from the early 2000s. In Gymnastics, an inverse relationship between male and female participation was observed when women first competed in the sport in the late 1920’s, with male participation declining rapidly in the 1940’s, because of the war. However, after a surge in both genders around 1950, their participation trends have followed similar patterns, peaking around 1960 and 2000. These surges in distribution are related to the introduction of Rhythmic gymnastics in 1984 (female only) and Trampoline gymnastics in 2000 (male and female).
Swimming stands out as the sport with the most consistent gender representation trends compared to the other two plots, with both distributions closely mirroring each other. Although both genders have experienced fluctuations over time, participation in Swimming has generally increased since 1980. Compared to Athletics and Gymnastics, Swimming demonstrates the most balanced gender distribution from 1896 to 2016, which can be reflected in the growing number of female swimming events that mirror male events at the games. We can speculate that the data looks this way because of the rise of gender equality in recent decades, the introduction of more female events, and the various historical and societal shifts throughout the many years of the Olympic games.
Question 2: Which regions have shown the most improvement in medal counts over time?
Introduction
Our second question, “Which regions have shown the most improvement in medal counts over time?” investigates the changes in relative performance between competing regions as the Olympics, and the world, has developed over time. It aims to show region-based changes so that we can measure them against non-athletic developments over the same time period. To analyze this, we will use the `year`, `games`, `team`, and `medal` variables. However, we will convert medal earning into rates to address the variance in the number of athletes sent to the competition from each country. We chose this question because we are interested in how the value of athletics differs in different parts of the world. We also would like to know whether regions’ broader statuses are reflected in their athletic performances.
Approach
Our first plot will be a stacked bar graph that breaks down performance by continent. It will display each continent’s medal winning rate (total number of medals / total number of athletes) for each summer competition since 1944. This filtering is to ensure that all continents are represented fairly, as country representation was much less balanced in earlier games, and continues to be biased in the winter games due to climate differences. Additionally, the games were cancelled in 1940 due to World War II. The reason that a stacked bar graph will be the most effective here is because it will allow us to see relative performances for each year and therefore compare trends within and between continents.
Our second plot will be a line plot that breaks down the top 5 countries that have improved their medal rate the most since the 1972 olympics. The medal rate is the same as previously mentioned (total number of medals / total number of athletes). We once again used this method to ensure that each country was represented fairly, and we did not give an edge to countries with larger populations. In this plot, we also broke up the graphs by the season variable, creating one for winter and one for summer. We did this because different countries have vastly different performances in these games based on factors such as culture or climate of the countries. This graph will have a line for every country’s medal rates throughout the years, and will highlight the top 5 countries in how their medal rate grew from the 1972 to 2016 games. The reason why a line graph was chosen is because it is the best method to show how the medal rate changed throughout the years, as you can clearly see the increase/decrease for every 4 year olympic period. It also will allow you to notice trends with the highlighted countries and how they differ from the non highlighted countries.
Analysis
# Creates a table calculating the medal counts for each country, their total participants, and medal rate for each Olympic games.
<- olympics |>
medal_counts group_by(year, season, noc, event, medal) |>
summarise(count = 1, .groups = 'drop') |>
group_by(year, season, noc) |>
summarise(
gold = sum(medal == "Gold", na.rm = TRUE),
silver = sum(medal == "Silver", na.rm = TRUE),
bronze = sum(medal == "Bronze", na.rm = TRUE),
total_medals = gold + silver + bronze,
total_participants = n(),
.groups = 'drop'
|>
) left_join(noc_regions, by = c("noc" = "NOC")) |>
group_by(year, season, region) |>
summarise(
gold = sum(gold),
silver = sum(silver),
bronze = sum(bronze),
total_medals = sum(total_medals),
total_participants = sum(total_participants),
.groups = 'drop'
|>
) mutate(medal_rate = total_medals / total_participants)
medal_counts
# A tibble: 3,803 × 9
year season region gold silver bronze total_medals total_participants
<dbl> <chr> <chr> <int> <int> <int> <int> <int>
1 1896 Summer Australia 2 0 1 3 5
2 1896 Summer Austria 2 1 2 5 8
3 1896 Summer Denmark 1 2 3 6 12
4 1896 Summer France 5 4 2 11 23
5 1896 Summer Germany 7 5 2 14 35
6 1896 Summer Greece 10 17 17 44 72
7 1896 Summer Hungary 2 1 3 6 14
8 1896 Summer Italy 0 0 0 0 1
9 1896 Summer Sweden 0 0 0 0 5
10 1896 Summer Switzerland 1 2 0 3 8
# ℹ 3,793 more rows
# ℹ 1 more variable: medal_rate <dbl>
# Visualization 1
# Source for Country/Continent Conversion Dataset: https://www.kaggle.com/datasets/hserdaraltan/countries-by-continent?resource=download
# Creates a table that contains the medal rate for each continent for each summer Olympic games
<- left_join(medal_counts, noc_continent_conversion, by = c("region")) |>
medal_counts_continents mutate(Continent = case_when(
== "UK" ~ "Europe",
region == "USA" ~ "North America",
region == "Czech Republic" ~ "Europe",
region == "Bermuda" ~ "North America",
region == "Myanmar" ~ "Asia",
region == "Puerto Rico" ~ "North America",
region == "Curacao" ~ "South America",
region == "Trinidad" ~ "South America",
region == "Boliva" ~ "South America",
region == "Republic of Congo" ~ "Africa",
region == "Democratic Republic of the Congo" ~ "Africa",
region == "Bermuda" ~ "North America",
region == "Burkina Faso" ~ "Africa",
region == "Virgin Islands, US" ~ "North America",
region == "Virgin Islands, British" ~ "North America",
region == "Antigua" ~ "North America",
region == "Cayman Islands" ~ "North America",
region == "American Samoa" ~ "Oceania",
region == "Aruba" ~ "South America",
region == "Cook Islands" ~ "Oceania",
region == "Guam" ~ "Oceania",
region == "Saint Vincent" ~ "North America",
region == "Palestine" ~ "Asia",
region == "Saint Kitts" ~ "North America",
region == "Timor-Leste" ~ "Asia",
region TRUE ~ Continent
|>
)) filter((!is.na(Continent)) & (season == "Summer") & (year >= 1944)) |>
group_by(year, Continent) |>
summarize(
total_medals = sum(total_medals),
total_athletes = sum(total_participants)
|>
) mutate(
medal_rate = total_medals / total_athletes,
Continent = factor(Continent, levels=c("South America", "Africa", "Oceania", "Europe", "Asia", "North America"))
)
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
medal_counts_continents
# A tibble: 108 × 5
# Groups: year [18]
year Continent total_medals total_athletes medal_rate
<dbl> <fct> <int> <int> <dbl>
1 1948 Africa 11 89 0.124
2 1948 Asia 5 166 0.0301
3 1948 Europe 298 1557 0.191
4 1948 North America 100 391 0.256
5 1948 Oceania 13 64 0.203
6 1948 South America 12 258 0.0465
7 1952 Africa 11 135 0.0815
8 1952 Asia 91 424 0.215
9 1952 Europe 241 1908 0.126
10 1952 North America 85 377 0.225
# ℹ 98 more rows
# Creates stacked bar chart showing medal rate percentage by continent
|>
medal_counts_continents ggplot(aes(y = factor(year), x = medal_rate, fill = Continent)) +
geom_col() +
scale_fill_viridis_d(option = "D", limits = c("North America", "Asia", "Europe", "Oceania", "Africa", "South America")) +
scale_x_continuous(labels = label_percent()) +
labs(
title = "Rate of Olympic Medal Earning by Continent",
fill = "Continent",
x = "Medal Rate (Medals per Participant)",
y = "Year"
+
) theme_minimal() +
theme(
legend.position = "top"
)
# Creates a new table that calculates the metal rate change, and ranks the top/bottom 5 countries for both season's Olympics
<- medal_counts |>
medal_growths filter(year %in% c(1972, 2014, 2016)) |>
select(region, year, season, medal_rate) |>
pivot_wider(names_from = year, values_from = medal_rate) |>
rename(medal_rate_start = `1972`, medal_rate_2016 = `2016`, medal_rate_2014 = `2014`) |>
mutate(
medal_rate_change = case_when(
== "Summer" ~ medal_rate_2016 - medal_rate_start,
season == "Winter" ~ medal_rate_2014 - medal_rate_start
season
)|>
) group_by(season) |>
arrange(desc(medal_rate_change)) |>
mutate(rank_increase = row_number()) |>
arrange(medal_rate_change) |>
mutate(rank_decrease = row_number()) |>
ungroup() |>
mutate(
category = case_when(
<= 5 ~ "Top 5 Increase",
rank_increase <= 5 ~ "Top 5 Decrease",
rank_decrease TRUE ~ "Other"
)
)
# Joins the previous table with the original table to ensure all variables needed for plotting as well as the top/bottom 5 is available in the data
<- medal_counts |>
medal_counts_colored left_join(medal_growths |> select(region, season, category), by = c("region", "season"))
# Summer graph, line graph showing medal rate by country over the years, with the top 5 highlighted
<- medal_counts_colored |>
g1 filter(season == "Summer", year >= 1972) |>
mutate(
category = if_else(category == "Top 5 Increase", region, "Other"),
category = fct_relevel(
.f = category, "China", "Denmark", "Ethiopia", "Jamaica", "UK"
)|>
) ggplot(mapping = aes(x = year, y = medal_rate, group = region, color = category, alpha = category)) +
geom_line() +
scale_alpha_manual(values = c("Other" = 0.1), guide = "none") +
scale_color_viridis_d(end = 0.8) +
scale_y_continuous(labels = label_percent()) +
coord_cartesian(ylim = c(0, 0.65)) +
labs(
title = "Summer Olympics - Top 5 Largest Increases in Medal Rate",
subtitle = "From Years 1972 to 2016",
x = "Year",
y = "Medal Rate (Medals per Participant)",
color = "Country"
+
) theme_minimal()
# Winter graph, line graph showing medal rate by country over the years, with the top 5 highlighted
<- medal_counts_colored |>
g2 filter(season == "Winter", year >= 1972) |>
mutate(
category = if_else(category == "Top 5 Increase", region, "Other"),
category = fct_relevel(
.f = category, "Canada", "Sweden", "France", "South Korea", "UK"
)|>
) drop_na(category) |>
ggplot(mapping = aes(x = year, y = medal_rate, group = region, color = category, alpha = category)) +
geom_line() +
scale_alpha_manual(values = c("Other" = 0.1), guide = "none") +
scale_color_viridis_d(end = 0.8) +
scale_y_continuous(labels = label_percent(), breaks = seq(0, .6, by = 0.2)) +
coord_cartesian(ylim = c(0, 0.65)) +
labs(
title = "Winter Olympics - Top 5 Largest Increase in Medal Rate",
subtitle = "From Years 1972 to 2016",
x = "Year",
y = "Medal Rate (Medals per Participant)",
color = "Country"
+
) theme_minimal()
<- ggarrange(g1, g2) graph
Discussion
For the first plot that breaks the data down by continent, we identified a few key trends and anomalies that reflect major world events that occurred at the time. First is the initial spike in Asia’s medal winning rate from 1946 to 1952, which can be explained by the participation of the Soviet Union for the first time in that year. The number of athletes participating from Asia was more than double that of the previous games. Next is the dip in medal winning rates in both Oceania and Africa from 1968 to 1976. Many African countries protested the 1976 competition due to New Zealand’s demonstrated support toward South Africa, which had been banned from the games due to its apartheid system. In 1980, Asia again saw a major spike due to a widespread boycott in response to the Soviet Union invading Afghanistan. A retaliatory boycott occurred for the following competition in 1984, and with the Soviet Union and its supporters not participating, North America’s medal winning rate spiked. Since then, the continents have settled into three distinct groups, with North America and Asia on top, Europe and Oceania in the middle, and Africa and South America at the bottom.
For the second plot that explores the top 5 countries in medal growth rate for both the summer and winter Olympics, we identified trends that occurred both throughout each of the years, and in the overall results. One trend we identified was that there are less high outliers for the recent years, meaning that a more diverse set of countries are winning medals for their participants sent for each year. This is true for both summer and winter, and is slightly clearer in the winter plot. Although this reason may be due to world progress, we also noticed it was linked to the political state of the world at the time. The games with the most skewed results was 1980, which was an Olympic games that over 60 countries including the U.S. boycotted, which caused the Soviet Union to have unnaturally high numbers. A similar event occurred for 1984, but it was countries associated with the Soviet Union that boycotted those games, meaning western powers were likely skewing the results. Additionally, based on the countries that were top 5 in both winter and summer games, we were able to split them into two categories: developing countries and countries that put a lot of funding into sports. We noticed the greatest improvements fell into these categories, with countries like Jamaica, Ethiopia, and South Korea falling into these categories. However, we were surprised to see existing powers like China, France, and the UK also being on the list. We believe that this is due to their strong emphasis on sports and the Olympics, as each of these countries pours a lot of funding into training programs for the games.
Throughout this analysis, the main trends we uncovered were events in the political state of the world, country development, and country emphasis on Olympic success. In the first graph, we were able to identify key events within the world that caused boycotts and affected the numbers. We also saw how different continents with developing countries increased their medal rates, and how all their rates interact today. For the second graph, we were able to identify two main factors that went into country medal rate success. We found that the two groups that showed up were either first world countries that place high emphasis on Olympic success, or developing countries that are beginning to find success within the games. Overall, this analysis has led to valuable insights about both the world, and the countries themselves.
Presentation
Our presentation can be found here.
Data
Griffin, R. (2018). 120 years of Olympic history: athletes and results [Data set]. Sports Reference. https://www.kaggle.com/datasets/heesoo37/ 120-years-of-olympic-history-athletes-and-results/
References
- Data source: https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results/
- Olympic data: www.sports-reference.com