# A tibble: 271,116 × 15
id name sex age height weight team noc games year season city
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 1 A Dijia… M 24 180 80 China CHN 1992… 1992 Summer Barc…
2 2 A Lamusi M 23 170 60 China CHN 2012… 2012 Summer Lond…
3 3 Gunnar … M 24 NA NA Denm… DEN 1920… 1920 Summer Antw…
4 4 Edgar L… M 34 NA NA Denm… DEN 1900… 1900 Summer Paris
5 5 Christi… F 21 185 82 Neth… NED 1988… 1988 Winter Calg…
6 5 Christi… F 21 185 82 Neth… NED 1988… 1988 Winter Calg…
7 5 Christi… F 25 185 82 Neth… NED 1992… 1992 Winter Albe…
8 5 Christi… F 25 185 82 Neth… NED 1992… 1992 Winter Albe…
9 5 Christi… F 27 185 82 Neth… NED 1994… 1994 Winter Lill…
10 5 Christi… F 27 185 82 Neth… NED 1994… 1994 Winter Lill…
# ℹ 271,106 more rows
# ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>
Dataset
The Olympics dataset was taken from OlympicData and has 271,116 rows and 15 columns. We chose this dataset because it clearly lays out all of the relevant columns we may need, such as the country and the host city of each Games. The most important columns are sport, event, and medal. The dataset also requires very little cleaning, which gives us more time to develop visualizations that fit it well.
library(tidyverse)
Questions
How has the participation of developing countries in the Olympics changed over time?
What is the age distribution of gold, silver, and bronze medalists in contact sports like Judo or Boxing?
Analysis plan
How has the participation of low-income and lower-middle-income countries in the Olympics changed over time?
The main relevant columns are id (athlete identifier), noc (country identifier), and year. Since this question covers only low-income and lower-middle-income countries, we need an outside dataset to determine how each country is classified. We identified the Worldwide Bureaucracy Indicators (WWBI) dataset in the TidyTuesday 2024 folder, which lists the income level of every country (from low income to high income) as of its latest record date. We recognize the limitations of these data: a classification taken from a country's latest record may not reflect its status in earlier Olympic years. We will merge this dataset with the Olympics data and use it to flag each country as low-income/lower-middle-income or not (1 or 0). The merged dataset can then be subset to only low-income and lower-middle-income nations. Low-income and middle-income countries will be designated as “developing”, while high-income countries will be considered “developed.” This boundary may be somewhat vague in the data, so we will make exceptions for specific countries that have high income but are still considered “developing” on the global stage.
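The merge and the 1/0 flag described above could look roughly like the sketch below. It runs on small toy tables rather than the real files: the country codes are placeholders, and the lookup column names `country_code` and `income_group`, as well as the exact label strings, are assumptions that would need to be checked against the external dataset.

```r
library(tidyverse)

# Toy stand-in for the Olympics data (id, noc, year are real column names;
# the values and country codes here are made up for illustration).
olympics_toy <- tibble(
  id   = c(1, 2, 3, 4, 5),
  noc  = c("AAA", "BBB", "CCC", "DDD", "ZZZ"),
  year = c(1992, 1996, 2000, 2004, 2008)
)

# Hypothetical income lookup. `country_code`, `income_group`, and the label
# strings are assumed names, not confirmed against the external dataset.
income_toy <- tibble(
  country_code = c("AAA", "BBB", "CCC", "DDD"),
  income_group = c("High income", "Upper middle income",
                   "Low income", "Lower middle income")
)

# A left join keeps every Olympics row; countries with no match get NA.
classified <- olympics_toy |>
  left_join(income_toy, by = c("noc" = "country_code")) |>
  mutate(
    # %in% returns FALSE for NA, so unmatched countries are flagged 0.
    low_income_flag = as.integer(
      income_group %in% c("Low income", "Lower middle income")
    )
  )

# Subset to the countries of interest.
low_income_only <- classified |>
  filter(low_income_flag == 1)
```

Because unmatched countries silently receive a flag of 0 here, it may be worth inspecting rows where `income_group` is `NA` before subsetting, in case the two datasets use different country codes.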
To visualize participation, one option is to count the distinct values of the id column to measure the number of athletes who participate from each country in each year. Counting distinct ids matters because an athlete appears once per event entered, so the same id can occur on several rows for a single Games. We can show this with a line graph, with the number of athletes on the y axis and year on the x axis, since we are plotting change over time. The data will be grouped by country. Another option is a proportional analysis that measures each country's share of all athletes, which could be shown with a stacked bar chart. We can also facet by sport to display each country's participation by sport across the years. This adds another level of nuance that could help our understanding of participation among developing countries.
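One way to measure participation is to count distinct id values per country and year, since the same athlete can appear on several rows of one Games. The sketch below uses a made-up toy table with placeholder country codes; only the column names id, noc, and year come from the dataset.

```r
library(tidyverse)

# Toy data: athlete 1 appears twice in 2000 (two events) and
# athlete 4 appears twice in 2004.
participation_toy <- tibble(
  id   = c(1, 1, 2, 3, 4, 4, 5, 6),
  noc  = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "AAA"),
  year = c(2000, 2000, 2000, 2000, 2004, 2004, 2004, 2004)
)

# Count each athlete once per country and year.
athletes_per_year <- participation_toy |>
  group_by(noc, year) |>
  summarise(n_athletes = n_distinct(id), .groups = "drop")

# Line graph: year on the x axis, athlete count on the y axis,
# one line per country.
participation_plot <- ggplot(
  athletes_per_year,
  aes(x = year, y = n_athletes, colour = noc)
) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Number of athletes", colour = "Country (noc)")

participation_plot
```

Adding `facet_wrap(~ sport)` after including sport in the grouping would give the per-sport view mentioned above.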
What is the age distribution of gold, silver, and bronze medalists in contact sports like Judo or Boxing?
We will filter the dataset to the contact sports of interest, such as Judo and Boxing, and then remove rows with missing age values. After that, we will group by medal and summarize the mean, median, standard deviation, minimum and maximum, and interquartile range of age.
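A minimal sketch of this summary on a toy table is shown below. The values are invented, the choice of Judo and Boxing as the contact sports is only an example, and dropping rows with a missing medal (assumed here to mean the athlete did not win one) is our assumption rather than a step stated in the plan.

```r
library(tidyverse)

# Toy data with the real column names sport, age, and medal.
medals_toy <- tibble(
  sport = c("Judo", "Judo", "Boxing", "Boxing", "Boxing", "Swimming", "Judo"),
  age   = c(24, NA, 22, 27, 30, 19, 26),
  medal = c("Gold", "Silver", "Gold", "Bronze", NA, "Gold", "Gold")
)

age_summary <- medals_toy |>
  # Keep only the contact sports of interest (example list).
  filter(sport %in% c("Judo", "Boxing")) |>
  # Drop missing ages; also drop rows without a medal (assumption).
  filter(!is.na(age), !is.na(medal)) |>
  group_by(medal) |>
  summarise(
    n          = n(),
    mean_age   = mean(age),
    median_age = median(age),
    sd_age     = sd(age),   # NA when a group has a single row
    min_age    = min(age),
    max_age    = max(age),
    iqr_age    = IQR(age),
    .groups    = "drop"
  )

age_summary
```

The same filtered data could feed a box plot or density plot per medal type if a visual view of the distribution is preferred over a table.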