Project proposal

Author

Team Trusting-Hedgehog

library(tidyverse)

# Note: local copies of both datasets are also stored under /data.
cases_month <- read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-06-24/cases_month.csv'
)
cases_year <- read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-06-24/cases_year.csv'
)
dim(cases_month)
[1] 22780    15
colnames(cases_month)
 [1] "region"                "country"               "iso3"                 
 [4] "year"                  "month"                 "measles_suspect"      
 [7] "measles_clinical"      "measles_epi_linked"    "measles_lab_confirmed"
[10] "measles_total"         "rubella_clinical"      "rubella_epi_linked"   
[13] "rubella_lab_confirmed" "rubella_total"         "discarded"            
glimpse(cases_month)
Rows: 22,780
Columns: 15
$ region                <chr> "AFR", "AFR", "AFR", "AFR", "AFR", "AFR", "AFR",…
$ country               <chr> "Algeria", "Algeria", "Algeria", "Algeria", "Alg…
$ iso3                  <chr> "DZA", "DZA", "DZA", "DZA", "DZA", "DZA", "DZA",…
$ year                  <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, …
$ month                 <dbl> 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 1, 2, 3, 4, 1, …
$ measles_suspect       <dbl> 8, 10, 17, 7, 14, 10, 1, 1, 3, 5, 38, 26, 16, 5,…
$ measles_clinical      <dbl> 6, 10, 17, 5, 11, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, …
$ measles_epi_linked    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ measles_lab_confirmed <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ measles_total         <dbl> 8, 10, 17, 5, 11, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, …
$ rubella_clinical      <dbl> NA, NA, NA, 0, 0, 0, NA, NA, NA, 0, 0, 0, NA, NA…
$ rubella_epi_linked    <dbl> NA, NA, NA, 0, 0, 0, NA, NA, NA, 0, 0, 0, NA, NA…
$ rubella_lab_confirmed <dbl> NA, NA, NA, 1, 3, 8, NA, NA, NA, 1, 22, 7, NA, N…
$ rubella_total         <dbl> NA, NA, NA, 1, 3, 8, NA, NA, NA, 1, 22, 7, NA, N…
$ discarded             <dbl> 0, 0, 0, 2, 3, 10, 1, 0, 0, 5, 38, 26, 16, 5, 3,…
dim(cases_year)
[1] 2382   19
colnames(cases_year)
 [1] "region"                                                         
 [2] "country"                                                        
 [3] "iso3"                                                           
 [4] "year"                                                           
 [5] "total_population"                                               
 [6] "annualized_population_most_recent_year_only"                    
 [7] "total_suspected_measles_rubella_cases"                          
 [8] "measles_total"                                                  
 [9] "measles_lab_confirmed"                                          
[10] "measles_epi_linked"                                             
[11] "measles_clinical"                                               
[12] "measles_incidence_rate_per_1000000_total_population"            
[13] "rubella_total"                                                  
[14] "rubella_lab_confirmed"                                          
[15] "rubella_epi_linked"                                             
[16] "rubella_clinical"                                               
[17] "rubella_incidence_rate_per_1000000_total_population"            
[18] "discarded_cases"                                                
[19] "discarded_non_measles_rubella_cases_per_100000_total_population"
glimpse(cases_year)
Rows: 2,382
Columns: 19
$ region                                                          <chr> "AFRO"…
$ country                                                         <chr> "Alger…
$ iso3                                                            <chr> "DZA",…
$ year                                                            <dbl> 2012, …
$ total_population                                                <dbl> 376461…
$ annualized_population_most_recent_year_only                     <dbl> 376461…
$ total_suspected_measles_rubella_cases                           <dbl> 76, 85…
$ measles_total                                                   <dbl> 55, 0,…
$ measles_lab_confirmed                                           <dbl> 2, 0, …
$ measles_epi_linked                                              <dbl> 0, 0, …
$ measles_clinical                                                <dbl> 53, 0,…
$ measles_incidence_rate_per_1000000_total_population             <dbl> 1.46, …
$ rubella_total                                                   <dbl> 13, 29…
$ rubella_lab_confirmed                                           <dbl> 13, 29…
$ rubella_epi_linked                                              <dbl> 0, 0, …
$ rubella_clinical                                                <dbl> 0, 0, …
$ rubella_incidence_rate_per_1000000_total_population             <dbl> 0.35, …
$ discarded_cases                                                 <dbl> 8, 56,…
$ discarded_non_measles_rubella_cases_per_100000_total_population <dbl> 0.02, …

Dataset

The dataset used in this project comes from the TidyTuesday "Measles cases across the world" release (https://github.com/rfordatascience/tidytuesday/blob/main/data/2025/2025-06-24/readme.md), which originates from the World Health Organization (WHO) provisional measles and rubella surveillance reports. It contains global records of measles and rubella cases reported by countries and regions over time. The data is split into two tables: a monthly table and a yearly aggregated table.

The monthly table has 22,780 rows and 15 columns, covering 193 countries across 6 regions. The monthly data spans 2012 to 2025. Its variables are: region, country, iso3, year, month, measles_suspect, measles_clinical, measles_epi_linked, measles_lab_confirmed, measles_total, rubella_clinical, rubella_epi_linked, rubella_lab_confirmed, rubella_total, discarded.

The yearly table contains 2,382 observations and 19 variables, with 194 unique countries and 14 years of aggregated data. Its variables are: region, country, iso3, year, total_population, annualized_population_most_recent_year_only, total_suspected_measles_rubella_cases, measles_total, measles_lab_confirmed, measles_epi_linked, measles_clinical, measles_incidence_rate_per_1000000_total_population, rubella_total, rubella_lab_confirmed, rubella_epi_linked, rubella_clinical, rubella_incidence_rate_per_1000000_total_population, discarded_cases, discarded_non_measles_rubella_cases_per_100000_total_population.

Across both tables, the dataset provides a mix of categorical variables, such as country, and numerical variables, such as case counts. The total measles cases recorded in the yearly table sum to approximately 3.09 million (3,093,274).
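The summary figures above (number of countries and regions, year range, total measles cases) could be computed along the following lines. This is only a minimal sketch: toy_year is a hypothetical stand-in that reuses the real column names with made-up values, so the printed numbers will not match the real tables.

```r
library(dplyr)

# Hypothetical toy data; column names follow cases_year, values are invented.
toy_year <- tibble(
  region        = c("AFRO", "AFRO", "EURO", "EURO"),
  country       = c("Algeria", "Algeria", "France", "France"),
  year          = c(2012, 2013, 2012, 2013),
  measles_total = c(55, 0, 10, NA)
)

# One-row summary: distinct countries/regions, year span, and total cases.
# na.rm = TRUE is an assumption about how missing counts should be handled.
summary_stats <- toy_year %>%
  summarise(
    n_countries = n_distinct(country),
    n_regions   = n_distinct(region),
    first_year  = min(year),
    last_year   = max(year),
    measles_sum = sum(measles_total, na.rm = TRUE)
  )

summary_stats
```

Swapping toy_year for cases_year (or cases_month) should give the corresponding figures for the real data, assuming the columns are read in as shown in the glimpse() output.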

We chose this dataset because it combines time, country, and disease counts in one place, which makes it well suited to many kinds of charts in ggplot2. The data is large enough to show clear patterns, but it is also clean enough that we do not need to do much cleaning or preprocessing, even if we later apply machine learning models. It also has both monthly and yearly tables, so we can look at short-term seasonal changes as well as long-term trends. Another reason we chose this dataset is that measles and rubella still occur in many parts of the world and can be dangerous, especially when vaccination rates drop. Cornell, for example, has a mandatory measles and rubella vaccination requirement. By seeing how case numbers change across regions and over time, we can better understand why public health monitoring is important and why people should stay cautious and informed about global health risks.

Questions

Question 1: How have measles case numbers changed over time in different world regions?

Question 2: Is there a relationship between the total measles cases and the percentage of laboratory-confirmed cases in each country?

Analysis plan

For Question 1, we will use both the cases_year and cases_month datasets so that we can study measles trends over both the long term and the short term. The cases_year dataset will support the long-term analysis, focusing on the variables year, region, and measles_total. After some basic cleaning, we will group the data by year and region to calculate the total measles cases for each region in each year. We will then use ggplot2 to plot measles case counts against year, using color or facets to separate regions. These visualizations should help us determine whether measles cases are increasing, decreasing, or staying stable in different parts of the world.

At the same time, we will use the cases_month dataset to explore short-term or seasonal patterns, focusing on month, year, region, and measles_total. By aggregating monthly totals and plotting them as line charts or faceted plots, we can see whether certain months tend to have more cases. The two tables label regions differently (for example, "AFR" in cases_month and "AFRO" in cases_year), so the region codes will need to be harmonized before results from the two tables are compared. Finally, we will draw on plots from both datasets to reach a conclusion.
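The two aggregation steps for Question 1 might look roughly like the sketch below. The toy_year and toy_month tibbles are hypothetical stand-ins with invented values; only the column names come from the real tables. Averaging each month's regional total across years is one possible way to summarize seasonality, not a settled choice.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data mimicking cases_year (values invented).
toy_year <- tibble(
  region        = c("AFRO", "AFRO", "AFRO", "EURO", "EURO", "EURO"),
  country       = c("Algeria", "Angola", "Algeria", "France", "Italy", "France"),
  year          = c(2012, 2012, 2013, 2012, 2012, 2013),
  measles_total = c(55, 20, 0, 10, NA, 30)
)

# Long-term view: total measles cases per region per year.
regional_yearly <- toy_year %>%
  group_by(region, year) %>%
  summarise(measles_total = sum(measles_total, na.rm = TRUE), .groups = "drop")

# One panel per region; free y scales so small regions stay readable.
p_trend <- ggplot(regional_yearly, aes(x = year, y = measles_total)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ region, scales = "free_y") +
  labs(x = "Year", y = "Total measles cases")

# Hypothetical toy data mimicking cases_month (values invented).
# Note the region code here is "AFR", not "AFRO" as in the yearly table.
toy_month <- tibble(
  region        = c("AFR", "AFR", "AFR", "AFR"),
  year          = c(2012, 2012, 2013, 2013),
  month         = c(1, 2, 1, 2),
  measles_total = c(8, 10, 4, 6)
)

# Short-term view: sum over countries within each region-year-month,
# then average each calendar month across years.
seasonal <- toy_month %>%
  group_by(region, year, month) %>%
  summarise(monthly_total = sum(measles_total, na.rm = TRUE), .groups = "drop") %>%
  group_by(region, month) %>%
  summarise(mean_monthly_total = mean(monthly_total), .groups = "drop")

p_season <- ggplot(seasonal, aes(x = month, y = mean_monthly_total)) +
  geom_line() +
  facet_wrap(~ region) +
  labs(x = "Month", y = "Mean monthly measles cases")
```

Printing p_trend or p_season renders the charts. With the real tables, toy_year and toy_month would be replaced by cases_year and cases_month.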

For Question 2, we will mostly use the cases_year dataset because it already summarizes the data by country and year, which makes comparison easier. The main columns we will use are country, measles_total, and measles_lab_confirmed. We will create a new variable called lab_ratio, defined as measles_lab_confirmed / measles_total. Because it is a proportion, this ratio lets us compare countries more fairly, since some countries have much larger case numbers than others. Rows where measles_total is 0 have no defined ratio, so they will need to be excluded or treated as missing. We then plan to make scatter plots or bar charts in ggplot2 to see how total cases and confirmation ratios relate to each other, and we may use color to show different regions. The cases_month dataset may serve as a quick reference if we want to see whether short-term spikes affect the ratio, but our main analysis will stay with the yearly data.
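A minimal sketch of the lab_ratio calculation follows, again on a hypothetical toy tibble with invented values. Setting the ratio to NA when measles_total is 0 and putting the x-axis on a log scale are both assumptions about how we might handle these cases, not final decisions.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data mimicking cases_year (values invented).
toy_year <- tibble(
  region                = c("AFRO", "AFRO", "EURO", "EURO"),
  country               = c("Algeria", "Algeria", "France", "Italy"),
  year                  = c(2012, 2013, 2012, 2012),
  measles_total         = c(55, 0, 100, 40),
  measles_lab_confirmed = c(2, 0, 50, 40)
)

# Share of total measles cases that were laboratory confirmed.
# Rows with zero total cases get NA instead of 0/0 (NaN).
lab_ratios <- toy_year %>%
  mutate(
    lab_ratio = if_else(
      measles_total > 0,
      measles_lab_confirmed / measles_total,
      NA_real_
    )
  )

# Scatter plot of total cases against confirmation ratio, colored by region.
p_ratio <- lab_ratios %>%
  filter(!is.na(lab_ratio)) %>%
  ggplot(aes(x = measles_total, y = lab_ratio, colour = region)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "Total measles cases (log scale)", y = "Lab-confirmed ratio")
```

Printing p_ratio renders the chart. To get one point per country rather than one per country-year, the counts could first be summed by country before the ratio is taken.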