Project proposal

Author

Team Trusting-Hedgehog

library(tidyverse)

#NOTICE: I downloaded the datasets and placed them under /data as well.

# load data set by read_csv
# load cases_month data
cases_month <- read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-06-24/cases_month.csv'
)
# load cases_year data
cases_year <- read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-06-24/cases_year.csv'
)
dim(cases_month)
[1] 22780    15
colnames(cases_month)
 [1] "region"                "country"               "iso3"                 
 [4] "year"                  "month"                 "measles_suspect"      
 [7] "measles_clinical"      "measles_epi_linked"    "measles_lab_confirmed"
[10] "measles_total"         "rubella_clinical"      "rubella_epi_linked"   
[13] "rubella_lab_confirmed" "rubella_total"         "discarded"            
glimpse(cases_month)
Rows: 22,780
Columns: 15
$ region                <chr> "AFR", "AFR", "AFR", "AFR", "AFR", "AFR", "AFR",…
$ country               <chr> "Algeria", "Algeria", "Algeria", "Algeria", "Alg…
$ iso3                  <chr> "DZA", "DZA", "DZA", "DZA", "DZA", "DZA", "DZA",…
$ year                  <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, …
$ month                 <dbl> 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 1, 2, 3, 4, 1, …
$ measles_suspect       <dbl> 8, 10, 17, 7, 14, 10, 1, 1, 3, 5, 38, 26, 16, 5,…
$ measles_clinical      <dbl> 6, 10, 17, 5, 11, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, …
$ measles_epi_linked    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ measles_lab_confirmed <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ measles_total         <dbl> 8, 10, 17, 5, 11, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, …
$ rubella_clinical      <dbl> NA, NA, NA, 0, 0, 0, NA, NA, NA, 0, 0, 0, NA, NA…
$ rubella_epi_linked    <dbl> NA, NA, NA, 0, 0, 0, NA, NA, NA, 0, 0, 0, NA, NA…
$ rubella_lab_confirmed <dbl> NA, NA, NA, 1, 3, 8, NA, NA, NA, 1, 22, 7, NA, N…
$ rubella_total         <dbl> NA, NA, NA, 1, 3, 8, NA, NA, NA, 1, 22, 7, NA, N…
$ discarded             <dbl> 0, 0, 0, 2, 3, 10, 1, 0, 0, 5, 38, 26, 16, 5, 3,…
dim(cases_year)
[1] 2382   19
colnames(cases_year)
 [1] "region"                                                         
 [2] "country"                                                        
 [3] "iso3"                                                           
 [4] "year"                                                           
 [5] "total_population"                                               
 [6] "annualized_population_most_recent_year_only"                    
 [7] "total_suspected_measles_rubella_cases"                          
 [8] "measles_total"                                                  
 [9] "measles_lab_confirmed"                                          
[10] "measles_epi_linked"                                             
[11] "measles_clinical"                                               
[12] "measles_incidence_rate_per_1000000_total_population"            
[13] "rubella_total"                                                  
[14] "rubella_lab_confirmed"                                          
[15] "rubella_epi_linked"                                             
[16] "rubella_clinical"                                               
[17] "rubella_incidence_rate_per_1000000_total_population"            
[18] "discarded_cases"                                                
[19] "discarded_non_measles_rubella_cases_per_100000_total_population"
glimpse(cases_year)
Rows: 2,382
Columns: 19
$ region                                                          <chr> "AFRO"…
$ country                                                         <chr> "Alger…
$ iso3                                                            <chr> "DZA",…
$ year                                                            <dbl> 2012, …
$ total_population                                                <dbl> 376461…
$ annualized_population_most_recent_year_only                     <dbl> 376461…
$ total_suspected_measles_rubella_cases                           <dbl> 76, 85…
$ measles_total                                                   <dbl> 55, 0,…
$ measles_lab_confirmed                                           <dbl> 2, 0, …
$ measles_epi_linked                                              <dbl> 0, 0, …
$ measles_clinical                                                <dbl> 53, 0,…
$ measles_incidence_rate_per_1000000_total_population             <dbl> 1.46, …
$ rubella_total                                                   <dbl> 13, 29…
$ rubella_lab_confirmed                                           <dbl> 13, 29…
$ rubella_epi_linked                                              <dbl> 0, 0, …
$ rubella_clinical                                                <dbl> 0, 0, …
$ rubella_incidence_rate_per_1000000_total_population             <dbl> 0.35, …
$ discarded_cases                                                 <dbl> 8, 56,…
$ discarded_non_measles_rubella_cases_per_100000_total_population <dbl> 0.02, …

Dataset

The dataset used in this project comes from the TidyTuesday "Measles cases across the world" release (https://github.com/rfordatascience/tidytuesday/blob/main/data/2025/2025-06-24/readme.md), which originates from the World Health Organization (WHO) provisional measles and rubella surveillance reports. It contains global records of measles and rubella cases reported by countries and regions over time. The data is split into two tables: a monthly table and a yearly aggregated table.

The monthly table has 22780 rows and 15 columns, covering 193 countries across 6 regions. The time span of the monthly data ranges from 2012 to 2025. The variables included in the monthly dataset are: region, country, iso3, year, month, measles_suspect, measles_clinical, measles_epi_linked, measles_lab_confirmed, measles_total, rubella_clinical, rubella_epi_linked, rubella_lab_confirmed, rubella_total, discarded.

The yearly dataset contains 2382 observations and 19 variables, covering 194 unique countries and 14 years of aggregated data. The variables included in the yearly dataset are: region, country, iso3, year, total_population, annualized_population_most_recent_year_only, total_suspected_measles_rubella_cases, measles_total, measles_lab_confirmed, measles_epi_linked, measles_clinical, measles_incidence_rate_per_1000000_total_population, rubella_total, rubella_lab_confirmed, rubella_epi_linked, rubella_clinical, rubella_incidence_rate_per_1000000_total_population, discarded_cases, discarded_non_measles_rubella_cases_per_100000_total_population.

Across both tables, the dataset provides a mix of categorical variables, such as region and country, and numerical variables, such as case counts and incidence rates. The measles_total column in the yearly dataset sums to roughly 3.09 million cases (3,093,274).
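
The summary figures quoted in this section can be recomputed with a short helper along the following lines. This is a minimal sketch: `describe_cases()` is a name introduced here for illustration, and it only relies on the `country`, `region`, `year`, and `measles_total` columns that both tables share.

```r
library(dplyr)

# Summarise the size and coverage of one of the case tables.
# Works for either table because both contain
# `country`, `region`, `year`, and `measles_total`.
describe_cases <- function(cases) {
  cases |>
    summarise(
      n_rows        = n(),
      n_countries   = n_distinct(country),
      n_regions     = n_distinct(region),
      first_year    = min(year, na.rm = TRUE),
      last_year     = max(year, na.rm = TRUE),
      measles_cases = sum(measles_total, na.rm = TRUE)
    )
}
```

Calling `describe_cases(cases_month)` and `describe_cases(cases_year)` should give one row per table, which makes it easy to check the counts reported above after any data refresh.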

We chose this dataset because it combines time, country, and case counts in one place, which makes it well suited to a range of ggplot2 charts. The data is large enough to show clear patterns and is already fairly clean, so little additional cleaning or preprocessing should be needed if we later apply machine learning models. Because it includes both monthly and yearly tables, we can examine short-term seasonal changes as well as long-term trends. Another reason is that measles and rubella still circulate in many parts of the world and can be dangerous, especially when vaccination rates drop. This is also why Cornell has a mandatory measles and rubella vaccination requirement. By seeing how case numbers change across regions and over time, we can better understand why public health monitoring is important and why people should stay cautious and informed about global health risks.

Questions

Question 1: How have measles incidence rates (per million population) changed over time across WHO regions, and is the recent resurgence in the United States part of a broader trend among high-income countries?

Question 2: Are measles resurgence patterns geographically clustered, and do certain WHO regions experience synchronized increases over time?

Analysis plan

Before answering any questions, we will first perform basic data cleaning. Where necessary, we will exclude observations with missing values in key variables such as incidence rate or total measles cases. Because population size varies widely across countries, we will mainly use measles_incidence_rate_per_1000000_total_population (cases per million people) instead of raw case counts when comparing countries or regions. We will also check for extreme or inconsistent values and document any observations that are removed. We plan to use the full available time range unless early years show substantial missing or inconsistent data. All WHO regions will be included unless data limitations require adjustment.
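
One way to keep the exclusions documented is to flag rows with a reason instead of dropping them silently. The sketch below is only an illustration of that idea: `clean_cases_year()` and the `drop_reason` column are names introduced here, and the three rules (missing rate, missing total, negative value) are assumptions about what "inconsistent" might mean, not a final rule set.

```r
library(dplyr)

# Split the yearly table into rows kept for analysis and rows set aside.
# Each removed row carries a `drop_reason` so exclusions can be reported.
clean_cases_year <- function(cases_year) {
  flagged <- cases_year |>
    mutate(
      drop_reason = case_when(
        is.na(measles_incidence_rate_per_1000000_total_population) ~
          "missing incidence rate",
        is.na(measles_total) ~ "missing measles_total",
        measles_incidence_rate_per_1000000_total_population < 0 |
          measles_total < 0 ~ "negative value",
        TRUE ~ NA_character_
      )
    )

  list(
    kept    = flagged |> filter(is.na(drop_reason)) |> select(-drop_reason),
    removed = flagged |> filter(!is.na(drop_reason))
  )
}
```

With `res <- clean_cases_year(cases_year)`, `count(res$removed, drop_reason)` gives a small table that can go straight into the write-up.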

For Question 1, we will mainly use the cases_year dataset to study long-term trends. We will focus on the variables year, region, country, and measles_incidence_rate_per_1000000_total_population. After cleaning the data, we will group the data by year and region to calculate the average incidence rate for each region in each year. Then we will create time-series line plots using ggplot2 to show how measles incidence rates change over time across different WHO regions. We will use different colors or facets to separate regions so that patterns can be compared clearly.
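
The grouping and plotting steps for the regional trend could look roughly like this. The function names are hypothetical. The sketch also computes a pooled rate (regional cases divided by regional population) next to the simple average, on the assumption that the incidence column equals measles_total / total_population × 1,000,000, which is consistent with the first row shown in the glimpse() output; the simple average treats small and large countries equally, so the two can differ.

```r
library(dplyr)
library(ggplot2)

# Region-by-year summary of measles incidence (cases per million people).
regional_trends <- function(cases_year) {
  cases_year |>
    filter(
      !is.na(measles_incidence_rate_per_1000000_total_population),
      !is.na(measles_total),
      !is.na(total_population)
    ) |>
    group_by(region, year) |>
    summarise(
      # unweighted mean of country-level rates
      mean_rate = mean(measles_incidence_rate_per_1000000_total_population),
      # population-weighted alternative: regional cases per million people
      pooled_rate = sum(measles_total) / sum(total_population) * 1e6,
      .groups = "drop"
    )
}

# Line plot with one coloured series per WHO region.
plot_regional_trends <- function(trends) {
  ggplot(trends, aes(x = year, y = mean_rate, colour = region)) +
    geom_line() +
    geom_point() +
    labs(
      x = "Year",
      y = "Mean measles incidence (per million)",
      colour = "WHO region"
    )
}
```

Adding `facet_wrap(~ region, scales = "free_y")` to the plot is one option if a single high-incidence region flattens the others.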

To examine whether the recent resurgence in the United States is part of a broader trend among high-income countries, we will compare the United States with selected high-income countries and visualize their incidence rates over time. By placing them on the same time-series plot, we can observe whether they show similar increases in recent years. This helps us determine whether the US pattern is unique or reflects a wider resurgence.
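
A sketch of the country comparison follows. The list of comparison countries is purely illustrative and has not been decided; any set of ISO3 codes can be passed in. `compare_countries()` and `plot_comparison()` are names introduced here.

```r
library(dplyr)
library(ggplot2)

# Keep the chosen countries and mark the one to highlight.
compare_countries <- function(cases_year, iso3_codes, highlight = "USA") {
  cases_year |>
    filter(iso3 %in% iso3_codes) |>
    mutate(highlighted = iso3 == highlight)
}

# One line per country; the highlighted country is drawn in a strong colour.
plot_comparison <- function(comparison) {
  ggplot(
    comparison,
    aes(
      x = year,
      y = measles_incidence_rate_per_1000000_total_population,
      group = iso3,
      colour = highlighted
    )
  ) +
    geom_line() +
    scale_colour_manual(
      values = c(`TRUE` = "firebrick", `FALSE` = "grey60"),
      labels = c(`TRUE` = "Highlighted country", `FALSE` = "Comparison countries"),
      name = NULL
    ) +
    labs(x = "Year", y = "Measles incidence (per million)")
}

# Illustrative selection only; the final list is still to be chosen.
example_codes <- c("USA", "CAN", "GBR", "FRA", "DEU", "JPN", "AUS")
```

For example, `plot_comparison(compare_countries(cases_year, example_codes))` would draw the United States against the other six countries on one set of axes.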

At the same time, we will use the cases_month dataset to explore short-term or seasonal patterns. We will focus on month, year, region, and measles_total, and aggregate monthly totals to visualize whether certain months tend to show higher outbreaks. This helps us distinguish between seasonal variation and long-term upward trends.
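
The monthly aggregation might be sketched as below: case counts are first summed within each region, year, and month, and then averaged across years to give a typical seasonal profile. The function names are hypothetical. Note also that the glimpse() output above shows the monthly table coding the African region as "AFR" while the yearly table uses "AFRO", so region labels may need harmonising before the two tables are compared side by side.

```r
library(dplyr)
library(ggplot2)

# Average monthly measles total per region, across all available years.
monthly_profile <- function(cases_month) {
  cases_month |>
    group_by(region, year, month) |>
    summarise(total = sum(measles_total, na.rm = TRUE), .groups = "drop") |>
    group_by(region, month) |>
    summarise(mean_total = mean(total), .groups = "drop")
}

# One panel per region, months on the x-axis.
plot_monthly_profile <- function(profile) {
  ggplot(profile, aes(x = month, y = mean_total)) +
    geom_col() +
    facet_wrap(~ region, scales = "free_y") +
    scale_x_continuous(breaks = 1:12) +
    labs(x = "Month", y = "Mean monthly measles cases")
}
```

Free y-axes are used because regional totals differ by orders of magnitude; the shape of each panel, not its height, is what indicates seasonality.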

For Question 2, we will investigate whether measles resurgence patterns are geographically clustered and whether certain WHO regions experience synchronized increases over time. We will use the cases_year dataset and focus on region, country, year, and incidence rate. First, we will create choropleth maps using country iso3 codes to visualize the spatial distribution of measles incidence in selected years. By comparing maps from different years, we can observe whether outbreaks are concentrated in particular regions.
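
The map step needs country polygons, which the dataset does not include. The sketch below assumes they come from an sf object with an ISO3 column, for instance the one returned by `rnaturalearth::ne_countries()`; that package, the sf package, and the `iso_a3` column name are assumptions that should be checked against the installed versions. `join_incidence_to_map()` and `plot_incidence_map()` are names introduced here.

```r
library(dplyr)
library(ggplot2)

# Attach one year's incidence to country polygons by ISO3 code.
# `world` is assumed to hold an ISO3 column whose name is given by `iso_col`.
join_incidence_to_map <- function(world, cases_year, map_year,
                                  iso_col = "iso_a3") {
  one_year <- cases_year |>
    filter(year == map_year) |>
    select(
      iso3,
      incidence = measles_incidence_rate_per_1000000_total_population
    )

  # Countries without a report for that year keep NA incidence.
  left_join(world, one_year, by = setNames("iso3", iso_col))
}

# Choropleth for a single year; requires `map_data` to be an sf object.
plot_incidence_map <- function(map_data, map_year) {
  ggplot(map_data) +
    geom_sf(aes(fill = incidence)) +
    scale_fill_viridis_c(na.value = "grey90") +
    labs(
      title = paste("Measles incidence per million,", map_year),
      fill = "Incidence"
    )
}

# Possible usage, assuming rnaturalearth and sf are installed:
# world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")
# plot_incidence_map(join_incidence_to_map(world, cases_year, 2019), 2019)
```

Countries that do not match on the ISO3 code appear in grey, which doubles as a quick check on how well the codes in the two sources line up.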

Second, we will plot incidence rates for all WHO regions on the same time-series graph to examine synchronization. If multiple regions experience increases during the same period, this may indicate a broader global resurgence rather than isolated regional events.
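
Beyond reading the plot by eye, synchronization between regions can be summarised with pairwise correlations of the regional series. The following is a minimal sketch under two assumptions made here: regional series are unweighted yearly means of country rates, and Spearman rank correlation is used because incidence is heavily skewed. `region_correlations()` is a hypothetical name.

```r
library(dplyr)
library(tidyr)

# Pairwise Spearman correlations between regional yearly incidence series.
region_correlations <- function(cases_year) {
  wide <- cases_year |>
    filter(!is.na(measles_incidence_rate_per_1000000_total_population)) |>
    group_by(region, year) |>
    summarise(
      mean_rate = mean(measles_incidence_rate_per_1000000_total_population),
      .groups = "drop"
    ) |>
    # one row per year, one column per region
    pivot_wider(names_from = region, values_from = mean_rate) |>
    arrange(year)

  cor(
    select(wide, -year),
    use = "pairwise.complete.obs",
    method = "spearman"
  )
}
```

With only about 14 yearly points per region these correlations are rough, and two series that both trend upward will correlate even without any real link, so they are best read alongside the time-series plot.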

In addition to visualization, we will use correlation analysis and simple regression models to formally test time trends and regional differences in incidence rates. All findings will be interpreted carefully, since the dataset reflects reported surveillance data rather than direct measures of vaccination coverage or other causal factors.
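
As one possible form for the simple regression, the sketch below fits a linear model of log-transformed incidence on year, region, and their interaction, so that each region gets its own time trend. The log1p transform, the centring of year, and the name `fit_trend_model()` are choices made here for illustration, not a settled specification.

```r
library(dplyr)

# Linear time trend in log incidence, allowed to differ by region.
fit_trend_model <- function(cases_year) {
  model_data <- cases_year |>
    filter(!is.na(measles_incidence_rate_per_1000000_total_population)) |>
    mutate(
      # log1p handles the many zero-incidence country-years
      log_rate = log1p(measles_incidence_rate_per_1000000_total_population),
      # years since the first year in the data, so the intercept is interpretable
      year_c = year - min(year)
    )

  lm(log_rate ~ year_c * region, data = model_data)
}
```

`summary(fit_trend_model(cases_year))` would then report the baseline trend for the reference region and, in the `year_c:region...` terms, how each other region's trend differs from it. Because observations within a country are not independent, the standard errors from this simple model should be treated as optimistic.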