Project proposal

Author

Proud Otter, Ahmed Abdulla, Amelia Garcia, Minha Kim

renv::restore()
- The library is already synchronized with the lockfile.
library(tidyverse)

Dataset

Loading in Dataset

# Load the dataset
tb_data <- read_csv("data/who_tb_data.csv")
Rows: 5117 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): country, g_whoregion, iso2, iso3
dbl (14): iso_numeric, year, c_cdr, c_newinc_100k, cfr, e_inc_100k, e_inc_nu...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(tb_data)
Rows: 5,117
Columns: 18
$ country               <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ g_whoregion           <chr> "Eastern Mediterranean", "Eastern Mediterranean"…
$ iso_numeric           <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ iso2                  <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3                  <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ year                  <dbl> 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, …
$ c_cdr                 <dbl> 19, 26, 34, 32, 41, 47, 53, 59, 56, 50, 52, 50, …
$ c_newinc_100k         <dbl> 35, 50, 65, 61, 78, 90, 100, 111, 107, 95, 99, 9…
$ cfr                   <dbl> 0.37, 0.35, 0.31, 0.32, 0.28, 0.26, 0.24, 0.21, …
$ e_inc_100k            <dbl> 190, 189, 189, 189, 189, 189, 189, 189, 189, 189…
$ e_inc_num             <dbl> 38000, 38000, 40000, 43000, 44000, 46000, 48000,…
$ e_mort_100k           <dbl> 68.00, 63.00, 57.00, 58.00, 52.00, 47.00, 43.00,…
$ e_mort_exc_tbhiv_100k <dbl> 68.00, 63.00, 57.00, 58.00, 51.00, 47.00, 43.00,…
$ e_mort_exc_tbhiv_num  <dbl> 14000, 13000, 12000, 13000, 12000, 11000, 11000,…
$ e_mort_num            <dbl> 14000, 13000, 12000, 13000, 12000, 12000, 11000,…
$ e_mort_tbhiv_100k     <dbl> 0.17, 0.30, 0.27, 0.29, 0.29, 0.31, 0.32, 0.33, …
$ e_mort_tbhiv_num      <dbl> 34, 61, 58, 66, 67, 75, 82, 85, 96, 110, 120, 14…
$ e_pop_num             <dbl> 20130323, 20284311, 21378110, 22733047, 23560660…
# Count years of data per country
data_completeness <- tb_data %>%
  group_by(country) %>%
  summarise(
    n_years = n_distinct(year),
    complete = n_years == 24
  )

WHO TB Dataset

The dataset we have chosen is the World Health Organization Tuberculosis (TB) Burden Data, which was uploaded to Tidy Tuesday on November 11, 2025. The data come from the getTBinR package by Sam Abbott. The dataset has 5117 rows and 18 columns. It comprises 215 countries, with entries spanning from 2000 to 2023. Of these, the following countries have incomplete temporal coverage: Curaçao, Montenegro, Serbia, Sint Maarten (Dutch part), South Sudan, Timor-Leste. South Sudan and Timor-Leste lack earlier data because they gained independence in 2011 and 2002, respectively. The countries are classified into the following WHO-designated regions: Eastern Mediterranean, Europe, Africa, Western Pacific, Americas, South-East Asia.
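The list of countries with incomplete coverage can be derived from the data instead of being typed by hand. The following is a minimal sketch on a hypothetical toy data frame (`toy`, with three years instead of 24). It assumes "complete" means a country has one row for every year present in the data. The names `n_expected` and `incomplete` are illustrative and not part of the proposal's code.

```r
library(dplyr)

# Toy stand-in for tb_data (hypothetical values, three years instead of 24)
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "C", "C", "C"),
  year    = c(2000, 2001, 2002, 2001, 2002, 2000, 2001, 2002)
)

# Derive the expected number of years from the data rather than hard-coding it
n_expected <- n_distinct(toy$year)

# Countries observed in fewer years than the full span, with their first year
incomplete <- toy |>
  group_by(country) |>
  summarise(n_years = n_distinct(year), first_year = min(year)) |>
  filter(n_years < n_expected)

incomplete$country
```

Deriving `n_expected` from the data keeps the check valid if a later release of the dataset adds another year.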

Why have we chosen it?

While vaccines and effective treatments for tuberculosis have existed for several decades, TB persists as a global public health threat. Contemporary TB is predominantly a socio-economic disease, disproportionately affecting populations with inadequate living or working conditions and limited access to regular medical screening. We have chosen this dataset to explore which factors contribute to, are affected by, or indirectly indicate trends in contemporary TB. The dataset includes entries from 2000 to 2023, providing valuable information on trends in TB cases over the past two decades. Furthermore, it covers countries around the globe, so these trends can be compared across diverse climatic, welfare, healthcare accessibility, and socioeconomic contexts.

Questions

Question 1

According to the WHO, Africa, Western Pacific, and South-East Asia are the regions most affected by tuberculosis. We therefore narrow our focus to these three regions to get a clearer picture of the situation where the burden is heaviest. From 2003 to 2023, which of these regions saw the greatest change in TB mortality per 100,000 population relative to TB incidence per 100,000 population?

Question 2

Does a high case detection rate necessarily correlate with a low mortality rate? Alternatively, are there countries or regions where the limiting factor for TB treatment does not seem to be screening or detection but other factors?

Analysis plan

Question 1 Analysis

Variables to note

  1. e_inc_100k - Estimated incidence (all forms) per 100 000 population
  2. e_mort_100k - Estimated mortality of TB cases (all forms) per 100 000 population
  3. c_cdr - Detection rate of TB cases (all forms) as percentage
  4. g_whoregion - Geographical region classified by WHO
  5. (new variable) e_diff_100k - Estimated mortality divided by estimated incidence, i.e. the estimated ratio of TB deaths to TB cases

For this question, the variables involved in the analysis are year, country, g_whoregion, e_inc_100k, and e_mort_100k (mortalities per 100k population). We will filter the data frame to keep only rows whose g_whoregion (the dataset variable that gives the WHO region of the country) is “Africa”, “South-East Asia”, or “Western Pacific”. We will additionally drop the years 2000, 2001, and 2002, since this question focuses only on the two decades from 2003 to 2023. To compare mortalities against cases, we will create a new variable, e_diff_100k, by dividing e_mort_100k by e_inc_100k.

We will not need to merge any external data, since all of the required information is within the dataset. To complete the analysis, we plan to create a line graph of e_diff_100k across the two-decade period, color-coded by region.
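The Question 1 steps described above can be sketched as follows. The sketch uses a hypothetical toy data frame (`toy`) in place of `tb_data`. The proposal does not say how countries are aggregated to the regional level, so the unweighted mean per region and year used here is an assumption. The names `target_regions`, `q1`, and `q1_region` are illustrative.

```r
library(dplyr)

# Toy stand-in for tb_data (hypothetical values)
toy <- data.frame(
  country     = c("X", "X", "Y", "Y", "Z", "Z"),
  g_whoregion = c("Africa", "Africa", "Africa", "Africa", "Europe", "Europe"),
  year        = c(2002, 2003, 2003, 2023, 2003, 2023),
  e_inc_100k  = c(200, 200, 100, 100, 50, 40),
  e_mort_100k = c(40, 30, 20, 10, 5, 4)
)

target_regions <- c("Africa", "Western Pacific", "South-East Asia")

# Keep the target regions and years, then compute the mortality-to-incidence ratio
q1 <- toy |>
  filter(g_whoregion %in% target_regions, year >= 2003) |>
  mutate(e_diff_100k = e_mort_100k / e_inc_100k)

# Assumption: each region-year is summarised by the unweighted mean across countries
q1_region <- q1 |>
  group_by(g_whoregion, year) |>
  summarise(mean_diff = mean(e_diff_100k), .groups = "drop")

q1_region
```

A table shaped like `q1_region` could then be passed to ggplot2, with year on the x-axis, the ratio on the y-axis, `geom_line()`, and colour mapped to g_whoregion. A population-weighted mean (using e_pop_num) would be a reasonable alternative to the unweighted mean.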

Addressing Missing Values

# how we plan to filter for specific regions
# (labels must match g_whoregion exactly; the data spells it "South-East Asia")
filtered_regions <- tb_data |>
  filter(
    g_whoregion %in% c("Africa", "Western Pacific", "South-East Asia")
  )

# checking for missing values

colSums(is.na(filtered_regions))
              country           g_whoregion           iso_numeric 
                    0                     0                     0 
                 iso2                  iso3                  year 
                    0                     0                     0 
                c_cdr         c_newinc_100k                   cfr 
                  109                    65                    44 
           e_inc_100k             e_inc_num           e_mort_100k 
                    0                     0                     0 
e_mort_exc_tbhiv_100k  e_mort_exc_tbhiv_num            e_mort_num 
                    0                     0                     0 
    e_mort_tbhiv_100k      e_mort_tbhiv_num             e_pop_num 
                    0                     0                     0 

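In the output above, the columns needed for Question 1 (e_inc_100k and e_mort_100k) have no missing values. Because filter() with %in% silently drops any label that does not match g_whoregion exactly, and the dataset spells the region “South-East Asia”, this result should be confirmed against the number of rows kept for each region.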
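One way to guard the region filter is to compare the requested labels with the values actually present in g_whoregion, and then to count missing values only in the columns Question 1 needs. The following is a minimal sketch on a hypothetical toy data frame. The names `requested`, `unmatched`, and `na_counts` are illustrative, and the toy values are not taken from the WHO data.

```r
library(dplyr)

# Toy stand-in for tb_data (hypothetical values)
toy <- data.frame(
  g_whoregion = c("Africa", "South-East Asia", "Western Pacific", "Europe"),
  e_inc_100k  = c(200, 150, 90, 10),
  e_mort_100k = c(30, NA, 8, 1)
)

# A label list with one spelling that does not occur in the data
requested <- c("Africa", "Western Pacific", "Southeast Asia")

# Labels that match no row; filter() would drop them without any warning
unmatched <- setdiff(requested, unique(toy$g_whoregion))

# Missing values in only the columns Question 1 uses, with the matching spelling
na_counts <- toy |>
  filter(g_whoregion %in% c("Africa", "Western Pacific", "South-East Asia")) |>
  summarise(across(c(e_inc_100k, e_mort_100k), ~ sum(is.na(.x))))

unmatched
na_counts
```

A non-empty `unmatched` vector means at least one region was silently excluded, so the missing-value counts would describe fewer rows than intended.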

Question 2 Analysis

Variables to Note

  1. g_whoregion - Geographical region classified by WHO
  2. c_cdr - TB detection rate
  3. e_pop_num - Estimated population

To answer whether high case detection actually leads to lower death rates, we will filter the dataset to the year 2023 to provide a current snapshot of global TB control. Using 2023 also helps avoid any outlier events during COVID-19. The analysis will use c_cdr (TB detection rate) as the independent variable on the x-axis to represent health system effort, and the mortality rate per 100,000 population (e_mort_100k) as the dependent variable on the y-axis to represent outcomes. We will use g_whoregion to color-code countries by region and e_pop_num to size the data points, so that populous nations are visually distinct. No new variables or external datasets are required: the data contains all the metrics needed to identify countries that deviate from the expected trend of “higher detection equals lower mortality”.
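The plot described above can be sketched with ggplot2 as follows. The data frame `toy_2023` and its values are hypothetical, and taking e_mort_100k as the mortality variable is an assumption. The Spearman rank correlation at the end is not part of the stated plan. It is included only as one possible numeric summary of the trend.

```r
library(ggplot2)

# Toy stand-in for the 2023 slice of tb_data (hypothetical values)
toy_2023 <- data.frame(
  country     = c("A", "B", "C", "D"),
  g_whoregion = c("Africa", "Africa", "Europe", "Americas"),
  c_cdr       = c(40, 60, 80, 90),
  e_mort_100k = c(50, 30, 12, 5),
  e_pop_num   = c(2e7, 5e6, 8e7, 3e8)
)

# Detection rate on x, mortality on y, colour by region, point size by population
p <- ggplot(toy_2023, aes(x = c_cdr, y = e_mort_100k)) +
  geom_point(aes(colour = g_whoregion, size = e_pop_num), alpha = 0.7) +
  labs(
    x = "Case detection rate (%)",
    y = "Estimated TB mortality per 100,000",
    colour = "WHO region",
    size = "Population"
  )

# Optional numeric summary (an assumption, not part of the stated plan)
rho <- cor(toy_2023$c_cdr, toy_2023$e_mort_100k,
           method = "spearman", use = "complete.obs")
```

Countries that sit well above the overall trend, with high detection but high mortality, would be the candidates where factors other than screening appear to limit outcomes.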

Addressing Missing Values

q2_filtered <- filter(tb_data, year == 2023)

# checking for missing values
colSums(is.na(q2_filtered))
              country           g_whoregion           iso_numeric 
                    0                     0                     0 
                 iso2                  iso3                  year 
                    0                     0                     0 
                c_cdr         c_newinc_100k                   cfr 
                   10                     9                     3 
           e_inc_100k             e_inc_num           e_mort_100k 
                    0                     0                     1 
e_mort_exc_tbhiv_100k  e_mort_exc_tbhiv_num            e_mort_num 
                    1                     1                     1 
    e_mort_tbhiv_100k      e_mort_tbhiv_num             e_pop_num 
                    1                     1                     0 
# check which countries have which missing values
q2_filtered %>%
  filter(if_any(everything(), is.na)) %>%
  mutate(
    na_cols = apply(., 1, function(x) {
      paste(names(.)[is.na(x)], collapse = ", ")
    })
  ) %>%
  select(country, na_cols)
# A tibble: 11 × 2
   country                               na_cols                                
   <chr>                                 <chr>                                  
 1 American Samoa                        c_cdr, c_newinc_100k                   
 2 Anguilla                              c_cdr, c_newinc_100k                   
 3 British Virgin Islands                c_cdr, cfr                             
 4 Curaçao                               c_cdr, c_newinc_100k                   
 5 Democratic People's Republic of Korea e_mort_100k, e_mort_exc_tbhiv_100k, e_…
 6 Djibouti                              c_cdr, c_newinc_100k                   
 7 Latvia                                c_cdr, c_newinc_100k                   
 8 Monaco                                c_cdr, c_newinc_100k                   
 9 Montserrat                            c_cdr, c_newinc_100k, cfr              
10 San Marino                            c_cdr, c_newinc_100k, cfr              
11 Wallis and Futuna                     c_cdr, c_newinc_100k                   

We identified 11 countries with missing values in the 2023 data. Ten of them lack the TB detection rate (c_cdr), which is central to answering Question 2, and the Democratic People's Republic of Korea lacks the mortality estimates. These countries will be excluded from the Question 2 analysis; because they are few, their exclusion is unlikely to significantly affect our exploration. We will clearly note their absence in our project and findings, as we cannot presume to understand how these countries would have affected the results.
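The exclusion step can be sketched so that the dropped countries are kept in a separate table for reporting. This sketch uses a hypothetical toy data frame. It assumes the two variables required for the plot are c_cdr and e_mort_100k. The names `needed`, `excluded`, and `q2_complete` are illustrative.

```r
library(dplyr)

# Toy stand-in for the 2023 slice of tb_data (hypothetical values)
toy_2023 <- data.frame(
  country     = c("A", "B", "C", "D"),
  c_cdr       = c(80, NA, 65, 70),
  e_mort_100k = c(5, 2, NA, 12)
)

# Assumption: these are the only variables that must be non-missing
needed <- c("c_cdr", "e_mort_100k")

# Rows missing any required variable, kept so the exclusions can be reported
excluded <- toy_2023 |>
  filter(if_any(all_of(needed), is.na))

# Rows with every required variable present, used for the analysis
q2_complete <- toy_2023 |>
  filter(if_all(all_of(needed), ~ !is.na(.x)))

excluded$country
q2_complete$country
```

Restricting the check to the required columns avoids dropping a country only because an unrelated column such as cfr is missing.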