renv::restore()
- The library is already synchronized with the lockfile.
library(tidyverse)
Proud Otter, Ahmed Abdulla, Amelia Garcia, Minha Kim
A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.
Make sure to load the data and use inline code for some of this information.
Rows: 5117 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): country, g_whoregion, iso2, iso3
dbl (14): iso_numeric, year, c_cdr, c_newinc_100k, cfr, e_inc_100k, e_inc_nu...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 5,117
Columns: 18
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Af…
$ g_whoregion <chr> "Eastern Mediterranean", "Eastern Mediterranean"…
$ iso_numeric <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", …
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ year <dbl> 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, …
$ c_cdr <dbl> 19, 26, 34, 32, 41, 47, 53, 59, 56, 50, 52, 50, …
$ c_newinc_100k <dbl> 35, 50, 65, 61, 78, 90, 100, 111, 107, 95, 99, 9…
$ cfr <dbl> 0.37, 0.35, 0.31, 0.32, 0.28, 0.26, 0.24, 0.21, …
$ e_inc_100k <dbl> 190, 189, 189, 189, 189, 189, 189, 189, 189, 189…
$ e_inc_num <dbl> 38000, 38000, 40000, 43000, 44000, 46000, 48000,…
$ e_mort_100k <dbl> 68.00, 63.00, 57.00, 58.00, 52.00, 47.00, 43.00,…
$ e_mort_exc_tbhiv_100k <dbl> 68.00, 63.00, 57.00, 58.00, 51.00, 47.00, 43.00,…
$ e_mort_exc_tbhiv_num <dbl> 14000, 13000, 12000, 13000, 12000, 11000, 11000,…
$ e_mort_num <dbl> 14000, 13000, 12000, 13000, 12000, 12000, 11000,…
$ e_mort_tbhiv_100k <dbl> 0.17, 0.30, 0.27, 0.29, 0.29, 0.31, 0.32, 0.33, …
$ e_mort_tbhiv_num <dbl> 34, 61, 58, 66, 67, 75, 82, 85, 96, 110, 120, 14…
$ e_pop_num <dbl> 20130323, 20284311, 21378110, 22733047, 23560660…
The dataset we have chosen is the World Health Organization Tuberculosis (TB) Burden Data, which was uploaded to Tidy Tuesday on November 11, 2025. The dataset comes from the getTBinR package by Sam Abbott. It has 5117 rows and 18 columns and comprises 215 countries, with entries spanning 2000 to 2023. Of these, the following countries have incomplete temporal coverage: Curaçao, Montenegro, Serbia, Sint Maarten (Dutch part), South Sudan, Timor-Leste. South Sudan and Timor-Leste lack earlier data because they gained independence in 2011 and 2002, respectively. The countries are classified into the following WHO-designated regions: Eastern Mediterranean, Europe, Africa, Western Pacific, Americas, South-East Asia.
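The counts quoted above (number of countries, year range, countries with gaps) can be computed rather than typed by hand. The sketch below shows one way to do that with dplyr. `tb_toy` is a hypothetical stand-in with made-up values, not the real data, and the object names are our own.

```r
library(dplyr)

# Toy stand-in for the TB burden data; all values are made up for illustration.
tb_toy <- tibble(
  country = c("A", "A", "A", "B", "B", "C", "C", "C"),
  g_whoregion = c("Africa", "Africa", "Africa", "Europe", "Europe",
                  "Americas", "Americas", "Americas"),
  year = c(2000, 2001, 2002, 2001, 2002, 2000, 2001, 2002)
)

n_countries <- n_distinct(tb_toy$country)  # number of distinct countries
year_span <- range(tb_toy$year)            # first and last year in the data
n_years <- n_distinct(tb_toy$year)         # number of distinct years

# A country with fewer rows than there are distinct years has gaps in coverage.
incomplete <- tb_toy |>
  count(country, name = "n_years_present") |>
  filter(n_years_present < n_years) |>
  pull(country)
```

In a Quarto document, values such as `n_countries` can then be referenced with inline R code so the text stays in sync with the data.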
While vaccines and effective treatments for tuberculosis have existed for several decades, TB persists as a public health threat globally. Contemporary TB is predominantly a socio-economic disease, disproportionately affecting populations with inadequate living or working conditions and limited access to regular medical screening. We have chosen this dataset in order to explore which factors contribute to, are affected by, or indirectly indicate trends in contemporary TB. The dataset includes entries from 2000 to 2023, providing valuable information on trends in TB cases over the past two decades. It also covers countries around the globe, so trends can be compared across diverse climatic, welfare, healthcare accessibility, and socioeconomic contexts.
The two questions you want to answer.
According to WHO, Africa, the Western Pacific, and South-East Asia are the regions most affected by tuberculosis. We therefore narrow our focus to these three regions to get a clearer picture of the situation where the burden is heaviest. From 2003 to 2023, which of these regions saw the greatest change in TB mortality per 100,000 population relative to TB incidence per 100,000 population?
Does a high case detection rate necessarily correlate with a low mortality rate? Alternatively, are there countries or regions where the limiting factor for TB treatment appears to be something other than screening or detection?
A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).
e_inc_100k - Estimated incidence (all forms) per 100 000 population
e_mort_100k - Estimated mortality of TB cases (all forms) per 100 000 population
c_cdr - Detection rate of TB cases (all forms) as a percentage
g_whoregion - Geographical region classified by WHO
e_diff_100k - Estimated mortality divided by estimated incidence, i.e. the estimated ratio of TB deaths to incident TB cases
For this question, the variables involved in the analysis are year, country, e_inc_100k, and e_mort_100k (mortalities per 100k population). We will filter the data frame to entries whose g_whoregion (the variable that identifies the country's WHO region) is “Africa”, “South-East Asia”, or “Western Pacific”. We will also drop the years 2000, 2001, and 2002, since this question covers only the two decades from 2003 to 2023. To compare mortality with cases, we will create a new variable, e_diff_100k, by dividing e_mort_100k by e_inc_100k. We will not need to merge any external data, since all of the required information is in the dataset. To complete the analysis, we plan to create a line graph of e_diff_100k across the two-decade period, color-coded by region.
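A minimal sketch of the plan for question one, using a toy tibble with made-up values in place of the real data. The region labels follow the spelling used in the dataset ("South-East Asia"). Averaging the country-level ratios within each region and year is our assumption about how the line graph would be built, and `tb_toy`, `q1`, and `q1_region` are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the TB burden data; all values are made up for illustration.
tb_toy <- tibble(
  country = rep(c("A", "B", "C"), each = 3),
  g_whoregion = rep(c("Africa", "South-East Asia", "Europe"), each = 3),
  year = rep(c(2002, 2003, 2023), times = 3),
  e_inc_100k = c(200, 200, 100, 300, 300, 150, 50, 50, 40),
  e_mort_100k = c(50, 40, 10, 60, 60, 15, 5, 5, 2)
)

focus_regions <- c("Africa", "South-East Asia", "Western Pacific")

# Keep the three focus regions and the years 2003-2023, then build the ratio.
q1 <- tb_toy |>
  filter(g_whoregion %in% focus_regions, year >= 2003, year <= 2023) |>
  mutate(e_diff_100k = e_mort_100k / e_inc_100k)

# Assumption: summarise each region-year by the unweighted mean country ratio.
q1_region <- q1 |>
  group_by(g_whoregion, year) |>
  summarise(mean_ratio = mean(e_diff_100k), .groups = "drop")

# Line graph of the ratio over time, one line per region.
p <- ggplot(q1_region, aes(x = year, y = mean_ratio, colour = g_whoregion)) +
  geom_line() +
  labs(x = "Year", y = "Mortality / incidence", colour = "WHO region")
```

An unweighted mean treats small and large countries equally; weighting by e_pop_num would be an alternative design choice.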
country g_whoregion iso_numeric
0 0 0
iso2 iso3 year
0 0 0
c_cdr c_newinc_100k cfr
109 65 44
e_inc_100k e_inc_num e_mort_100k
0 0 0
e_mort_exc_tbhiv_100k e_mort_exc_tbhiv_num e_mort_num
0 0 0
e_mort_tbhiv_100k e_mort_tbhiv_num e_pop_num
0 0 0
There are no missing values for the columns necessary for question one.
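A per-column count of missing values like the one above can be produced with `colSums(is.na(...))`. The sketch below runs it on a toy data frame with made-up values and then checks only the columns question one relies on; `tb_toy` and `q1_cols` are illustrative names.

```r
# Toy stand-in for the TB burden data; all values are made up for illustration.
tb_toy <- data.frame(
  country = c("A", "A", "B", "B"),
  g_whoregion = c("Africa", "Africa", "Europe", "Europe"),
  year = c(2022, 2023, 2022, 2023),
  e_inc_100k = c(200, 190, 10, 9),
  e_mort_100k = c(40, 35, 1, 1),
  c_cdr = c(60, NA, 85, 87)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(tb_toy))

# Columns used for question one; TRUE if none of them has a missing value.
q1_cols <- c("country", "g_whoregion", "year", "e_inc_100k", "e_mort_100k")
q1_complete <- all(na_counts[q1_cols] == 0)
```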
g_whoregion - Geographical region classified by WHO
c_cdr - TB detection rate
e_pop_num - Estimated population
To answer whether high case detection actually leads to lower death rates, we will filter the dataset to the year 2023 to provide a current snapshot of global TB control. Using 2023 also helps avoid outlier events during COVID-19. The analysis will use c_cdr (TB detection rate) as the independent variable on the x-axis to represent health system effort, and the mortality rate per 100k (e_mort_100k) as the dependent variable on the y-axis to represent outcomes. We will use g_whoregion to color-code countries by region and e_pop_num to size the data points, so that populous nations are visually distinct. No new variables or external datasets are required. The data contains all the metrics needed to identify countries that deviate from the expected trend of “higher detection equals lower mortality”.
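A minimal sketch of the plan for question two, again on a toy tibble with made-up values. Dropping rows with a missing c_cdr or e_mort_100k and computing a Spearman rank correlation as a rough numeric check are our assumptions, not steps stated in the plan; `tb_toy`, `q2`, and `rho` are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the TB burden data; all values are made up for illustration.
tb_toy <- tibble(
  country = c("A", "B", "C", "D", "E", "A"),
  g_whoregion = c("Africa", "Africa", "Europe", "Europe", "Americas", "Africa"),
  year = c(2023, 2023, 2023, 2023, 2023, 2022),
  c_cdr = c(40, 60, 80, 90, NA, 35),
  e_mort_100k = c(50, 30, 35, 5, 20, 55),
  e_pop_num = c(2e7, 5e7, 1e7, 8e7, 3e6, 1.9e7)
)

# 2023 snapshot; rows missing either plotted variable are dropped (assumption).
q2 <- tb_toy |>
  filter(year == 2023, !is.na(c_cdr), !is.na(e_mort_100k))

# Detection rate vs mortality, coloured by region and sized by population.
p <- ggplot(q2, aes(x = c_cdr, y = e_mort_100k,
                    colour = g_whoregion, size = e_pop_num)) +
  geom_point(alpha = 0.7) +
  labs(x = "Case detection rate (%)", y = "Mortality per 100k",
       colour = "WHO region", size = "Population")

# Rough numeric check of the relationship (an addition, not part of the plan).
rho <- cor(q2$c_cdr, q2$e_mort_100k, method = "spearman")
```

Countries that sit far above the overall trend in the plot (high detection, high mortality) would be the candidates for "limiting factor is not detection".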
country g_whoregion iso_numeric
0 0 0
iso2 iso3 year
0 0 0
c_cdr c_newinc_100k cfr
10 9 3
e_inc_100k e_inc_num e_mort_100k
0 0 1
e_mort_exc_tbhiv_100k e_mort_exc_tbhiv_num e_mort_num
1 1 1
e_mort_tbhiv_100k e_mort_tbhiv_num e_pop_num
1 1 0
# A tibble: 11 × 2
country na_cols
<chr> <chr>
1 American Samoa c_cdr, c_newinc_100k
2 Anguilla c_cdr, c_newinc_100k
3 British Virgin Islands c_cdr, cfr
4 Curaçao c_cdr, c_newinc_100k
5 Democratic People's Republic of Korea e_mort_100k, e_mort_exc_tbhiv_100k, e_…
6 Djibouti c_cdr, c_newinc_100k
7 Latvia c_cdr, c_newinc_100k
8 Monaco c_cdr, c_newinc_100k
9 Montserrat c_cdr, c_newinc_100k, cfr
10 San Marino c_cdr, c_newinc_100k, cfr
11 Wallis and Futuna c_cdr, c_newinc_100k
We identified 11 countries with missing values in 2023. Ten of them lack the TB detection rate (c_cdr), which is central to answering question 2, and one, the Democratic People's Republic of Korea, lacks the mortality estimates. These countries were excluded from the Question 2 analysis, as their limited number is unlikely to significantly affect our exploration. The exclusion of these 11 countries will be noted clearly in our project and findings, as we cannot presume to know how they would affect the results.
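A table listing the missing columns per country, like the one above, could be built by reshaping to long format and collapsing the names of the NA columns. The sketch below is one possible approach on a toy tibble with made-up values; it assumes all measured columns are numeric, and `tb_toy` and `na_by_country` are illustrative names.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the TB burden data; all values are made up for illustration.
tb_toy <- tibble(
  country = c("A", "B", "C"),
  year = c(2023, 2023, 2023),
  c_cdr = c(NA, 70, 85),
  cfr = c(NA, 0.10, 0.05),
  e_mort_100k = c(12, 8, NA)
)

# One row per country that has at least one NA, with the affected columns listed.
na_by_country <- tb_toy |>
  filter(year == 2023) |>
  pivot_longer(-c(country, year), names_to = "column", values_to = "value") |>
  filter(is.na(value)) |>
  group_by(country) |>
  summarise(na_cols = paste(column, collapse = ", "), .groups = "drop")

# Countries to exclude because the detection rate itself is missing.
missing_cdr <- na_by_country |>
  filter(grepl("c_cdr", na_cols, fixed = TRUE)) |>
  pull(country)
```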