Anemia in Women

Exploratory data analysis

Research question(s)

State your research question(s) clearly.

Do women from developing countries have higher rates of anemia than women from developed countries?
How does age affect anemia rates in women?
How does pregnancy affect anemia rates in women?

Data collection and cleaning

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(readxl)

anaemia_estimates_inputdata_final <- read_excel("data/anaemia-estimates_inputdata_final.xlsx", 
    sheet = "data")
# Countries classified as developing (see the source described below)
developing_countries <- c(
  "Albania", "Armenia", "Azerbaijan", "Georgia", "North Macedonia",
  "Republic of Moldova", "Montenegro", "Romania", "Serbia", "Ukraine",
  "Egypt", "Morocco", "Tunisia", "Angola", "Benin", "Burkina Faso",
  "Burundi", "Cameroon", "Cabo Verde", "Central African Republic", "Chad",
  "Congo", "Democratic Republic of the Congo", "Cote d'Ivoire",
  "Equatorial Guinea", "Eswatini", "Ethiopia", "Gabon", "Gambia", "Ghana",
  "Guinea", "Kenya", "Lesotho", "Liberia", "Madagascar", "Malawi", "Mali",
  "Mauritius", "Mozambique", "Namibia", "Niger", "Nigeria", "Rwanda",
  "Sao Tome and Principe", "Senegal", "Sierra Leone", "Somalia",
  "South Africa", "Sudan", "United Republic of Tanzania", "Togo", "Uganda",
  "Zambia", "Zimbabwe", "Belize", "Costa Rica", "Cuba", "Dominica",
  "Dominican Republic", "El Salvador", "Guatemala", "Haiti", "Honduras",
  "Mexico", "Nicaragua", "Panama", "Argentina",
  "Bolivia (Plurinational State of)", "Brazil", "Colombia", "Ecuador",
  "Guyana", "Peru", "Afghanistan", "Bangladesh", "Bhutan", "Cambodia",
  "China", "India", "Indonesia", "Kazakhstan",
  "Democratic People's Republic of Korea", "Kyrgyzstan",
  "Lao People's Democratic Republic", "Malaysia", "Maldives", "Mongolia",
  "Myanmar", "Nepal", "Pakistan", "Philippines", "Sri Lanka", "Tajikistan",
  "Thailand", "Turkmenistan", "Uzbekistan", "Viet Nam",
  "Iran (Islamic Republic of)", "Iraq", "Jordan", "Lebanon",
  "Occupied Palestinian Territory", "Yemen", "Fiji", "Marshall Islands",
  "Nauru", "Papua New Guinea", "Samoa", "Solomon Islands", "Tuvalu",
  "Vanuatu", "Antigua and Barbuda", "Bahrain",
  "Micronesia (Federated States of)", "Oman", "Qatar", "Timor-Leste"
)
anaemia_estimates_inputdata_final |>
  # Keep samples of women only (sex code 2)
  filter(sex == 2) |>
  # Keep only the columns relevant to the research questions
  select(
    member_state, beginyear, sex, agerange, pregnancy, samplesize, mean,
    below130, below120, below110, below115, below100, below90, below80, below70
  ) |>
  # Label each country as developing or developed
  mutate(
    country_classification = if_else(member_state %in% developing_countries, "Developing", "Developed")
  )

# A tibble: 674 × 16
   member_state beginyear sex   agerange pregnancy    samplesize  mean below130
   <chr>            <dbl> <chr> <chr>    <chr>        <chr>      <dbl>    <dbl>
 1 Afghanistan       2002 2     15-49    all women    916          NA    NA    
 2 Afghanistan       2004 2     15-49    non-pregnant 1142        128    NA    
 3 Afghanistan       2013 2     15-49    non-pregnant 993         121.   NA    
 4 Afghanistan       2013 2     15-49    pregnant     144         118.   NA    
 5 Albania           2008 2     15-49    non-pregnant 7393        128.    0.537
 6 Albania           2008 2     17-43    pregnant     134         121.    0.795
 7 Albania           2017 2     15-49    non-pregnant 10339       127.    0.546
 8 Albania           2017 2     15-49    pregnant     280         117.    0.813
 9 Angola            2006 2     15-49    non-pregnant 2561        120.   NA    
10 Angola            2006 2     15-49    pregnant     345         111.   NA    
# ℹ 664 more rows
# ℹ 8 more variables: below120 <dbl>, below110 <dbl>, below115 <dbl>,
#   below100 <dbl>, below90 <dbl>, below80 <dbl>, below70 <dbl>,
#   country_classification <chr>

We found this data set on the WHO website. We did not need to do any scraping or run any surveys; we downloaded the data set as an Excel file and loaded it into RStudio. To make the data fit our research questions, we first filtered on the sex variable to include only women. We then selected only the columns relevant to our research questions. We originally planned to add a column indicating whether a country is a developing or a developed nation once we found a reputable source for that information. Update: We created this column by mutating the dataset and classifying each country according to whether it appears in the “developing_countries” vector we created. That vector is based on the list of developing countries declared by the Australian Minister for Foreign Affairs (https://www.dfat.gov.au/sites/default/files/list-developing-countries.pdf). Countries on the PDF’s list but not in our data were removed, while countries in our data but not on the PDF were individually cross-checked against data on the World Bank’s website.
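The cross-check between the country list and the data can also be done programmatically. The following is a minimal sketch on toy values, not the real data: `toy_data` and `toy_developing` are hypothetical stand-ins for the WHO table and the `developing_countries` vector. It assumes that a spelling mismatch (for example "Vietnam" versus "Viet Nam") is the main way a country could be silently labelled "Developed".

```r
library(dplyr)

# Toy stand-ins (hypothetical values) for the real data and country list
toy_data <- tibble(
  member_state = c("Albania", "Angola", "France", "Vietnam"),
  sex          = c("2", "2", "2", "2")
)
toy_developing <- c("Albania", "Angola", "Viet Nam", "Chad")

# Names on the list that never appear in the data (possible spelling mismatches)
unmatched_in_list <- setdiff(toy_developing, unique(toy_data$member_state))

# Countries in the data that will fall through to "Developed" by default
default_developed <- setdiff(unique(toy_data$member_state), toy_developing)

# Store the labelled result in an object so later steps can reuse it
toy_labelled <- toy_data |>
  mutate(
    country_classification = if_else(
      member_state %in% toy_developing, "Developing", "Developed"
    )
  )

count(toy_labelled, country_classification)
```

Reviewing `default_developed` by hand is one way to catch countries that were labelled "Developed" only because their name is spelled differently in the two sources.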

Data description

Have an initial draft of your data description section. Your data description should be about your analysis-ready data.

Each observation in our dataset is a sample of women from one country. The attributes are country name, sex, the year the data was collected, the age range of participants in the sample, pregnancy status, sample size, the mean hemoglobin level for that sample, and the proportions of the sample with a hemoglobin level below each of several thresholds.

The original dataset from which ours was derived was funded by the World Health Organization and created for the analysis of hemoglobin data for global anemia estimates in 2021. The original data was collected via health and nutrition examinations, household surveys, and anonymized individual records drawn from various databases. The creation of our dataset was motivated by the fact that anemia is the most common blood condition globally and a serious public health issue: the low red blood cell counts associated with the condition impair the body’s ability to perform physiological functions and increase the risk of mortality. Hemoglobin thresholds can differ by age group and pregnancy status, and vulnerable populations include pregnant women and children. Our group wanted to focus on how estimated anemia rates in women relate to pregnancy status (pregnant/non-pregnant), age range, and country status (developing/developed). Outlining these distributions can help identify whether these factors are associated with anemia rates, and whether they could be used to inform and target the scale of anemia prevention measures for vulnerable populations.
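Because the threshold depends on pregnancy status, a single anemia proportion per sample has to be taken from different columns. The sketch below assumes the commonly cited WHO sea-level cutoffs of 120 g/L for non-pregnant women and 110 g/L for pregnant women, and uses a hypothetical toy table (`toy_samples`) rather than the real data. Samples labelled "all women" mix both groups, so they are left as missing here.

```r
library(dplyr)

# Toy samples (hypothetical values) with the same column names as the real data
toy_samples <- tibble(
  pregnancy = c("non-pregnant", "pregnant", "all women"),
  below120  = c(0.30, 0.55, 0.35),
  below110  = c(0.10, 0.40, 0.15)
)

# Pick the proportion below the cutoff that applies to each sample
toy_anemia <- toy_samples |>
  mutate(
    anemia_prop = case_when(
      pregnancy == "pregnant"     ~ below110,  # assumed cutoff: 110 g/L
      pregnancy == "non-pregnant" ~ below120,  # assumed cutoff: 120 g/L
      TRUE                        ~ NA_real_   # mixed samples: no single cutoff
    )
  )

toy_anemia
```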

Additional preprocessing of the data so far includes filtering by sex (keeping samples with women only; mixed-sex samples were excluded) and labeling each country by status (developed or developing).

Data limitations

A potential limitation of our dataset is that the number and type of sample categories are not the same for each country (e.g. Afghanistan accounts for 8 different sample categories while Albania accounts for only 6; samples for Albania are all adjusted for altitude while none are for Angola; different sets of age ranges are also sampled across categories). In addition, the dataset does not account for environmental confounders, other than relative country status, that may affect differences in hemoglobin deficiency across countries. This makes it difficult to isolate specific causal factors affecting anemia prevalence. For uniformity, we decided that we could filter out some of these samples.
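One possible way to filter for uniformity is sketched below. The choice of the "15-49" age range and the exclusion of "all women" samples are assumptions made for illustration, not decisions we have finalized, and `toy_samples` is a hypothetical toy table.

```r
library(dplyr)

# Toy samples (hypothetical values) with uneven categories per country
toy_samples <- tibble(
  member_state = c("Afghanistan", "Afghanistan", "Albania", "Albania", "Angola"),
  agerange     = c("15-49", "15-49", "15-49", "17-43", "15-49"),
  pregnancy    = c("all women", "non-pregnant", "non-pregnant", "pregnant", "pregnant")
)

# Keep one common age range and drop samples that mix pregnancy statuses
uniform_samples <- toy_samples |>
  filter(agerange == "15-49", pregnancy != "all women")

# Check how many samples each country contributes after filtering
samples_per_country <- count(uniform_samples, member_state)
samples_per_country
```

Counting samples per country after filtering shows how much data each country keeps, which helps judge whether the filter is too strict.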

In addition, the authors of the original WHO dataset report note processes that might have influenced which data were observed and recorded and which were not: “The main limitation of our analysis is that despite the extensive data search and access, there were considerable gaps in data availability. As a result, the estimates may not capture the full variation across countries and regions, tending to “shrink” towards global means when data are sparse. This may have especially affected the estimates in high-income and upper-middle-income countries, where anemia prevalence is low and typically addressed in a clinical setting.”

Another limitation is that the original dataset does not include a column for the development status of each country, which our first research question requires. We had to add this classification ourselves from an external list, so our results depend on how that list defines a developing country.

Exploratory data analysis

Perform an (initial) exploratory data analysis.

# Mean hemoglobin for pregnant vs. non-pregnant women ("all women" samples excluded)
preg_hemo_analysis <- anaemia_estimates_inputdata_final |>
  select(mean, pregnancy) |>
  filter(pregnancy != "all women") |>
  group_by(pregnancy) |>
  summarize(hemoglobin_mean = mean(mean, na.rm = TRUE))
preg_hemo_analysis

# A tibble: 2 × 2
  pregnancy    hemoglobin_mean
  <chr>                  <dbl>
1 non-pregnant            117.
2 pregnant                114.

# Bar chart of the mean hemoglobin by pregnancy status
preg_hemo_analysis |>
  ggplot(aes(x = pregnancy, y = hemoglobin_mean)) +
  # geom_col() plots the values as given, like geom_bar(stat = "identity")
  geom_col(width = 0.5,
           fill = "lightblue") +
  theme_minimal() +
  labs(
    title = "Mean hemoglobin in pregnant vs. non-pregnant women",
    x = "Pregnancy status",
    y = "Mean hemoglobin",
    caption = "Source: WHO"
  )
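The group means above give every sample equal weight, regardless of how many women it contains. As a hedged alternative, the sketch below weights each sample mean by its sample size. It uses a hypothetical toy table (`toy_samples`) and assumes, as in the real data, that `samplesize` is stored as text and must be converted first.

```r
library(dplyr)

# Toy samples (hypothetical values); samplesize is text, like the real column
toy_samples <- tibble(
  pregnancy  = c("non-pregnant", "non-pregnant", "pregnant", "pregnant"),
  samplesize = c("1000", "3000", "100", "300"),
  mean       = c(120, 128, 110, 118)
)

weighted_summary <- toy_samples |>
  mutate(samplesize = as.numeric(samplesize)) |>
  # weighted.mean() has no per-weight NA handling, so drop incomplete rows first
  filter(!is.na(mean), !is.na(samplesize)) |>
  group_by(pregnancy) |>
  summarize(
    unweighted_mean = mean(mean),
    weighted_mean   = weighted.mean(mean, w = samplesize)
  )

weighted_summary
```

If the weighted and unweighted means differ noticeably on the real data, that would suggest a few large surveys are pulling the estimate in one direction.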

# Mean hemoglobin by age range
age_hemo_analysis <- anaemia_estimates_inputdata_final |>
  select(agerange, mean) |>
  group_by(agerange) |>
  summarize(hemoglobin_mean = mean(mean, na.rm = TRUE)) |>
  # Drop age ranges where every sample mean was missing (mean of nothing is NaN)
  filter(!is.nan(hemoglobin_mean))
age_hemo_analysis

# A tibble: 77 × 2
   agerange hemoglobin_mean
   <chr>              <dbl>
 1 0-59                111.
 2 0-60                113.
 3 0-71                116.
 4 10-49               127 
 5 11-18               117 
 6 12-35               106 
 7 12-47               120 
 8 12-48               121.
 9 12-59               117.
10 12-60               118.
# ℹ 67 more rows
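With 77 distinct, overlapping age ranges stored as text, the table above is hard to read as a trend. One option, sketched here on hypothetical toy values (`toy_ages`), is to split each range into numeric bounds and sort by the midpoint. This assumes every value has the form "min-max"; any other format in the real data would need separate handling.

```r
library(dplyr)
library(tidyr)

# Toy summary (hypothetical values) in the same shape as age_hemo_analysis
toy_ages <- tibble(
  agerange        = c("15-49", "12-35", "0-59"),
  hemoglobin_mean = c(121, 106, 111)
)

age_parsed <- toy_ages |>
  # Split "min-max" into two numeric columns, keeping the original label
  separate(agerange, into = c("age_min", "age_max"), sep = "-",
           convert = TRUE, remove = FALSE) |>
  # The midpoint gives a single number to order or plot the ranges by
  mutate(age_mid = (age_min + age_max) / 2) |>
  arrange(age_mid)

age_parsed
```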

Questions for reviewers

List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.

Were the research questions clear and concise? Do you see these questions successfully being answered with the data we have?

Are there any limitations or biases in the data, beyond those already mentioned, that you foresee should be taken into account?

Did the exploratory data analysis offer clear insight into our project?