An Exploration of Arabica Coffee and its Attributes

Exploratory data analysis

Research question(s)

  1. How do Arabica beans from different regions differ in their aroma, flavor, and other scores? Does region affect the relationship between different scores?

    • Regions include North America (United States), Central America (Costa Rica, El Salvador, Guatemala, Honduras, Mexico, Nicaragua, Panama, Haiti), South America (Brazil, Colombia, Ecuador, Peru), Africa (Burundi, Ethiopia, Kenya, Malawi, Rwanda, Tanzania, Uganda, Zambia, Cote d'Ivoire), and Asia (Taiwan, China, India, Thailand, Vietnam, Myanmar, Indonesia, Laos, Philippines, Papua New Guinea).

Data collection and cleaning

Have an initial draft of your data cleaning appendix. Document every step that takes your raw data file(s) and turns it into the analysis-ready data set that you would submit with your final project. Include text narrative describing your data collection (downloading, scraping, surveys, etc) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc). Include code for data curation/cleaning, but not collection.

#load packages
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(skimr)
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
#import data, skip columns we are not using, and read score columns as numeric
coffee <- read_csv("data/coffee.csv", col_types = cols(Location.Region = col_skip(), 
     Location.Altitude.Min = col_skip(), Location.Altitude.Max = col_skip(), 
     Location.Altitude.Average = col_integer(), 
     Year = col_skip(), Data.Owner = col_skip(), 
     `Data.Production.Number of bags` = col_skip(), 
     `Data.Production.Bag weight` = col_skip(), 
     Data.Scores.Aroma = col_number(), Data.Scores.Flavor = col_number(), 
     Data.Scores.Aftertaste = col_number(), 
     Data.Scores.Acidity = col_number(), Data.Scores.Body = col_number(), 
     Data.Scores.Balance = col_number(), Data.Scores.Uniformity = col_number(), 
     Data.Scores.Sweetness = col_number(), 
     Data.Scores.Moisture = col_number(), 
     Data.Color = col_skip()), na = "NA")

#clean variable names
coffee <- coffee |>
  clean_names()

#count number of each country
coffee2 <- coffee |>
  filter(data_type_species == "Arabica") |>
  group_by(location_country) |>
  summarize("country types" = n())

#create vectors for each region
north_america <- c("United States")
central_america <- c("Costa Rica", "El Salvador", "Guatemala", "Honduras", "Mexico", 
                     "Nicaragua", "Panama", "Haiti")
south_america <- c("Brazil", "Colombia", "Ecuador", "Peru")
africa <- c("Burundi", "Ethiopia", "Kenya", "Malawi", "Rwanda", "Tanzania, United Republic Of", "Uganda", "Zambia", "Cote d?Ivoire")
asia <- c("Taiwan", "China", "India", "Thailand", "Vietnam", "Myanmar", "Indonesia", "Laos", "Philippines", "Papua New Guinea")

#create regions column; keep only Arabica samples with a nonzero total score
coffee_regions <- coffee |>
  mutate(
    region = case_when(
      location_country %in% north_america ~ "North America",
      location_country %in% central_america ~ "Central America",
      location_country %in% south_america ~ "South America",
      location_country %in% africa ~ "Africa",
      location_country %in% asia ~ "Asia",
      TRUE ~ NA_character_
    )
  ) |>
  relocate(region, .after = location_country) |>
  filter(data_type_species == "Arabica") |>
  filter(data_scores_total != 0)

#count number of observations for each region
#(coffee_regions already contains only Arabica samples)
coffee_reg <- coffee_regions |>
  group_by(region) |>
  summarize("region amt" = n())
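Because case_when() assigns NA to any country that is not in one of the region vectors, it may be worth confirming that no country was missed (for example, because of a spelling difference). The following is a minimal sketch on a small made-up tibble; `toy_coffee`, its values, and the shortened region vectors are illustrative assumptions, not the project data.

```r
library(dplyr)

# Hypothetical stand-in for the cleaned data (values are made up)
toy_coffee <- tibble(
  location_country = c("Brazil", "Kenya", "Japan", "Brazil"),
  data_scores_total = c(83.1, 84.5, 80.2, 82.0)
)

# Shortened region vectors, for illustration only
south_america <- c("Brazil", "Colombia", "Ecuador", "Peru")
africa <- c("Burundi", "Ethiopia", "Kenya")

toy_regions <- toy_coffee |>
  mutate(
    region = case_when(
      location_country %in% south_america ~ "South America",
      location_country %in% africa ~ "Africa",
      TRUE ~ NA_character_
    )
  )

# Countries that matched none of the region vectors
unassigned <- toy_regions |>
  filter(is.na(region)) |>
  count(location_country, name = "n_rows")

print(unassigned)
```

Running the same filter(is.na(region)) check on coffee_regions would list any countries that still need to be added to a region vector.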

Data description

This data set consists of Arabica coffee samples grouped into five geographic regions based on each sample's country of origin. Each sample received scores from 0 to 10 for attributes such as flavor and aroma. The data also include the average altitude of the growing location and the processing method of the beans. The original data set was published by @jthomasmock as part of the TidyTuesday project on GitHub and comes from the CORGIS Dataset Project by Sam Donald, created on July 6, 2020. The analysis-ready data set is called coffee_regions.

Data limitations

One potential problem with our data set is that the regions do not contain the same number of observations. For example, the North America region contains 63 observations, while Central America contains 494. We can address this by fitting models within each region, so that comparisons do not depend on the groups being the same size. Another potential problem is that the data contain NA values and scores of 0. To handle this, we can filter out the zero scores and drop the NAs.
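As a rough sketch of both remedies, the code below drops zero totals and missing scores, then fits a separate simple linear model of flavor on aroma within each region. The tibble `toy` and its values are made up for illustration, and flavor ~ aroma is only one possible model, not a choice we have settled on.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one NA aroma score and one all-zero row (values are made up)
toy <- tibble(
  region = rep(c("Africa", "Asia"), each = 6),
  data_scores_aroma  = c(7.5, 7.8, 8.0, 7.6, NA, 7.9, 7.2, 7.4, 7.6, 7.5, 7.3, 0),
  data_scores_flavor = c(7.4, 7.7, 8.1, 7.5, 7.6, 7.8, 7.1, 7.5, 7.6, 7.4, 7.2, 0),
  data_scores_total  = c(83, 84, 86, 83.5, 83, 85, 81, 82.5, 83, 82, 81.5, 0)
)

# Remove zero totals, then rows missing either score used in the model
toy_clean <- toy |>
  filter(data_scores_total != 0) |>
  drop_na(data_scores_aroma, data_scores_flavor)

# Fit one model per region and keep the aroma slope from each
slopes <- sapply(split(toy_clean, toy_clean$region), function(d) {
  fit <- lm(data_scores_flavor ~ data_scores_aroma, data = d)
  coef(fit)[["data_scores_aroma"]]
})

print(slopes)
```

Each slope is estimated only from that region's rows, so a small region produces a noisier estimate rather than being outweighed by a large one.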

Exploratory data analysis

Perform an (initial) exploratory data analysis.

#regions-score plot; all have similar median around 82
ggplot(data = coffee_regions, mapping = aes(x = region, y = data_scores_total, color = region)) +
  geom_boxplot() +
  labs(
    x = "Region",
    y = "Total Score",
    title = "Total Score vs Region") +
  theme_minimal()

#bar chart of processing methods, filled by region
ggplot(data = coffee_regions, aes(x = data_type_processing_method, fill = region)) +
  geom_bar() +
  labs(
    title = "Number of Coffee Samples for each Processing Method",
    x = "Processing Method",
    y = "Count"
  ) +
  theme_minimal()

#aroma-flavor scatterplot, positive linear relationship
ggplot(data = coffee_regions, mapping = 
         aes(x = data_scores_aroma, y = data_scores_flavor)) +
  geom_point() +
  labs(
    x = "Aroma Score",
    y = "Flavor Score",
    title = "Relationship between Aroma and Flavor") +
  theme_minimal()

#aroma-flavor scatterplot, faceted by region
ggplot(data = coffee_regions, mapping = 
         aes(x = data_scores_aroma, y = data_scores_flavor, color = region)) +
  geom_point() +
  labs(
    x = "Aroma Score",
    y = "Flavor Score",
    title = "Relationship between Aroma and Flavor") +
  theme_minimal() +
  facet_wrap(vars(region))

#aftertaste-flavor scatterplot, faceted by region
#theme_minimal() comes before theme() so it does not reset the legend position
coffee_regions |>
  ggplot(aes(x = data_scores_aftertaste, y = data_scores_flavor, color = region)) +
  geom_point() +
  labs(
    title = "Aftertaste vs Flavor Scores",
    x = "Aftertaste Score",
    y = "Flavor Score",
    color = "Region") +
  facet_wrap(vars(region)) +
  theme_minimal() +
  theme(legend.position = "bottom")

#scatter plot of altitude vs moisture score; no obvious correlation
#theme_minimal() comes before theme() so it does not reset the legend position
ggplot(data = coffee_regions, mapping = 
         aes(x = location_altitude_average, 
             y = data_scores_moisture, 
             color = region)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Altitude vs. Moisture", 
    x = "Altitude (m)", 
    y = "Moisture", 
    color = "Region") +
  scale_x_continuous(limits = c(0, 4000)) +
  theme_minimal() +
  theme(legend.position = "bottom")
Warning: Removed 5 rows containing missing values (`geom_point()`).
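
The visual impression of "no obvious correlation" between altitude and moisture could also be checked numerically. The sketch below computes a Pearson correlation within each region on a small made-up tibble; `toy_altitude` and its values are illustrative assumptions, not results from our data.

```r
library(dplyr)

# Hypothetical altitude and moisture values, including one missing altitude
toy_altitude <- tibble(
  region = rep(c("Africa", "Asia"), each = 4),
  location_altitude_average = c(1500, 1800, NA, 2000, 900, 1200, 1400, 1100),
  data_scores_moisture = c(0.10, 0.11, 0.12, 0.10, 0.00, 0.09, 0.11, 0.10)
)

altitude_cor <- toy_altitude |>
  group_by(region) |>
  summarize(
    # number of rows with both values present
    n_complete = sum(!is.na(location_altitude_average) &
                       !is.na(data_scores_moisture)),
    # Pearson correlation using only complete pairs
    r = cor(location_altitude_average, data_scores_moisture,
            use = "complete.obs"),
    .groups = "drop"
  )

print(altitude_cor)
```

Reporting n_complete next to r makes it clear how many samples each correlation rests on, which matters for the smaller regions.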

#summary statistics: mean and sd of each score, by region
coffee_regions |>
  pivot_longer(
    cols = contains("score"),
    names_to = "score",
    values_to = "value"
  ) |>
  group_by(region, score) |>
  summarize(mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE),
            .groups = 'drop') |>
  knitr::kable(caption = "Summary Statistics of Coffee Scores by Region")
Summary Statistics of Coffee Scores by Region

region           score                         mean  sd
Africa           data_scores_acidity      7.7098425  0.3386903
Africa           data_scores_aftertaste   7.5833071  0.3199208
Africa           data_scores_aroma        7.7534646  0.3188282
Africa           data_scores_balance      7.6620472  0.2917009
Africa           data_scores_body         7.6386614  0.2889454
Africa           data_scores_flavor       7.6885039  0.3599142
Africa           data_scores_moisture     0.1029134  0.0367555
Africa           data_scores_sweetness    9.9685039  0.1849413
Africa           data_scores_total       83.6285039  2.1200756
Africa           data_scores_uniformity   9.9631496  0.1745957
Asia             data_scores_acidity      7.5285897  0.3476665
Asia             data_scores_aftertaste   7.3896154  0.3224277
Asia             data_scores_aroma        7.5383333  0.3241229
Asia             data_scores_balance      7.4860256  0.3413660
Asia             data_scores_body         7.5070513  0.2584374
Asia             data_scores_flavor       7.5238462  0.3502957
Asia             data_scores_moisture     0.0855128  0.0584796
Asia             data_scores_sweetness    9.9057692  0.4123059
Asia             data_scores_total       82.2232051  2.1182163
Asia             data_scores_uniformity   9.8971795  0.2861382
Central America  data_scores_acidity      7.4761460  0.3015626
Central America  data_scores_aftertaste   7.2637323  0.3497105
Central America  data_scores_aroma        7.5009736  0.2895003
Central America  data_scores_balance      7.3950101  0.3511911
Central America  data_scores_body         7.4228803  0.2708490
Central America  data_scores_flavor       7.4234077  0.3431705
Central America  data_scores_moisture     0.1068763  0.0292240
Central America  data_scores_sweetness    9.8931034  0.5674435
Central America  data_scores_total       81.2885396  3.0930059
Central America  data_scores_uniformity   9.7983367  0.5574011
North America    data_scores_acidity      7.6298413  0.3420125
North America    data_scores_aftertaste   7.5192063  0.3951043
North America    data_scores_aroma        7.5674603  0.2940922
North America    data_scores_balance      7.6646032  0.3434682
North America    data_scores_body         7.6549206  0.2768362
North America    data_scores_flavor       7.6055556  0.4061176
North America    data_scores_moisture     0.0666667  0.0550073
North America    data_scores_sweetness    9.6504762  0.8280183
North America    data_scores_total       81.9742857  3.3994562
North America    data_scores_uniformity   9.4915873  0.8267937
South America    data_scores_acidity      7.5857286  0.2854455
South America    data_scores_aftertaste   7.5393970  0.2385597
South America    data_scores_aroma        7.6706030  0.3113211
South America    data_scores_balance      7.6435678  0.2507561
South America    data_scores_body         7.6188442  0.2734112
South America    data_scores_flavor       7.6225628  0.2333852
South America    data_scores_moisture     0.0717588  0.0537242
South America    data_scores_sweetness    9.9664824  0.2657269
South America    data_scores_total       83.0766332  1.7998780
South America    data_scores_uniformity   9.9128643  0.4089644

Questions for reviewers

List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.

  1. Do you have any suggestions as to which variables would be most worth exploring further?

    For example, should we examine the flavor score against all other scores?

  2. Should we restrict our analysis to the washed/wet processing method, since it accounts for the majority of the observations?

  3. Would it be beneficial to include altitude in our further analysis, even though it did not show a clear correlation with the moisture score?