An Exploration of Arabica Coffee and its Attributes

Exploratory data analysis

Research question(s)

How do Arabica beans from different regions differ in their aroma, flavor, and other quality scores? Does region affect the relationships between these scores?
- The regions are North America (United States), Central America (Costa Rica, El Salvador, Guatemala, Honduras, Mexico, Nicaragua, Panama, Haiti), South America (Brazil, Colombia, Ecuador, Peru), Africa (Burundi, Ethiopia, Kenya, Malawi, Rwanda, Tanzania, Uganda, Zambia, Côte d'Ivoire), and Asia (Taiwan, China, India, Thailand, Vietnam, Myanmar, Indonesia, Laos, Philippines, Papua New Guinea).

Data collection and cleaning

Have an initial draft of your data cleaning appendix. Document every step that takes your raw data file(s) and turns it into the analysis-ready data set that you would submit with your final project. Include text narrative describing your data collection (downloading, scraping, surveys, etc) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc). Include code for data curation/cleaning, but not collection.

#load packages
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(skimr)
library(janitor)


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

library(scales)


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

#import data, skip unused columns, parse score columns as numeric
coffee <- read_csv("data/coffee.csv", col_types = cols(Location.Region = col_skip(), 
     Location.Altitude.Min = col_skip(), Location.Altitude.Max = col_skip(), 
     Location.Altitude.Average = col_integer(), 
     Year = col_skip(), Data.Owner = col_skip(), 
     `Data.Production.Number of bags` = col_skip(), 
     `Data.Production.Bag weight` = col_skip(), 
     Data.Scores.Aroma = col_number(), Data.Scores.Flavor = col_number(), 
     Data.Scores.Aftertaste = col_number(), 
     Data.Scores.Acidity = col_number(), Data.Scores.Body = col_number(), 
     Data.Scores.Balance = col_number(), Data.Scores.Uniformity = col_number(), 
     Data.Scores.Sweetness = col_number(), 
     Data.Scores.Moisture = col_number(), 
     Data.Color = col_skip()), na = "NA")

#clean variable names
coffee <- coffee |>
  clean_names()

#count number of each country
coffee2 <- coffee |>
  filter(data_type_species == "Arabica") |>
  group_by(location_country) |>
  summarize("country types" = n())

#create vectors for each region
north_america <- c("United States")
central_america <- c("Costa Rica", "El Salvador", "Guatemala", "Honduras", "Mexico", 
                     "Nicaragua", "Panama", "Haiti")
south_america <- c("Brazil", "Colombia", "Ecuador", "Peru")
africa <- c("Burundi", "Ethiopia", "Kenya", "Malawi", "Rwanda", "Tanzania, United Republic Of", "Uganda", "Zambia", "Cote d?Ivoire")
asia <- c("Taiwan", "China", "India", "Thailand", "Vietnam", "Myanmar", "Indonesia", "Laos", "Philippines", "Papua New Guinea")

#create regions column
coffee_regions <- coffee |>
  mutate(
    region = case_when(
      location_country %in% north_america ~ "North America",
      location_country %in% central_america ~ "Central America",
      location_country %in% south_america ~ "South America",
      location_country %in% africa ~ "Africa",
      location_country %in% asia ~ "Asia",
      TRUE ~ NA_character_
    )
  ) |>
  relocate(region, .after = location_country) |>
  filter(data_type_species == "Arabica") |>
  filter(data_scores_total != 0)

#count number of observations for each region
#(coffee_regions already contains only Arabica rows)
coffee_reg <- coffee_regions |>
  group_by(region) |>
  summarize("region amt" = n())
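Because case_when() assigns NA to any country that is not listed in one of the region vectors, it may be worth checking which countries fall through. The following is a minimal sketch using abbreviated region vectors and a hypothetical set of country values (countries_in_data stands in for coffee$location_country and is not taken from the data):

```r
# Region vectors as in the cleaning code, abbreviated for illustration
south_america <- c("Brazil", "Colombia", "Ecuador", "Peru")
africa <- c("Burundi", "Ethiopia", "Kenya")
mapped <- c(south_america, africa)

# Hypothetical country values standing in for coffee$location_country
countries_in_data <- c("Brazil", "Kenya", "Japan", "Peru")

# Countries that would receive region = NA in case_when()
unmapped <- setdiff(unique(countries_in_data), mapped)
unmapped

# A country listed in two vectors would go to whichever case_when()
# branch comes first, so check for duplicates as well
duplicated_countries <- mapped[duplicated(mapped)]
duplicated_countries
```

With the real data, the same check could be run with unique(coffee$location_country) and all five region vectors combined.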

Data description

This data set consists of Arabica coffee samples sorted into five geographic regions according to the country the beans came from. Each sample was scored from 0 to 10 on attributes such as flavor and aroma, and the data also include a total score. There are also variables for the average altitude of the growing location and the processing method of the beans. The original data set was published by @jthomasmock in his contribution to the TidyTuesday project on GitHub and is from the CORGIS Dataset Project by Sam Donald, created on July 6, 2020. The analysis-ready data frame is called coffee_regions.

Data limitations

One potential problem with our data set is that the regions do not contain the same number of observations. For example, North America contains 63 observations while Central America contains 494. We can partly address this by fitting a separate model for each region, although estimates for the smaller regions will be less precise. Another potential problem is that the data contain NA values and total scores of 0. To address this, we filter out the zero scores and drop NA values where needed.
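As a rough illustration of the filtering described above, the sketch below uses a small made-up data frame (toy, with invented values) in place of coffee_regions. It drops zero totals and rows without a region, then reports the group sizes and the number of missing altitudes per region:

```r
library(dplyr)

# Toy stand-in for coffee_regions; all values are invented for illustration
toy <- tibble(
  region = c("Africa", "Africa", "Asia", "Asia", NA),
  data_scores_total = c(84.2, 0, 82.1, 81.5, 80.0),
  location_altitude_average = c(1800L, 1500L, NA, 1200L, 900L)
)

# Drop zero total scores and rows that were not assigned a region
toy_clean <- toy |>
  filter(data_scores_total != 0, !is.na(region))

# Group sizes and missing altitude values per region
region_summary <- toy_clean |>
  group_by(region) |>
  summarize(
    n = n(),
    missing_altitude = sum(is.na(location_altitude_average))
  )

region_summary
```

Keeping rows with a missing altitude and dropping them only in the analyses that use altitude avoids discarding score data unnecessarily; whether that is the right choice here is a judgment call.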

Exploratory data analysis

Perform an (initial) exploratory data analysis.

#regions-score plot; all have similar median around 82
ggplot(data = coffee_regions, mapping = aes(x = region, y = data_scores_total, color = region)) +
  geom_boxplot() +
  labs(
    x = "Region",
    y = "Total Score",
    title = "Total Score vs Region") +
  theme_minimal()

#bar chart of processing methods
ggplot(data = coffee_regions, aes(x = data_type_processing_method, fill = region)) +
  geom_bar() +
  labs(
    title = "Number of Coffee Samples for each Processing Method",
    x = "Processing Method",
    y = "Count"
    ) +
  theme_minimal()

#aroma-flavor scatterplot, positive linear relationship
ggplot(data = coffee_regions, mapping = 
         aes(x = data_scores_aroma, y = data_scores_flavor)) +
  geom_point() +
  labs(
    x = "Aroma Score",
    y = "Flavor Score",
    title = "Relationship between Aroma and Flavor") +
  theme_minimal()

#aroma-flavor scatterplot, faceted by region
ggplot(data = coffee_regions, mapping = 
         aes(x = data_scores_aroma, y = data_scores_flavor, color = region)) +
  geom_point() +
  labs(
    x = "Aroma Score",
    y = "Flavor Score",
    title = "Relationship between Aroma and Flavor") +
  theme_minimal() +
  facet_wrap(vars(region))
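One possible way to move from the faceted plot to a formal comparison is a linear model with an aroma-by-region interaction, where the interaction terms estimate how the aroma slope differs between regions. This is only a sketch of a possible later step, not part of the analysis above. It uses a small constructed data frame (toy) in which the two regions are given different slopes on purpose and no noise is added:

```r
# Constructed example: flavor is an exact linear function of aroma,
# with a different intercept and slope in each region
toy <- data.frame(
  region = rep(c("Africa", "Asia"), each = 4),
  aroma = rep(c(7.0, 7.5, 8.0, 8.5), times = 2)
)
toy$flavor <- ifelse(
  toy$region == "Africa",
  1 + 0.9 * toy$aroma,
  3 + 0.6 * toy$aroma
)

# aroma * region expands to aroma + region + aroma:region;
# Africa is the reference level because levels sort alphabetically
fit <- lm(flavor ~ aroma * region, data = toy)

# The aroma:regionAsia coefficient is the difference in slope
# between Asia and the reference region
coef(fit)
```

With the real data the call would presumably be lm(data_scores_flavor ~ data_scores_aroma * region, data = coffee_regions), and the interaction estimates would come with standard errors that reflect the unequal region sizes.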

# scatterplot aftertaste-flavor 
coffee_regions |>
  ggplot(aes(x = data_scores_aftertaste, y = data_scores_flavor, color = region)) +
  geom_point() +
  labs(
    title = "Aftertaste vs Flavor Scores",
    x = "Aftertaste Score",
    y = "Flavor Score",
    color = "Region") +
  facet_wrap(vars(region)) +
  #theme_minimal() goes first so it does not reset legend.position
  theme_minimal() +
  theme(legend.position = "bottom")

#scatter plot of altitude vs moisture score; no obvious correlation
ggplot(data = coffee_regions, mapping = 
         aes(x = location_altitude_average, 
             y = data_scores_moisture, 
             color = region)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Altitude vs. Moisture", 
    x = "Altitude (m)", 
    y = "Moisture", 
    color = "Region") +
  scale_x_continuous(limits = c(0,4000)) +
  #theme_minimal() goes first so it does not reset legend.position
  theme_minimal() +
  theme(legend.position = "bottom")

Warning: Removed 5 rows containing missing values (`geom_point()`).
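The visual impression of no obvious correlation could be backed up with a correlation coefficient per region. The sketch below shows the idea on a made-up data frame (toy, with invented altitude and moisture values), using use = "complete.obs" so that rows with a missing altitude are skipped rather than producing NA:

```r
library(dplyr)

# Toy stand-in for coffee_regions; all values are invented for illustration
toy <- tibble(
  region = rep(c("Africa", "Asia"), each = 4),
  altitude = c(1200, 1500, 1800, NA, 900, 1100, 1300, 1600),
  moisture = c(0.10, 0.11, 0.12, 0.10, 0.12, 0.11, 0.10, 0.09)
)

# Pearson correlation between altitude and moisture within each region,
# ignoring rows where either value is missing
cors <- toy |>
  group_by(region) |>
  summarize(r = cor(altitude, moisture, use = "complete.obs"))

cors
```

On the real data the columns would be location_altitude_average and data_scores_moisture. The plot above limits altitude to 0-4000, so applying the same limit before computing the correlation would keep the two views consistent.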

#summary statistics
coffee_regions |>
  pivot_longer(contains("score"), names_to = "score", values_to = "value") |>
  group_by(region, score) |>
  summarize(mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE),
            .groups = 'drop') |>
  knitr::kable(caption = "Summary Statistics of Coffee Scores by Region")

Summary Statistics of Coffee Scores by Region
region	score	mean	sd
Africa	data_scores_acidity	7.7098425	0.3386903
Africa	data_scores_aftertaste	7.5833071	0.3199208
Africa	data_scores_aroma	7.7534646	0.3188282
Africa	data_scores_balance	7.6620472	0.2917009
Africa	data_scores_body	7.6386614	0.2889454
Africa	data_scores_flavor	7.6885039	0.3599142
Africa	data_scores_moisture	0.1029134	0.0367555
Africa	data_scores_sweetness	9.9685039	0.1849413
Africa	data_scores_total	83.6285039	2.1200756
Africa	data_scores_uniformity	9.9631496	0.1745957
Asia	data_scores_acidity	7.5285897	0.3476665
Asia	data_scores_aftertaste	7.3896154	0.3224277
Asia	data_scores_aroma	7.5383333	0.3241229
Asia	data_scores_balance	7.4860256	0.3413660
Asia	data_scores_body	7.5070513	0.2584374
Asia	data_scores_flavor	7.5238462	0.3502957
Asia	data_scores_moisture	0.0855128	0.0584796
Asia	data_scores_sweetness	9.9057692	0.4123059
Asia	data_scores_total	82.2232051	2.1182163
Asia	data_scores_uniformity	9.8971795	0.2861382
Central America	data_scores_acidity	7.4761460	0.3015626
Central America	data_scores_aftertaste	7.2637323	0.3497105
Central America	data_scores_aroma	7.5009736	0.2895003
Central America	data_scores_balance	7.3950101	0.3511911
Central America	data_scores_body	7.4228803	0.2708490
Central America	data_scores_flavor	7.4234077	0.3431705
Central America	data_scores_moisture	0.1068763	0.0292240
Central America	data_scores_sweetness	9.8931034	0.5674435
Central America	data_scores_total	81.2885396	3.0930059
Central America	data_scores_uniformity	9.7983367	0.5574011
North America	data_scores_acidity	7.6298413	0.3420125
North America	data_scores_aftertaste	7.5192063	0.3951043
North America	data_scores_aroma	7.5674603	0.2940922
North America	data_scores_balance	7.6646032	0.3434682
North America	data_scores_body	7.6549206	0.2768362
North America	data_scores_flavor	7.6055556	0.4061176
North America	data_scores_moisture	0.0666667	0.0550073
North America	data_scores_sweetness	9.6504762	0.8280183
North America	data_scores_total	81.9742857	3.3994562
North America	data_scores_uniformity	9.4915873	0.8267937
South America	data_scores_acidity	7.5857286	0.2854455
South America	data_scores_aftertaste	7.5393970	0.2385597
South America	data_scores_aroma	7.6706030	0.3113211
South America	data_scores_balance	7.6435678	0.2507561
South America	data_scores_body	7.6188442	0.2734112
South America	data_scores_flavor	7.6225628	0.2333852
South America	data_scores_moisture	0.0717588	0.0537242
South America	data_scores_sweetness	9.9664824	0.2657269
South America	data_scores_total	83.0766332	1.7998780
South America	data_scores_uniformity	9.9128643	0.4089644

Questions for reviewers

List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.

Do you have any suggestions as to which variables would be best to explore further?

For example, should we examine the flavor score against all other scores?
Should we restrict our analysis to the washed/wet processing method, since it accounts for the majority of the observations?
Would it be beneficial to include altitude in our further analysis, even though it did not show an obvious correlation with the moisture score?