An Exploration of Arabica Coffee and its Attributes

Appendix to report

Data cleaning

First, we downloaded the data from the CORGIS Dataset Project and saved it in the project's data folder as coffee.csv. On import, we dropped the columns we did not use and converted all variables of type double to numeric. We then used the janitor package to clean the column names. Because Arabica makes up the majority of the coffees in the survey, we restricted the analysis to Arabica beans. To decide how to group countries into regions, we created a second data frame, coffee2, that counts the number of Arabica coffees from each country. Based on these counts, we created a vector of countries for each region in the data frame: Africa, Asia, Central America, North America, and South America. We then created the variable region, using the case_when function to assign each country to the region whose vector contains it. Lastly, we filtered out observations with a total score of 0.

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(skimr)
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
# import data, skip unused columns, parse score columns as numeric
coffee <- read_csv("data/coffee.csv", col_types = cols(Location.Region = col_skip(), 
     Location.Altitude.Min = col_skip(), Location.Altitude.Max = col_skip(), 
     Location.Altitude.Average = col_integer(), 
     Year = col_skip(), Data.Owner = col_skip(), 
     `Data.Production.Number of bags` = col_skip(), 
     `Data.Production.Bag weight` = col_skip(), 
     Data.Scores.Aroma = col_number(), Data.Scores.Flavor = col_number(), 
     Data.Scores.Aftertaste = col_number(), 
     Data.Scores.Acidity = col_number(), Data.Scores.Body = col_number(), 
     Data.Scores.Balance = col_number(), Data.Scores.Uniformity = col_number(), 
     Data.Scores.Sweetness = col_number(), 
     Data.Scores.Moisture = col_number(), 
     Data.Color = col_skip()), na = "NA")

#clean variable names
coffee <- coffee |>
  clean_names()
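
To illustrate what the name-cleaning step does, here is a minimal sketch that applies janitor's vector-level helper, make_clean_names(), to three of the raw column names from the import call above. The vector raw_names is introduced only for this illustration and is not part of the analysis.

```r
library(janitor)

# Three raw column names taken from the read_csv() call above
# (raw_names is a hypothetical helper vector, used only here)
raw_names <- c(
  "Location.Altitude.Average",
  "Data.Scores.Aroma",
  "Data.Production.Number of bags"
)

# make_clean_names() is the character-vector counterpart of clean_names():
# it lower-cases the names and replaces dots and spaces with underscores
cleaned <- make_clean_names(raw_names)
cleaned
```

The first two cleaned names match the snake_case names that appear in the skim() summary at the end of this appendix.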

#count number of each country
coffee2 <- coffee |>
  filter(data_type_species == "Arabica") |>
  group_by(location_country) |>
  summarize("country types" = n())

coffee2
# A tibble: 32 × 2
   location_country `country types`
   <chr>                      <int>
 1 Brazil                        61
 2 Burundi                        2
 3 China                         16
 4 Colombia                     128
 5 Costa Rica                    43
 6 Cote d?Ivoire                  1
 7 Ecuador                        1
 8 El Salvador                    6
 9 Ethiopia                      28
10 Guatemala                    168
# ℹ 22 more rows
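
As an aside, the same per-country tally could be written more compactly with dplyr's count(). The sketch below runs on a small made-up table (toy), not on the coffee data, and the column name n_samples is our own choice for the illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the cleaned coffee table
toy <- tibble(
  location_country  = c("Brazil", "Brazil", "Kenya", "Kenya", "Kenya", "Vietnam"),
  data_type_species = c("Arabica", "Arabica", "Arabica", "Arabica", "Arabica", "Robusta")
)

# count() combines group_by() and summarize(n()); sort = TRUE lists the
# largest groups first, and name = sets a syntactic column name
country_counts <- toy |>
  filter(data_type_species == "Arabica") |>
  count(location_country, name = "n_samples", sort = TRUE)

country_counts
```

A syntactic column name such as n_samples avoids the backticks that a name containing a space requires.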
#create vectors for each region
north_america <- c("United States")
central_america <- c("Costa Rica", "El Salvador", "Guatemala", "Honduras", "Mexico", 
                     "Nicaragua", "Panama", "Haiti")
south_america <- c("Brazil", "Colombia", "Ecuador", "Peru")
africa <- c("Burundi", "Ethiopia", "Kenya", "Malawi", "Rwanda", 
            "Tanzania, United Republic Of", "Uganda", "Zambia", "Cote d?Ivoire")
asia <- c("Taiwan", "China", "India", "Thailand", "Vietnam", "Myanmar", 
          "Indonesia", "Laos", "Philippines", "Papua New Guinea")

#create regions column
coffee_regions <- coffee |>
  mutate(
    region = case_when(
      location_country %in% north_america ~ "North America",
      location_country %in% central_america ~ "Central America",
      location_country %in% south_america ~ "South America",
      location_country %in% africa ~ "Africa",
      location_country %in% asia ~ "Asia",
      TRUE ~ NA_character_
    )
  ) |>
  relocate(region, .after = location_country) |>
  filter(data_type_species == "Arabica") |>
  filter(data_scores_total != 0)
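
Because the final case_when() branch assigns NA to any country that appears in none of the region vectors, one way to check the grouping is to list the countries whose region is missing. The sketch below shows the idea on made-up data; the table toy, the shortened region vectors, and the country "Atlantis" are all hypothetical.

```r
library(dplyr)

# Shortened, hypothetical region vectors for illustration only
africa <- c("Ethiopia", "Kenya")
asia   <- c("China", "India")

# "Atlantis" is a made-up country that belongs to no region vector
toy <- tibble(location_country = c("Kenya", "India", "Atlantis"))

toy_regions <- toy |>
  mutate(
    region = case_when(
      location_country %in% africa ~ "Africa",
      location_country %in% asia   ~ "Asia",
      TRUE ~ NA_character_
    )
  )

# Countries that fell through to the NA branch
unmatched <- toy_regions |>
  filter(is.na(region)) |>
  distinct(location_country)

unmatched
```

In our data no country falls through: the skim() summary reports n_missing = 0 for region.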

skim(coffee_regions)
Data summary
Name                     coffee_regions
Number of rows           960
Number of columns        16
_______________________
Column type frequency:
  character              5
  numeric                11
________________________
Group variables          None

Variable type: character

skim_variable                n_missing  complete_rate  min  max  empty  n_unique  whitespace
location_country                     0              1    4   28      0        32           0
region                               0              1    4   15      0         5           0
data_type_species                    0              1    7    7      0         1           0
data_type_variety                    0              1    3   21      0        28           0
data_type_processing_method          0              1    3   25      0         6           0

Variable type: numeric

skim_variable              n_missing  complete_rate     mean       sd     p0     p25      p50      p75       p100  hist
location_altitude_average          0              1  1670.75  9328.49   0.00  950.00  1300.00  1600.00  190164.00  ▇▁▁▁▁
data_scores_aroma                  0              1     7.58     0.32   5.08    7.42     7.58     7.75       8.75  ▁▁▂▇▁
data_scores_flavor                 0              1     7.52     0.35   6.08    7.33     7.50     7.75       8.83  ▁▂▇▃▁
data_scores_aftertaste             0              1     7.39     0.35   6.17    7.17     7.42     7.58       8.67  ▁▃▇▂▁
data_scores_acidity                0              1     7.54     0.32   5.25    7.33     7.58     7.75       8.75  ▁▁▃▇▁
data_scores_body                   0              1     7.51     0.29   5.25    7.33     7.50     7.67       8.50  ▁▁▁▇▁
data_scores_balance                0              1     7.51     0.35   6.08    7.33     7.50     7.75       8.58  ▁▂▇▅▁
data_scores_uniformity             0              1     9.83     0.51   6.00   10.00    10.00    10.00      10.00  ▁▁▁▁▇
data_scores_sweetness              0              1     9.90     0.50   1.33   10.00    10.00    10.00      10.00  ▁▁▁▁▇
data_scores_moisture               0              1     0.09     0.04   0.00    0.10     0.11     0.12       0.28  ▃▇▆▁▁
data_scores_total                  0              1    82.09     2.85  59.83   81.08    82.50    83.67      90.58  ▁▁▁▇▁