Project title

Appendix to report

Data cleaning

  1. We first created a dataset called coffee from the csv file we downloaded.
  2. We selected the columns needed for our research questions, plus a few others that we thought we might use later.
  3. We renamed the columns so that they are short and easy to identify.
  4. We removed observations with implausibly low scores (0).
  5. We used the is.na() function to check for missing values. There were none, so we did not need to impute any values.
  6. We used the duplicated() function to check for duplicate rows. There were none.
  7. We used the skim() function from the skimr package to generate summary statistics for each column, including the minimum, maximum, mean, standard deviation, and percentiles. We used these statistics to identify potential outliers.
  8. In particular, we filtered out the rows with an aroma score of 0.
  9. We added a column that groups the countries by continent.
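
The code that follows does not show the checks from steps 5 to 7, so here is a minimal sketch of how they might look. The toy data frame and its values are made up for illustration (including one deliberate duplicate row and one zero score); only is.na(), duplicated(), and skim() come from the steps above, and the skimr call is guarded in case the package is not installed.

```r
# Toy stand-in for the coffee data (values are made up for illustration)
toy <- data.frame(
  country     = c("Brazil", "Ethiopia", "Brazil", "Kenya"),
  aroma_score = c(8.17, 7.67, 8.17, 0),
  total_score = c(86.17, 85.08, 86.17, 0)
)

# Step 5: count missing values in each column
na_counts <- colSums(is.na(toy))

# Step 6: count rows that exactly repeat an earlier row
n_dupes <- sum(duplicated(toy))

# Steps 4 and 8: count the zero aroma scores, then filter them out
n_zero_aroma <- sum(toy$aroma_score == 0)
toy_clean <- toy[toy$aroma_score != 0, ]

# Step 7: summary statistics; use skim() if skimr is available,
# otherwise fall back to base R's summary()
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(toy_clean))
} else {
  print(summary(toy_clean))
}
```

On the real data, na_counts and n_dupes would both be expected to come out as zero, matching what steps 5 and 6 report.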
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.6     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
library(openintro)
Loading required package: airports
Loading required package: cherryblossom
Loading required package: usdata

Attaching package: 'openintro'

The following object is masked from 'package:modeldata':

    ames
library(skimr)
library(scales)

coffee <- read.csv("data/coffee.csv")

# Keep only the columns we need, renaming them in the same step
coffee <- coffee |>
  select(
    country = Location.Country,
    region = Location.Region,
    year = Year,
    species = Data.Type.Species,
    aroma_score = Data.Scores.Aroma,
    flavor_score = Data.Scores.Flavor,
    aftertaste_score = Data.Scores.Aftertaste,
    acidity_score = Data.Scores.Acidity,
    balance_score = Data.Scores.Balance,
    sweetness_score = Data.Scores.Sweetness,
    moisture_score = Data.Scores.Moisture,
    total_score = Data.Scores.Total
  ) |>
  # Drop rows with an aroma score of 0
  filter(aroma_score != 0)

# Possible research question: How do a coffee's aroma, acidity, and total scores depend on its country and region (continent) of origin?

glimpse(coffee)
Rows: 988
Columns: 12
$ country          <chr> "United States", "Brazil", "Brazil", "Ethiopia", "Eth…
$ region           <chr> "kona", "sul de minas - carmo de minas", "sul de mina…
$ year             <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,…
$ species          <chr> "Arabica", "Arabica", "Arabica", "Arabica", "Arabica"…
$ aroma_score      <dbl> 8.25, 8.17, 8.42, 7.67, 7.58, 7.50, 7.67, 7.25, 7.42,…
$ flavor_score     <dbl> 8.42, 7.92, 7.92, 8.00, 7.83, 7.92, 7.58, 7.25, 7.42,…
$ aftertaste_score <dbl> 8.08, 7.92, 8.00, 7.83, 7.58, 7.42, 7.50, 7.25, 7.50,…
$ acidity_score    <dbl> 7.75, 7.75, 7.75, 8.00, 8.00, 7.67, 7.58, 7.33, 7.92,…
$ balance_score    <dbl> 7.83, 8.00, 8.00, 7.83, 7.50, 7.58, 7.58, 8.00, 7.58,…
$ sweetness_score  <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.0…
$ moisture_score   <dbl> 0.00, 0.08, 0.01, 0.00, 0.10, 0.01, 0.00, 0.10, 0.05,…
$ total_score      <dbl> 86.25, 86.17, 86.17, 85.08, 83.83, 83.42, 83.08, 80.3…
Asia <- c("China", "India", "Indonesia", "Laos", "Philippines", "Taiwan", "Thailand", "Vietnam", "Papua New Guinea", "Myanmar")

North_America <- c("United States", "Mexico")

Central_America <- c("Haiti", "Honduras", "Nicaragua", "Panama", "Costa Rica", "El Salvador", "Guatemala")

South_America <- c("Brazil", "Colombia", "Ecuador", "Peru")

Africa <- c("Burundi", "Cote d?Ivoire", "Ethiopia", "Kenya", "Malawi", "Rwanda", "Tanzania, United Republic Of", "Uganda", "Zambia")

coffee_country <- coffee |>
  group_by(country) |>
  mutate(continent = case_when(
    country %in% Asia ~ "Asia",
    country %in% Africa ~ "Africa",
    country %in% North_America ~ "North America",
    country %in% Central_America ~ "Central America",
    country %in% South_America ~ "South America"
  ))
coffee_country
# A tibble: 988 × 13
# Groups:   country [32]
   country       region   year species aroma_score flavor_score aftertaste_score
   <chr>         <chr>   <int> <chr>         <dbl>        <dbl>            <dbl>
 1 United States kona     2010 Arabica        8.25         8.42             8.08
 2 Brazil        sul de…  2010 Arabica        8.17         7.92             7.92
 3 Brazil        sul de…  2010 Arabica        8.42         7.92             8   
 4 Ethiopia      sidamo   2010 Arabica        7.67         8                7.83
 5 Ethiopia      sidamo   2010 Arabica        7.58         7.83             7.58
 6 United States kona     2010 Arabica        7.5          7.92             7.42
 7 Indonesia     dolok …  2010 Arabica        7.67         7.58             7.5 
 8 Ethiopia      kelem …  2010 Arabica        7.25         7.25             7.25
 9 Ethiopia      limu     2010 Arabica        7.42         7.42             7.5 
10 Haiti         marmel…  2010 Arabica        6.92         6.75             7.08
# ℹ 978 more rows
# ℹ 6 more variables: acidity_score <dbl>, balance_score <dbl>,
#   sweetness_score <dbl>, moisture_score <dbl>, total_score <dbl>,
#   continent <chr>
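
Because case_when() returns NA for any country that matches none of the continent vectors, it can be worth checking that no country was left unmapped. The following is a minimal sketch of such a check, assuming dplyr is installed; the toy rows, the shortened continent vectors, and the placeholder country "Atlantis" are made up for illustration and are not part of the coffee data.

```r
library(dplyr)

# Toy stand-in for coffee (made-up rows); "Atlantis" is deliberately unmapped
toy <- tibble(
  country     = c("Brazil", "Ethiopia", "Mexico", "Atlantis"),
  aroma_score = c(8.17, 7.67, 7.50, 7.25)
)

# Shortened continent vectors, for illustration only
South_America <- c("Brazil", "Colombia", "Ecuador", "Peru")
Africa        <- c("Ethiopia", "Kenya")
North_America <- c("United States", "Mexico")

toy_country <- toy |>
  mutate(continent = case_when(
    country %in% South_America ~ "South America",
    country %in% Africa        ~ "Africa",
    country %in% North_America ~ "North America"
  ))

# Countries that matched none of the vectors have an NA continent
unmapped <- toy_country |>
  filter(is.na(continent)) |>
  distinct(country) |>
  pull(country)

unmapped
```

An empty result from the same check on the real data would indicate that every country was assigned a continent.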
write.csv(coffee, "data/new_coffee.csv")

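Finally, we saved the cleaned dataset as a new csv file, new_coffee.csv, which is an updated version of the original coffee.csv. Because the file is written from coffee rather than coffee_country, it does not contain the continent column.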
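
One detail of write.csv() is worth knowing when the file is read back in: by default it also writes the row names as an extra, unnamed first column. The sketch below uses made-up toy data and temporary files (not the project's actual paths) to show the difference that passing row.names = FALSE makes.

```r
# Toy data frame (values are made up for illustration)
toy <- data.frame(
  country     = c("Brazil", "Ethiopia"),
  aroma_score = c(8.17, 7.67)
)

# Temporary files so that nothing in data/ is touched
path_default <- tempfile(fileext = ".csv")
path_clean   <- tempfile(fileext = ".csv")

write.csv(toy, path_default)                   # default: row names are written
write.csv(toy, path_clean, row.names = FALSE)  # row names are omitted

# Compare the column names after reading each file back in
cols_default <- names(read.csv(path_default))
cols_clean   <- names(read.csv(path_clean))

cols_default
cols_clean
```

With the default settings, read.csv() names the extra column X, so the round-tripped data has one more column than the data frame that was written.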

Other appendices (as necessary)