An Exploration of Arabica Coffee and its Attributes
Appendix to report
Data cleaning
First, the data was downloaded from the CORGIS Dataset Project and added to the project in a folder called data, with the data file itself called coffee. On import, we removed the columns we were not using and changed all variables of type double to numeric. We then used the janitor package to clean the column names. Next, we filtered the data to include only Arabica beans, since Arabica made up the majority of the coffee in the survey and we wanted to examine only this type of bean. We also made a new data frame, coffee2, that counts the number of observations for each country, so that we could decide how to group the countries into regions. Based on these counts, we created a vector of countries for each region represented in the data frame: Africa, Asia, Central America, North America, and South America. We then created the variable region, using the case_when function to assign each country to the region whose vector contains it. Lastly, we filtered out observations with a total score of 0.
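The region grouping and the final filter described above can be sketched roughly as follows. This is a minimal illustration and not the report's actual code: the toy data frame, its score values, the partial region vectors, and the column name data_scores_total (what clean_names would presumably produce for the total score column) are all assumptions made for the example. North America is left out because its member countries are not listed here.

```r
library(dplyr)

# Toy stand-in for the cleaned Arabica data. The scores are made up for
# illustration, and the column name data_scores_total is an assumption.
coffee_toy <- data.frame(
  location_country = c("Brazil", "Ethiopia", "China", "Guatemala", "Colombia"),
  data_scores_total = c(82.5, 0, 84.0, 81.0, 83.25)
)

# One vector of countries per region (partial and illustrative only).
africa <- c("Burundi", "Ethiopia")
asia <- c("China")
central_america <- c("Costa Rica", "El Salvador", "Guatemala")
south_america <- c("Brazil", "Colombia", "Ecuador")

coffee_regions <- coffee_toy |>
  # Assign each country to the region whose vector contains it.
  mutate(
    region = case_when(
      location_country %in% africa ~ "Africa",
      location_country %in% asia ~ "Asia",
      location_country %in% central_america ~ "Central America",
      location_country %in% south_america ~ "South America",
      TRUE ~ NA_character_  # countries not covered by any vector
    )
  ) |>
  # Drop observations whose total score is 0.
  filter(data_scores_total != 0)
```

Keeping the country lists in named vectors, separate from the case_when call, means a country can be moved to another region by editing one vector and leaving the assignment logic alone.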
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
# import data, remove columns not using, change double to numeric
coffee <- read_csv(
  "data/coffee.csv",
  col_types = cols(
    Location.Region = col_skip(),
    Location.Altitude.Min = col_skip(),
    Location.Altitude.Max = col_skip(),
    Location.Altitude.Average = col_integer(),
    Year = col_skip(),
    Data.Owner = col_skip(),
    `Data.Production.Number of bags` = col_skip(),
    `Data.Production.Bag weight` = col_skip(),
    Data.Scores.Aroma = col_number(),
    Data.Scores.Flavor = col_number(),
    Data.Scores.Aftertaste = col_number(),
    Data.Scores.Acidity = col_number(),
    Data.Scores.Body = col_number(),
    Data.Scores.Balance = col_number(),
    Data.Scores.Uniformity = col_number(),
    Data.Scores.Sweetness = col_number(),
    Data.Scores.Moisture = col_number(),
    Data.Color = col_skip()
  ),
  na = "NA"
)

# clean variable names
coffee <- coffee |>
  clean_names()

# count number of each country
coffee2 <- coffee |>
  filter(data_type_species == "Arabica") |>
  group_by(location_country) |>
  summarize("country types" = n())

coffee2
# A tibble: 32 × 2
   location_country `country types`
   <chr>                      <int>
 1 Brazil                        61
 2 Burundi                        2
 3 China                         16
 4 Colombia                     128
 5 Costa Rica                    43
 6 Cote d?Ivoire                  1
 7 Ecuador                        1
 8 El Salvador                    6
 9 Ethiopia                      28
10 Guatemala                    168
# ℹ 22 more rows