Project proposal

Author

Dank-Bop (Abigail Grizancic, Ava Zhang)

library(tidyverse)
#install.packages("skimr")

#renv::restore()

# One-time setup: download the survey from TidyTuesday and cache it as an .rds file
#coffee_survey <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-05-14/coffee_survey.csv')
#write_rds(coffee_survey, "data/coffee-survey.rds")

# Load the cached copy of the survey
coffee_survey <- read_rds("data/coffee-survey.rds")

Dataset

A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.

Make sure to load the data and use inline code for some of this information.

This dataset is the “Great American Coffee Taste Test,” which comes from a YouTube livestream hosted by Cometeer, a coffee company, and James Hoffmann, a renowned barista. In total, 4042 people filled out a survey about their coffee habits and their thoughts on the 4 Cometeer coffees offered at the tasting. The dataset has 57 columns, covering personal questions as well as questions about the coffees tasted. Of these columns, 44 are character columns and 13 are numeric columns. The summary below gives more information about each column, including the number of missing entries and, for the numeric columns, a simplified distribution. We chose this dataset because, as college students, we drink a lot of coffee and would like to investigate trends in other coffee drinkers’ preferences so that we can drink better coffee ourselves.
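The row and column counts quoted above can be computed rather than typed by hand. The following is a minimal sketch on a toy stand-in for coffee_survey (the toy values, including the age labels, are assumptions for illustration); the same calls would apply to the real data frame, and the resulting objects could then be referenced with inline code such as `r n_rows`.

```r
# Toy stand-in for coffee_survey (hypothetical values, 3 rows x 3 columns).
toy_survey <- data.frame(
  submission_id = c("a1", "b2", "c3"),
  age = c("25-34 years old", NA, "35-44 years old"),
  expertise = c(5, 7, NA),
  stringsAsFactors = FALSE
)

# Dimensions of the data frame.
n_rows <- nrow(toy_survey)
n_cols <- ncol(toy_survey)

# Number of character and numeric columns.
n_chr <- sum(vapply(toy_survey, is.character, logical(1)))
n_num <- sum(vapply(toy_survey, is.numeric, logical(1)))
```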

In the introduction to the report, we will further explore the demographics of the taste testers in this dataset to see how representative they are of the U.S. as a whole. This will keep us from generalizing our findings to the wider population if the sample is not representative.
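One way to start the representativeness check might be to compute the share of respondents in each demographic group, which could later be compared against external population figures. This is only a sketch on toy data: the age labels are assumed, and missing answers are dropped before computing shares, which is a choice we would need to justify in the report.

```r
library(dplyr)

# Toy stand-in for the age column of coffee_survey (hypothetical values).
toy_survey <- data.frame(
  age = c("25-34 years old", "25-34 years old", "35-44 years old", NA)
)

# Share of respondents in each age group, ignoring missing answers.
age_shares <- toy_survey |>
  filter(!is.na(age)) |>
  count(age, name = "n") |>
  mutate(share = n / sum(n))
```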

skimr::skim(coffee_survey)
Data summary
Name coffee_survey
Number of rows 4042
Number of columns 57
_______________________
Column type frequency:
character 44
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
submission_id 0 1.00 6 6 0 4042 0
age 31 0.99 13 15 0 7 0
cups 93 0.98 1 11 0 6 0
where_drink 70 0.98 7 44 0 65 0
brew 385 0.90 5 165 0 449 0
brew_other 3364 0.17 2 319 0 160 0
purchase 3332 0.18 5 107 0 89 0
purchase_other 4011 0.01 4 83 0 26 0
favorite 62 0.98 5 32 0 12 0
favorite_specify 3926 0.03 3 92 0 78 0
additions 83 0.98 5 100 0 53 0
additions_other 3994 0.01 3 140 0 42 0
dairy 2356 0.42 8 110 0 175 0
sweetener 3530 0.13 5 99 0 82 0
style 84 0.98 4 11 0 12 0
strength 126 0.97 4 15 0 5 0
roast_level 102 0.97 4 7 0 7 0
caffeine 125 0.97 5 13 0 3 0
coffee_a_notes 1464 0.64 1 377 0 2317 0
coffee_b_notes 1586 0.61 1 980 0 2199 0
coffee_c_notes 1659 0.59 1 438 0 2163 0
coffee_d_notes 1454 0.64 1 528 0 2354 0
prefer_abc 270 0.93 8 8 0 3 0
prefer_ad 281 0.93 8 8 0 2 0
prefer_overall 272 0.93 8 8 0 4 0
wfh 518 0.87 18 26 0 3 0
total_spend 531 0.87 4 8 0 6 0
why_drink 474 0.88 5 93 0 84 0
why_drink_other 3875 0.04 2 195 0 163 0
taste 479 0.88 2 3 0 2 0
know_source 483 0.88 2 3 0 2 0
most_paid 515 0.87 5 13 0 8 0
most_willing 532 0.87 5 13 0 8 0
value_cafe 542 0.87 2 3 0 2 0
spent_equipment 536 0.87 7 16 0 7 0
value_equipment 548 0.86 2 3 0 2 0
gender 519 0.87 4 22 0 5 0
gender_specify 4030 0.00 2 28 0 11 0
education_level 604 0.85 15 34 0 6 0
ethnicity_race 624 0.85 15 29 0 6 0
ethnicity_race_specify 3937 0.03 2 53 0 82 0
employment_status 623 0.85 7 18 0 6 0
number_children 636 0.84 1 11 0 5 0
political_affiliation 753 0.81 8 14 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
expertise 104 0.97 5.69 1.95 1 5 6 7 10 ▂▃▇▇▁
coffee_a_bitterness 244 0.94 2.14 0.95 1 1 2 3 5 ▅▇▃▂▁
coffee_a_acidity 263 0.93 3.63 0.98 1 3 4 4 5 ▁▂▅▇▃
coffee_a_personal_preference 253 0.94 3.31 1.19 1 2 3 4 5 ▂▅▆▇▅
coffee_b_bitterness 262 0.94 3.01 0.99 1 2 3 4 5 ▂▅▇▆▁
coffee_b_acidity 275 0.93 2.22 0.87 1 2 2 3 5 ▃▇▅▁▁
coffee_b_personal_preference 269 0.93 3.07 1.11 1 2 3 4 5 ▂▆▇▆▃
coffee_c_bitterness 278 0.93 3.07 1.00 1 2 3 4 5 ▁▅▇▆▂
coffee_c_acidity 291 0.93 2.37 0.92 1 2 2 3 5 ▃▇▆▂▁
coffee_c_personal_preference 276 0.93 3.06 1.13 1 2 3 4 5 ▂▆▇▆▃
coffee_d_bitterness 275 0.93 2.16 1.08 1 1 2 3 5 ▇▇▅▂▁
coffee_d_acidity 277 0.93 3.86 1.01 1 3 4 5 5 ▁▂▃▇▆
coffee_d_personal_preference 278 0.93 3.38 1.45 1 2 4 5 5 ▅▃▃▆▇

Questions

The two questions you want to answer.

  1. How does favorite coffee genre (style, additions, sweetener, etc.) vary across demographic groups (e.g., age, gender, and employment status)?
  2. Does work style (work from home vs. hybrid vs. on-site) affect coffee consumption habits (amount consumed, willingness to brew at home)?

Analysis plan

A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

Q1 Variables Involved:

  • Dependent (Coffee Preferences): favorite, style, additions, sweetener, dairy, roast_level, strength, caffeine

  • Independent (Demographics): age, gender, employment_status, ethnicity_race

  • Variables to be Created:

    • Coffee Genre Clustering: categorical clusters of coffee preferences based on style, additions, sweetener, dairy, and roast_level. For example: Classic Black (No additions, dark roast); Sweetened Coffee (Uses sugar/syrup and milk); Specialty Coffee Lovers (Preference for espresso-based drinks)
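The genre variable described above could be built with rule-based recoding. The sketch below uses dplyr::case_when on toy data; the answer labels ("No - just black", "Sugar or sweetener", and so on), the list of espresso-based drinks, and the "Other" fallback are all assumptions for illustration and would need to be checked against the actual values in coffee_survey.

```r
library(dplyr)

# Toy stand-in for coffee_survey (hypothetical answer labels).
toy_survey <- data.frame(
  favorite = c("Pourover", "Latte", "Regular drip coffee", "Espresso", NA),
  additions = c(
    "No - just black",
    "Milk, dairy alternative, or coffee creamer, Sugar or sweetener",
    "Sugar or sweetener",
    "No - just black",
    NA
  ),
  roast_level = c("Dark", "Medium", "Light", "Medium", NA)
)

# Assumed set of espresso-based drinks.
espresso_drinks <- c("Espresso", "Latte", "Cappuccino", "Cortado", "Americano", "Mocha")

# Rules are checked in order; the first one that matches wins.
toy_survey <- toy_survey |>
  mutate(
    coffee_genre = case_when(
      is.na(favorite) & is.na(additions) & is.na(roast_level) ~ NA_character_,
      additions == "No - just black" & roast_level == "Dark" ~ "Classic Black",
      grepl("sugar|syrup", additions, ignore.case = TRUE) &
        grepl("milk", additions, ignore.case = TRUE) ~ "Sweetened Coffee",
      favorite %in% espresso_drinks ~ "Specialty Coffee Lovers",
      TRUE ~ "Other"
    )
  )
```

Because the rules are ordered, a respondent who matches several descriptions lands in the first matching genre, so the order itself is a design decision.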

Our plan is to calculate the frequency of different coffee preferences across demographics by generating summary statistics for each coffee preference variable and visualizing them with stacked bar charts. We will also compare coffee preferences (e.g., style, roast_level) across age groups and genders. To avoid overloading the graphs, we will use the clustered coffee genres to represent combinations of coffee preferences instead of looking at each preference separately. We may also facet the plots to better show differences between groups, if needed.
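As a rough sketch of this step, the code below counts genres within each age group, converts the counts to within-group proportions, and draws a stacked bar chart. The data frame toy_genres and its coffee_genre column are hypothetical stand-ins for the recoded survey data.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the survey after genres have been assigned (hypothetical values).
toy_genres <- data.frame(
  age = c(rep("18-24 years old", 4), rep("25-34 years old", 4)),
  coffee_genre = c(
    "Classic Black", "Sweetened Coffee", "Sweetened Coffee", "Other",
    "Classic Black", "Classic Black", "Classic Black", "Other"
  )
)

# Proportion of each genre within each age group.
genre_by_age <- toy_genres |>
  count(age, coffee_genre, name = "n") |>
  group_by(age) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# Stacked bar chart: one bar per age group, segments by genre.
genre_plot <- ggplot(genre_by_age, aes(x = age, y = prop, fill = coffee_genre)) +
  geom_col() +
  labs(x = "Age group", y = "Share of respondents", fill = "Coffee genre")
```

Plotting proportions rather than raw counts keeps age groups of different sizes comparable; adding facet_wrap() on a second demographic variable would be one way to do the faceting mentioned above.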

Q2 Variables Involved:

  • Dependent (Coffee Consumption Habits): cups, brew, purchase, total_spend, spent_equipment

  • Independent (Work Style): wfh

  • Variables to be Created:

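    • Binarized Coffee Source Categories: Brew at Home vs. Purchase Habit. This groups respondents into Home Brewers (primarily brew coffee at home) and Buyers (primarily purchase coffee outside).

    • Spending Habits Per Cup: average coffee spending per cup (total_spend / cups_per_month). Both total_spend and cups are stored as character columns, so they need to be converted to numeric values first, and cups_per_month would be derived from cups. We also plan to calculate average cup consumption per work style category.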

    • Compute spending habits (total_spend, spent_equipment) by work style.
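The derived variables in this list could be sketched as follows. Everything about the answer labels here is an assumption: that where_drink contains "At home" for home brewers, that cups is reported per day in the categories shown, that total_spend is a monthly bracket, that each bracket can be replaced by a midpoint (with 110 as an arbitrary value for the open-ended top bracket), and that a month has 30 days.

```r
library(dplyr)

# Toy stand-in for coffee_survey (hypothetical answer labels).
toy_survey <- data.frame(
  wfh = c(
    "I primarily work from home", "I primarily work in person",
    "I do a mix of both", "I primarily work from home"
  ),
  where_drink = c("At home", "At a cafe, On the go", "At home, At the office", NA),
  cups = c("2", "1", "More than 4", "Less than 1"),
  total_spend = c("$20-$40", "$40-$60", ">$100", "<$20")
)

# Assumed numeric values for each answer category.
cups_per_day <- c("Less than 1" = 0.5, "1" = 1, "2" = 2, "3" = 3, "4" = 4, "More than 4" = 5)
spend_midpoint <- c(
  "<$20" = 10, "$20-$40" = 30, "$40-$60" = 50,
  "$60-$80" = 70, "$80-$100" = 90, ">$100" = 110
)

toy_survey <- toy_survey |>
  mutate(
    # Home Brewer if "At home" is among the places listed, otherwise Buyer.
    coffee_source = case_when(
      is.na(where_drink) ~ NA_character_,
      grepl("At home", where_drink, fixed = TRUE) ~ "Home Brewer",
      TRUE ~ "Buyer"
    ),
    cups_per_month = unname(cups_per_day[cups]) * 30,
    spend_per_cup = unname(spend_midpoint[total_spend]) / cups_per_month
  )

# Average monthly cup consumption per work style category.
cups_by_workstyle <- toy_survey |>
  group_by(wfh) |>
  summarise(mean_cups_per_month = mean(cups_per_month, na.rm = TRUE), .groups = "drop")
```

Since both inputs are brackets, spend_per_cup is only a coarse estimate, and the results may be sensitive to the values chosen for the open-ended categories.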

We will investigate how much people spend on coffee, both at cafes and on home equipment, based on whether they work from home, in person, or a mix of both. We will visualize this with bar charts or violin plots showing the distributions of total_spend (amount spent on coffee) and spent_equipment by work style.