library(tidyverse)
#install.packages("skimr")
#renv::restore()
#saves to data file
#coffee_survey <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-05-14/coffee_survey.csv')
#write_rds(coffee_survey, "data/coffee-survey.rds")
coffee_survey <- read_rds("data/coffee-survey.rds")
Project proposal
Dataset
A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.
Make sure to load the data and use inline code for some of this information.
This dataset is the “Great American Coffee Taste Test,” which comes from a YouTube livestream hosted by Cometeer, a coffee company, and James Hoffmann, a renowned barista. In total, 4042 people filled out a survey about their coffee habits and their thoughts on the 4 Cometeer coffees offered at the tasting. The dataset has 57 columns, covering personal questions as well as questions about the coffees tasted. Of these columns, 44 are character columns and 13 are numeric columns. Below you can see more information about the columns, including the number of missing entries and a simplified distribution. We chose this dataset because, as college students, we drink a lot of coffee and would like to investigate trends in other coffee drinkers’ preferences so that we can drink better coffee ourselves.
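The counts quoted above can be computed from the data rather than typed by hand, and inline code can then refer to the results. The following is a minimal sketch of that idea. It uses a small made-up tibble in place of `coffee_survey` so that it runs on its own; the toy values and the names `toy_survey`, `n_rows`, `n_cols`, `n_chr`, and `n_num` are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for coffee_survey (assumption: the real object is loaded with read_rds()).
toy_survey <- tibble(
  submission_id = c("a1", "b2", "c3"),
  age           = c("25-34 years old", "35-44 years old", NA),
  expertise     = c(5, 7, 6)
)

# Dimensions of the data.
n_rows <- nrow(toy_survey)
n_cols <- ncol(toy_survey)

# Number of columns of each type, mirroring the "Column type frequency" summary.
n_chr <- sum(sapply(toy_survey, is.character))
n_num <- sum(sapply(toy_survey, is.numeric))
```

With the real data, the same four expressions applied to `coffee_survey` should reproduce the row, column, character, and numeric counts reported in the text.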
In the introduction to the report, we will further explore the demographics of the taste testers in this dataset to see how representative they are of the U.S. as a whole. This will help us avoid generalizing our findings to the wider population if the sample is not representative.
skimr::skim(coffee_survey)
Name | coffee_survey |
Number of rows | 4042 |
Number of columns | 57 |
_______________________ | |
Column type frequency: | |
character | 44 |
numeric | 13 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
submission_id | 0 | 1.00 | 6 | 6 | 0 | 4042 | 0 |
age | 31 | 0.99 | 13 | 15 | 0 | 7 | 0 |
cups | 93 | 0.98 | 1 | 11 | 0 | 6 | 0 |
where_drink | 70 | 0.98 | 7 | 44 | 0 | 65 | 0 |
brew | 385 | 0.90 | 5 | 165 | 0 | 449 | 0 |
brew_other | 3364 | 0.17 | 2 | 319 | 0 | 160 | 0 |
purchase | 3332 | 0.18 | 5 | 107 | 0 | 89 | 0 |
purchase_other | 4011 | 0.01 | 4 | 83 | 0 | 26 | 0 |
favorite | 62 | 0.98 | 5 | 32 | 0 | 12 | 0 |
favorite_specify | 3926 | 0.03 | 3 | 92 | 0 | 78 | 0 |
additions | 83 | 0.98 | 5 | 100 | 0 | 53 | 0 |
additions_other | 3994 | 0.01 | 3 | 140 | 0 | 42 | 0 |
dairy | 2356 | 0.42 | 8 | 110 | 0 | 175 | 0 |
sweetener | 3530 | 0.13 | 5 | 99 | 0 | 82 | 0 |
style | 84 | 0.98 | 4 | 11 | 0 | 12 | 0 |
strength | 126 | 0.97 | 4 | 15 | 0 | 5 | 0 |
roast_level | 102 | 0.97 | 4 | 7 | 0 | 7 | 0 |
caffeine | 125 | 0.97 | 5 | 13 | 0 | 3 | 0 |
coffee_a_notes | 1464 | 0.64 | 1 | 377 | 0 | 2317 | 0 |
coffee_b_notes | 1586 | 0.61 | 1 | 980 | 0 | 2199 | 0 |
coffee_c_notes | 1659 | 0.59 | 1 | 438 | 0 | 2163 | 0 |
coffee_d_notes | 1454 | 0.64 | 1 | 528 | 0 | 2354 | 0 |
prefer_abc | 270 | 0.93 | 8 | 8 | 0 | 3 | 0 |
prefer_ad | 281 | 0.93 | 8 | 8 | 0 | 2 | 0 |
prefer_overall | 272 | 0.93 | 8 | 8 | 0 | 4 | 0 |
wfh | 518 | 0.87 | 18 | 26 | 0 | 3 | 0 |
total_spend | 531 | 0.87 | 4 | 8 | 0 | 6 | 0 |
why_drink | 474 | 0.88 | 5 | 93 | 0 | 84 | 0 |
why_drink_other | 3875 | 0.04 | 2 | 195 | 0 | 163 | 0 |
taste | 479 | 0.88 | 2 | 3 | 0 | 2 | 0 |
know_source | 483 | 0.88 | 2 | 3 | 0 | 2 | 0 |
most_paid | 515 | 0.87 | 5 | 13 | 0 | 8 | 0 |
most_willing | 532 | 0.87 | 5 | 13 | 0 | 8 | 0 |
value_cafe | 542 | 0.87 | 2 | 3 | 0 | 2 | 0 |
spent_equipment | 536 | 0.87 | 7 | 16 | 0 | 7 | 0 |
value_equipment | 548 | 0.86 | 2 | 3 | 0 | 2 | 0 |
gender | 519 | 0.87 | 4 | 22 | 0 | 5 | 0 |
gender_specify | 4030 | 0.00 | 2 | 28 | 0 | 11 | 0 |
education_level | 604 | 0.85 | 15 | 34 | 0 | 6 | 0 |
ethnicity_race | 624 | 0.85 | 15 | 29 | 0 | 6 | 0 |
ethnicity_race_specify | 3937 | 0.03 | 2 | 53 | 0 | 82 | 0 |
employment_status | 623 | 0.85 | 7 | 18 | 0 | 6 | 0 |
number_children | 636 | 0.84 | 1 | 11 | 0 | 5 | 0 |
political_affiliation | 753 | 0.81 | 8 | 14 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
expertise | 104 | 0.97 | 5.69 | 1.95 | 1 | 5 | 6 | 7 | 10 | ▂▃▇▇▁ |
coffee_a_bitterness | 244 | 0.94 | 2.14 | 0.95 | 1 | 1 | 2 | 3 | 5 | ▅▇▃▂▁ |
coffee_a_acidity | 263 | 0.93 | 3.63 | 0.98 | 1 | 3 | 4 | 4 | 5 | ▁▂▅▇▃ |
coffee_a_personal_preference | 253 | 0.94 | 3.31 | 1.19 | 1 | 2 | 3 | 4 | 5 | ▂▅▆▇▅ |
coffee_b_bitterness | 262 | 0.94 | 3.01 | 0.99 | 1 | 2 | 3 | 4 | 5 | ▂▅▇▆▁ |
coffee_b_acidity | 275 | 0.93 | 2.22 | 0.87 | 1 | 2 | 2 | 3 | 5 | ▃▇▅▁▁ |
coffee_b_personal_preference | 269 | 0.93 | 3.07 | 1.11 | 1 | 2 | 3 | 4 | 5 | ▂▆▇▆▃ |
coffee_c_bitterness | 278 | 0.93 | 3.07 | 1.00 | 1 | 2 | 3 | 4 | 5 | ▁▅▇▆▂ |
coffee_c_acidity | 291 | 0.93 | 2.37 | 0.92 | 1 | 2 | 2 | 3 | 5 | ▃▇▆▂▁ |
coffee_c_personal_preference | 276 | 0.93 | 3.06 | 1.13 | 1 | 2 | 3 | 4 | 5 | ▂▆▇▆▃ |
coffee_d_bitterness | 275 | 0.93 | 2.16 | 1.08 | 1 | 1 | 2 | 3 | 5 | ▇▇▅▂▁ |
coffee_d_acidity | 277 | 0.93 | 3.86 | 1.01 | 1 | 3 | 4 | 5 | 5 | ▁▂▃▇▆ |
coffee_d_personal_preference | 278 | 0.93 | 3.38 | 1.45 | 1 | 2 | 4 | 5 | 5 | ▅▃▃▆▇ |
Questions
The two questions you want to answer.
- How does favorite coffee genre (style, additions, sweetener, etc.) vary across demographic groups (e.g. age, gender, and employment status)?
- Does work style (work from home vs. hybrid vs. on-site) affect coffee consumption habits (amount consumed, willingness to brew at home)?
Analysis plan
A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).
Q1 Variables Involved:
Dependent (Coffee Preferences): favorite, style, additions, sweetener, dairy, roast_level, strength, caffeine
Independent (Demographics): age, gender, employment_status, ethnicity_race
Variables to be Created:
- Coffee Genre Clustering: categorical clusters of coffee preferences based on style, additions, sweetener, dairy, and roast_level. For example: Classic Black (No additions, dark roast); Sweetened Coffee (Uses sugar/syrup and milk); Specialty Coffee Lovers (Preference for espresso-based drinks)
Our plan is to calculate the frequency of different coffee preferences across demographics by generating summary statistics for each coffee preference variable and visualizing them with stacked bar charts. We will also compare coffee preferences (e.g., style, roast_level) across age groups and genders. To avoid overloading the graphs, we will use the clustered coffee genres to represent combinations of coffee preferences instead of looking at each preference separately. We may also use faceting to better show differences between groups, if needed.
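The genre variable and the proportions behind a stacked bar chart could be built roughly as follows. This is only a sketch under stated assumptions: the toy rows, the answer strings (such as "No - just black"), the list in `espresso_drinks`, and the exact rules inside `case_when()` are hypothetical choices made for illustration and would need to be checked against the real survey values.

```r
library(dplyr)
library(stringr)
library(ggplot2)

# Toy stand-in for coffee_survey; answer strings are assumed, not verified.
toy_survey <- tribble(
  ~age,              ~favorite,             ~additions,                                                       ~roast_level,
  "18-24 years old", "Latte",               "Milk, dairy alternative, or coffee creamer, Sugar or sweetener", "Medium",
  "18-24 years old", "Pourover",            "No - just black",                                                "Light",
  "25-34 years old", "Regular drip coffee", "No - just black",                                                "Dark",
  "25-34 years old", "Espresso",            "No - just black",                                                "Medium",
  "25-34 years old", "Cortado",             "No - just black",                                                "Light"
)

# Hypothetical list of espresso-based drinks.
espresso_drinks <- c("Latte", "Cappuccino", "Espresso", "Cortado", "Americano", "Mocha")

# Rules are checked in order; the first match wins.
genres <- toy_survey |>
  mutate(
    coffee_genre = case_when(
      additions == "No - just black" & roast_level == "Dark" ~ "Classic Black",
      str_detect(additions, "Sugar|syrup") & str_detect(additions, "Milk") ~ "Sweetened Coffee",
      favorite %in% espresso_drinks ~ "Specialty Coffee Lovers",
      TRUE ~ "Other"
    )
  )

# Share of each genre within each age group.
genre_by_age <- genres |>
  count(age, coffee_genre) |>
  group_by(age) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# geom_col() stacks bars by default, giving one stacked bar per age group.
genre_plot <- ggplot(genre_by_age, aes(x = age, y = prop, fill = coffee_genre)) +
  geom_col() +
  labs(x = "Age group", y = "Proportion of respondents", fill = "Coffee genre")
```

Because `case_when()` stops at the first matching rule, the order of the rules decides where respondents who fit several genres end up, so that order is itself an analysis decision. Replacing `age` with `gender` or adding `facet_wrap()` would give the other comparisons described in the plan.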
Q2 Variables Involved:
Dependent (Coffee Consumption Habits): cups, brew, purchase, total_spend, spent_equipment
Independent (Work Style): wfh
Variables to be Created:
- Binarized Coffee Source Category: Brew at Home vs. Purchase Habit, which categorizes respondents as Home Brewers (primarily brew coffee at home) or Buyers (primarily purchase coffee outside)
- Spending Habits Per Cup: average coffee spending per cup (total_spend / cups_per_month, where cups_per_month is derived from cups). Because total_spend and cups are stored as character columns, both must be converted to numeric values first.
Our plan is to calculate average cup consumption per work style category and to compute spending habits (total_spend, spent_equipment) by work style.
We will investigate how much people spend on coffee, both at cafes and on home equipment, based on whether they work from home or in person. We will visualize this using bar charts or violin plots to show the distributions of total_spend (amount spent on coffee) and spent_equipment by work style.
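The per-cup spending calculation could look something like the sketch below. It rests on several assumptions that should be verified against the real data: that cups records cups per day, that a month has 30 days, that the answer strings match those in the toy rows, and that each answer band can be replaced by a single representative number. The lookup vectors `cups_lookup` and `spend_lookup` and their values are hypothetical.

```r
library(dplyr)

# Toy stand-in for coffee_survey; answer strings are assumed, not verified.
toy_survey <- tribble(
  ~wfh,                         ~cups, ~total_spend,
  "I primarily work from home", "2",   "$20-$40",
  "I primarily work from home", "1",   "<$20",
  "I primarily work in person", "3",   "$40-$60",
  "I primarily work in person", "1",   "$20-$40",
  "I do a mix of both",         "2",   "$40-$60"
)

# Hypothetical numeric stand-ins for each answer band.
cups_lookup <- c("Less than 1" = 0.5, "1" = 1, "2" = 2, "3" = 3, "4" = 4, "More than 4" = 5)
spend_lookup <- c("<$20" = 10, "$20-$40" = 30, "$40-$60" = 50,
                  "$60-$80" = 70, "$80-$100" = 90, ">$100" = 110)

spend_by_wfh <- toy_survey |>
  mutate(
    cups_per_day   = unname(cups_lookup[cups]),          # look up by answer text
    monthly_spend  = unname(spend_lookup[total_spend]),
    cups_per_month = cups_per_day * 30,                  # assumed 30-day month
    spend_per_cup  = monthly_spend / cups_per_month
  ) |>
  group_by(wfh) |>
  summarise(
    mean_cups_per_day  = mean(cups_per_day, na.rm = TRUE),
    mean_spend_per_cup = mean(spend_per_cup, na.rm = TRUE),
    n = n()
  )
```

Answers that are missing or not present in a lookup vector become `NA` and are dropped from the means by `na.rm = TRUE`. Before applying this to the real data, it would be worth comparing `unique(coffee_survey$cups)` and `unique(coffee_survey$total_spend)` with the lookup names.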