Project proposal

Author

Dank-Bop (Abigail Grizancic, Ava Zhang)

library(tidyverse)
#install.packages("skimr")

#renv::restore()

# One-time download: fetch the TidyTuesday CSV and cache it locally as an RDS file
#coffee_survey <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-05-14/coffee_survey.csv')
#write_rds(coffee_survey, "data/coffee-survey.rds")

coffee_survey <- read_rds("data/coffee-survey.rds")

Dataset

A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.

Make sure to load the data and use inline code for some of this information.

This dataset is the “Great American Coffee Taste Test,” which comes from a YouTube livestream hosted by Cometeer, a coffee company, and James Hoffmann, a renowned barista. In total, 4042 people filled out a survey about their coffee habits and their thoughts on the four Cometeer coffees offered at the tasting. The dataset has 57 columns covering both personal questions and questions about the coffees tasted; 44 are character columns and 13 are numeric columns. The summary below gives more information about each column, including the number of missing entries and, for numeric columns, a simplified distribution. We chose this dataset because, as college students, we drink a lot of coffee and would like to investigate trends in other coffee drinkers’ preferences so that we can drink better coffee ourselves.
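As a minimal sketch of how the dimensions quoted above could be computed rather than typed by hand, the block below uses a small toy data frame. `toy_survey` and its values are hypothetical stand-ins for `coffee_survey`, not real survey responses.

```r
# Toy stand-in for coffee_survey (hypothetical values, three rows only)
toy_survey <- data.frame(
  submission_id = c("a1b2c3", "d4e5f6", "g7h8i9"),
  age = c("25-34 years old", "18-24 years old", NA),
  expertise = c(5, 7, 6)
)

# Dimensions of the data
n_rows <- nrow(toy_survey)
n_cols <- ncol(toy_survey)

# Number of columns of each type
n_chr <- sum(sapply(toy_survey, is.character))
n_num <- sum(sapply(toy_survey, is.numeric))
```

In the Quarto source, the same expressions could be placed inline, for example `` `r nrow(coffee_survey)` `` for the number of respondents, so the text stays in sync with the data.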

In the introduction to the report, we will further explore the demographics of the taste testers to see how representative they are of the U.S. as a whole. If they are not representative, we will avoid generalizing our findings beyond the survey respondents.

skimr::skim(coffee_survey)

Data summary
Name	coffee_survey
Number of rows	4042
Number of columns	57
_______________________
Column type frequency:
character	44
numeric	13
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
submission_id	0	1.00	6	6	4042
age	31	0.99	13	15	7
cups	93	0.98	1	11	6
where_drink	70	0.98	7	44	65
brew	385	0.90	5	165	449
brew_other	3364	0.17	2	319	160
purchase	3332	0.18	5	107	89
purchase_other	4011	0.01	4	83	26
favorite	62	0.98	5	32	12
favorite_specify	3926	0.03	3	92	78
additions	83	0.98	5	100	53
additions_other	3994	0.01	3	140	42
dairy	2356	0.42	8	110	175
sweetener	3530	0.13	5	99	82
style	84	0.98	4	11	12
strength	126	0.97	4	15	5
roast_level	102	0.97	4	7	7
caffeine	125	0.97	5	13	3
coffee_a_notes	1464	0.64	1	377	2317
coffee_b_notes	1586	0.61	1	980	2199
coffee_c_notes	1659	0.59	1	438	2163
coffee_d_notes	1454	0.64	1	528	2354
prefer_abc	270	0.93	8	8	3
prefer_ad	281	0.93	8	8	2
prefer_overall	272	0.93	8	8	4
wfh	518	0.87	18	26	3
total_spend	531	0.87	4	8	6
why_drink	474	0.88	5	93	84
why_drink_other	3875	0.04	2	195	163
taste	479	0.88	2	3	2
know_source	483	0.88	2	3	2
most_paid	515	0.87	5	13	8
most_willing	532	0.87	5	13	8
value_cafe	542	0.87	2	3	2
spent_equipment	536	0.87	7	16	7
value_equipment	548	0.86	2	3	2
gender	519	0.87	4	22	5
gender_specify	4030	0.00	2	28	11
education_level	604	0.85	15	34	6
ethnicity_race	624	0.85	15	29	6
ethnicity_race_specify	3937	0.03	2	53	82
employment_status	623	0.85	7	18	6
number_children	636	0.84	1	11	5
political_affiliation	753	0.81	8	14	4

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
expertise	104	0.97	5.69	1.95	1	5	6	7	10	▂▃▇▇▁
coffee_a_bitterness	244	0.94	2.14	0.95	1	1	2	3	5	▅▇▃▂▁
coffee_a_acidity	263	0.93	3.63	0.98	1	3	4	4	5	▁▂▅▇▃
coffee_a_personal_preference	253	0.94	3.31	1.19	1	2	3	4	5	▂▅▆▇▅
coffee_b_bitterness	262	0.94	3.01	0.99	1	2	3	4	5	▂▅▇▆▁
coffee_b_acidity	275	0.93	2.22	0.87	1	2	2	3	5	▃▇▅▁▁
coffee_b_personal_preference	269	0.93	3.07	1.11	1	2	3	4	5	▂▆▇▆▃
coffee_c_bitterness	278	0.93	3.07	1.00	1	2	3	4	5	▁▅▇▆▂
coffee_c_acidity	291	0.93	2.37	0.92	1	2	2	3	5	▃▇▆▂▁
coffee_c_personal_preference	276	0.93	3.06	1.13	1	2	3	4	5	▂▆▇▆▃
coffee_d_bitterness	275	0.93	2.16	1.08	1	1	2	3	5	▇▇▅▂▁
coffee_d_acidity	277	0.93	3.86	1.01	1	3	4	5	5	▁▂▃▇▆
coffee_d_personal_preference	278	0.93	3.38	1.45	1	2	4	5	5	▅▃▃▆▇

Questions

The two questions you want to answer.

How does favorite coffee genre (style, additions, sweetener, etc.) vary across demographic groups (e.g., age, gender, and employment status)?
Does work style (work from home vs. hybrid vs. on-site) affect coffee consumption habits (amount consumed, willingness to brew at home)?

Analysis plan

A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

Q1 Variables Involved:

Dependent (Coffee Preferences): favorite, style, additions, sweetener, dairy, roast_level, strength, caffeine
Independent (Demographics): age, gender, employment_status, ethnicity_race
Variables to be Created:
- Coffee Genre Clustering: categorical clusters of coffee preferences based on style, additions, sweetener, dairy, and roast_level. For example: Classic Black (No additions, dark roast); Sweetened Coffee (Uses sugar/syrup and milk); Specialty Coffee Lovers (Preference for espresso-based drinks)
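One way the genre variable could be built is with rule-based labels in `dplyr::case_when()`. This is only a sketch: the toy rows, the category strings being matched, and the rules themselves are illustrative assumptions, and the final rules would depend on the values actually present in `coffee_survey`.

```r
library(dplyr)

# Toy stand-in for a few preference columns (hypothetical values)
toy_prefs <- tibble(
  favorite = c("Pourover", "Latte", "Regular drip coffee", "Cappuccino"),
  additions = c("No - just black",
                "Milk, dairy alternative, or coffee creamer",
                "Sugar or sweetener",
                "No - just black"),
  roast_level = c("Light", "Medium", "Dark", "Medium")
)

# Assumed, illustrative rules; case_when() uses the first rule that matches,
# so the order of the rules determines the label when several apply
toy_prefs <- toy_prefs %>%
  mutate(
    coffee_genre = case_when(
      favorite %in% c("Latte", "Cappuccino", "Espresso") ~ "Specialty",
      grepl("Sugar|Milk", additions) ~ "Sweetened or milky",
      grepl("black", additions) ~ "Classic black",
      TRUE ~ "Other"
    )
  )
```

Rule-based labels keep each genre easy to explain in the report; a statistical clustering method would be an alternative if the rules turn out to be too coarse.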

Our plan is to calculate the frequency of different coffee preferences across demographics by generating summary statistics for each coffee preference variable and visualizing them with stacked bar charts. We will also compare coffee preferences (e.g., style, roast_level) across age groups and genders. To avoid overloading the graphs, we will use the clustered coffee genres to represent combinations of coffee preferences instead of looking at each variable separately. We may also facet the plots to better show differences between groups, if needed.
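The frequency calculation described above might look like the following sketch. It assumes a `coffee_genre` column has already been created; `toy_genres` and its values are hypothetical.

```r
library(dplyr)

# Toy stand-in with one demographic variable and the derived genre (hypothetical values)
toy_genres <- tibble(
  age = c("18-24 years old", "18-24 years old", "25-34 years old",
          "25-34 years old", "25-34 years old", NA),
  coffee_genre = c("Classic black", "Specialty", "Classic black",
                   "Classic black", "Specialty", "Specialty")
)

# Count each genre within each age group, then convert counts to within-group shares
genre_by_age <- toy_genres %>%
  filter(!is.na(age), !is.na(coffee_genre)) %>%
  count(age, coffee_genre) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
```

Using within-group proportions rather than raw counts keeps age groups of very different sizes comparable in a stacked bar chart.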

Q2 Variables Involved:

Dependent (Coffee Consumption Habits): cups, brew, purchase, total_spend, spent_equipment
Independent (Work Style): wfh
Variables to be Created:
- Binarized Brewing Habit: Home Brewers (primarily brew coffee at home) vs. Buyers (primarily purchase coffee outside the home).
- Spending Habits Per Cup: average coffee spending per cup (total_spend / cups_per_month, where cups_per_month would be derived from cups). Because cups and total_spend are both stored as character categories, they would first need to be converted to numeric values.
- Average cup consumption per work style category.
- Spending habits (total_spend, spent_equipment) by work style.
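The per-cup spending variable could be sketched as follows. Several things here are assumptions rather than facts from the proposal: that `cups` records cups per day and `total_spend` records spending per month, the exact category labels, the numeric value assigned to each category (roughly a midpoint), and the 30-day month. The toy rows are hypothetical.

```r
library(dplyr)

# Toy stand-in for the Q2 columns (hypothetical values)
toy_habits <- tibble(
  wfh = c("I primarily work from home", "I primarily work from home",
          "I primarily work in person", "I primarily work in person"),
  cups = c("2", "More than 4", "1", "Less than 1"),
  total_spend = c("$20-$40", "$80-$100", "<$20", "$40-$60")
)

# ASSUMED numeric stand-ins for each category (not from the source)
cups_per_day <- c("Less than 1" = 0.5, "1" = 1, "2" = 2,
                  "3" = 3, "4" = 4, "More than 4" = 5)
spend_per_month <- c("<$20" = 10, "$20-$40" = 30, "$40-$60" = 50,
                     "$60-$80" = 70, "$80-$100" = 90, ">$100" = 110)

# Look up the numeric value for each response, then derive spending per cup
toy_habits <- toy_habits %>%
  mutate(
    cups_per_month = unname(cups_per_day[cups]) * 30,
    spend_numeric = unname(spend_per_month[total_spend]),
    spend_per_cup = spend_numeric / cups_per_month
  )

# Average consumption and per-cup spending for each work style
by_workstyle <- toy_habits %>%
  group_by(wfh) %>%
  summarise(
    mean_cups_per_day = mean(cups_per_month / 30),
    mean_spend_per_cup = mean(spend_per_cup),
    .groups = "drop"
  )
```

Because the open-ended categories ("Less than 1", "More than 4", ">$100") have no true midpoint, any per-cup figure from this approach is approximate and should be reported as such.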

We will investigate how much people spend on coffee, both at cafes and on home equipment, depending on whether they work from home, in person, or a mix of both. We will visualize this using bar charts or violin plots showing the distributions of total_spend (amount spent on coffee) and spent_equipment by work style.
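A possible starting point for the bar chart is sketched below. Since total_spend is stored as a character column, a proportional stacked bar chart works on the categories directly, whereas a violin plot would first require a numeric version of the variable. The toy rows and the category labels and their order are assumptions.

```r
library(ggplot2)

# Toy stand-in for work style and monthly spending (hypothetical values)
toy_spend <- data.frame(
  wfh = c("I primarily work from home", "I primarily work from home",
          "I do a mix of both", "I primarily work in person",
          "I primarily work in person"),
  total_spend = c("<$20", "$40-$60", "$20-$40", "<$20", ">$100")
)

# ASSUMED category order, so the legend runs from lowest to highest spending
spend_levels <- c("<$20", "$20-$40", "$40-$60", "$60-$80", "$80-$100", ">$100")
toy_spend$total_spend <- factor(toy_spend$total_spend, levels = spend_levels)

# position = "fill" rescales each bar to proportions within a work style
spend_plot <- ggplot(toy_spend, aes(x = wfh, fill = total_spend)) +
  geom_bar(position = "fill") +
  labs(
    x = "Work style",
    y = "Proportion of respondents",
    fill = "Monthly coffee spending"
  )
```

The same pattern could be reused for spent_equipment by swapping the fill variable and its level order.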