1000 Coffee for 1000 Coffee Lovers

Author

Dank Bop (Ava Zhang, Abigail Grizancic, Mohammad Hamzah)

Introduction

Our study, 1000 Coffee for 1000 Coffee Lovers, examines coffee behavior among survey respondents drawn from a range of demographic groups. The dataset, coffee_survey, is a TidyTuesday dataset consisting of 4,042 observations and 57 variables. It captures a wide range of information related to coffee consumption, including respondents’ coffee preferences (e.g., favorite type, brew method, strength, roast level), spending habits (such as total monthly coffee spend and spending on at-home equipment), and detailed demographic data (age, gender, ethnicity, employment status, and more). This structure allows us to analyze how coffee-related behaviors and spending patterns vary across demographic segments. We focus on two things: how coffee preferences differ across demographic groups, and how spending habits differ across work modalities.

To assess whether our findings generalize to the national population, we compared the demographic composition of the survey respondents (i.e., age, race, sex, and work status) with that of the U.S. as a whole. Initial observations suggest that our sample skews slightly young, with the 18–34 age bracket overrepresented relative to the national age distribution. In terms of gender, our survey has a slightly higher percentage of women than national estimates. The racial profile also differs somewhat: whereas U.S. Census data indicate that approximately 76% of the population is White, our sample has a slightly lower percentage and relatively higher coverage of minority groups. Employment status also departs from the broader U.S. distribution, since our respondents are predominantly full-time employees, with fewer part-time employees and retirees than would be found nationwide. Understanding these demographic differences is crucial because they directly affect how our analysis of coffee usage and purchase habits should be interpreted. If our sample does not reflect the U.S. population, any conclusions drawn from it may be biased or narrow in application. By cross-referencing our sample with national demographic statistics, we can better contextualize our results and account for sampling bias. Such a comparison is particularly important for the coffee industry and related market stakeholders, who need accurate consumer profiles to inform product development and marketing efforts.

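# load packages used throughout the analysis
library(tidyverse)
library(scales)  # percent_format()

# coffee_survey is assumed to be loaded already from the TidyTuesday source cited in the Data section

# prepare data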
demographics <- coffee_survey |>
  select(age, ethnicity_race, gender, employment_status) |>
  filter(!is.na(age), !is.na(ethnicity_race), !is.na(gender), !is.na(employment_status))

# pivot the data to long format
demographics_summary <- demographics |>
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") |>
  group_by(variable, value) |>
  summarise(count = n(), .groups = "drop") |>
  group_by(variable) |>
  mutate(proportion = count / sum(count)) |>
  ungroup()

# summary statistics
ggplot(demographics_summary, aes(x = value, y = proportion, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ variable, scales = "free_x") +
  labs(
    title = "Summary Statistics for Key Demographics in the Coffee Survey",
    x = "Category",
    y = "Proportion"
  ) +
  scale_y_continuous(labels = percent_format()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position='none')
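The comparison against national figures described in the introduction could be sketched as a join between sample proportions and a benchmark table. This is only an illustration: `sample_props` and `national_benchmark` are hypothetical names, the sample values are made up, and the only benchmark taken from the text is the approximate 76% White share. Real Census figures and the actual `demographics_summary` table would replace them.

```r
library(dplyr)

# Toy stand-in for demographics_summary (hypothetical values, NOT survey results)
sample_props <- tibble(
  variable = "ethnicity_race",
  value = c("White", "All other groups"),
  proportion = c(0.70, 0.30)
)

# Benchmark table: 0.76 is the approximate Census share quoted in the text;
# the complement is simply 1 - 0.76
national_benchmark <- tibble(
  variable = "ethnicity_race",
  value = c("White", "All other groups"),
  national = c(0.76, 0.24)
)

# Positive difference = group is overrepresented in the sample
comparison <- sample_props |>
  left_join(national_benchmark, by = c("variable", "value")) |>
  mutate(difference = proportion - national)

comparison
```

Because proportions sum to one within each variable, the differences within a variable always sum to zero, which makes a quick sanity check on the benchmark table.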

Question 1 <- How Do Coffee Preferences Vary Across People of Different Age and Employment Status?

Introduction

Coffee is an integral part of daily life for many people, but preferences for different coffee types and characteristics can vary significantly across age and employment groups. Factors such as cultural trends, lifestyle habits, and exposure to specialty coffee may shape these preferences. To explore these trends, we utilize demographic variables (age, employment status) along with coffee-related attributes such as:

  • Favorite coffee type (e.g., latte, espresso, cold brew)

  • Coffee strength (somewhat light, medium, very strong)

  • Roast level (light, medium, dark, etc.)

  • Caffeine preference (decaf, half-caf, full caffeine)

We are interested in this question because it can help us understand how coffee preferences evolve over time and are shaped by lifestyle factors. By analyzing these relationships, we aim to uncover whether coffee choices reflect personal taste, work routines, or generational trends.

Approach

To analyze this question, we use a heatmap and a colored stacked bar chart that pair demographic traits with coffee traits, in the hope of uncovering more specific trends. Both figures show frequencies as proportions normalized within each group, so that age groups and employment groups with different sample sizes are represented fairly and trends can be compared across all groups.

Figure 1:

  • X-axis: Favorite coffee type

  • Y-axis: Age group

  • Fill color: Proportion of individuals within that age group who favor each coffee type

Figure 2:

  • X-axis: Coffee trait (Strength, Roast Level, or Caffeine)

  • Y-axis: Proportion of people within each employment category who prefer each trait

  • Fill color: Employment status
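The within-group normalization described in the approach can be illustrated on a small made-up example. The data below is hypothetical (`toy` is not part of the survey) and only shows why proportions, rather than raw counts, make groups of different size comparable.

```r
library(dplyr)

# Toy data (hypothetical): two age groups of very different size
toy <- tibble(
  age = c(rep("18-24 years old", 4), rep("25-34 years old", 10)),
  favorite = c("Latte", "Latte", "Pourover", "Pourover",
               rep("Latte", 5), rep("Pourover", 5))
)

# Count each (age, favorite) pair, then divide by the total for that age group
toy_props <- toy |>
  count(age, favorite) |>
  group_by(age) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

toy_props
```

The raw latte counts differ (2 versus 5), but the within-group proportion is one half in both age groups, which is the quantity the fill color in Figure 1 is meant to encode.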

Analysis

coffee_cleaned <- coffee_survey |> 
  select(age, gender, employment_status, strength, roast_level, caffeine, favorite) |> 
  filter(!is.na(strength) & !is.na(roast_level) & !is.na(caffeine) & !is.na(favorite) & !is.na(employment_status) & !is.na(gender) & !is.na(age)) |> 
  mutate(
    employment_status = factor(employment_status)
  )
# total count per age group
age_group_totals <- coffee_cleaned |> 
  count(age) |> 
  rename(total_n = n) 

# coffee type count within each age group
age_favorite_heatmap <- coffee_cleaned |> 
  count(age, favorite) |> 
  left_join(age_group_totals, by = "age") |> 
  mutate(prop = n / total_n) |>  # proportion 
  complete(age, favorite, fill = list(prop = 0))  # Fill missing with 0 

age_favorite_heatmap <- age_favorite_heatmap |> 
  mutate(
    age = factor(age, levels = c("<18 years old", "18-24 years old", "25-34 years old", "35-44 years old", "45-54 years old", "55-64 years old", ">65 years old"))
  )

ggplot(age_favorite_heatmap, aes(x = favorite, y = age, fill = prop)) +
  geom_tile() +
  scale_fill_gradient(low = "beige", high = "brown", labels = scales::percent) +
  labs(
    title = "Favorite Coffee Type by Age Group",
    x = "Favorite Coffee Type",
    y = "Age Group",
    fill = "Proportion"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels for readability
    legend.position = "right"
  )

# Filtering and preparing data
cleaned_data <- coffee_cleaned |> 
  filter(!is.na(age), !is.na(employment_status), 
         !is.na(strength), !is.na(roast_level), !is.na(caffeine)) 

# Ordering categorical variables
cleaned_data <- cleaned_data |> 
  mutate(
    employment_status = factor(employment_status, levels = c("Student", "Employed part-time", "Employed full-time", "Homemaker", "Retired", "Unemployed")),
    age = factor(age, levels = sort(unique(age))), 
    strength = factor(strength, levels = c("Weak", "Somewhat light", "Medium", "Somewhat strong", "Very strong")),
    roast_level = factor(roast_level, levels = c("Blonde", "Light", "French", "Nordic", "Medium", "Dark", "Italian")),
    caffeine = factor(caffeine, levels = c("Decaf", "Half caff", "Full caffeine"))
  )

# Transforming data for stacked bar plot
coffee_traits_stacked <- cleaned_data |> 
  pivot_longer(cols = c(strength, roast_level, caffeine), 
               names_to = "trait", 
               values_to = "preference") |> 
  count(employment_status, trait, preference) |> 
  group_by(employment_status, trait) |> 
  mutate(prop = n / sum(n)) |> 
  ungroup()

custom_colors <- c(
  # Honey/Golden Shades (5)
  "#FFD699", "#FFC266", "#FFA31A", "#E68A00", "#CC7000",
  
  # Brown/Coffee Shades (6)
  "#D2B48C", "#C19A6B", "#A67B5B", "#8B5A2B", "#734A12", "#5A3312",
  
  # Red/Cinnamon Shades (3)
  "#E97451", "#C04000", "#8B0000"
)

ggplot(coffee_traits_stacked, aes(x = employment_status, y = prop, fill = preference))+
  geom_col(position = "fill") +  
  facet_wrap(~trait, scales = "free_y") + 
  labs(
    title = "Strength, Roast Level, and Caffeine Preference by Employment Status",
    x = "Employment Status",
    y = "Proportion",
    fill = "Preference"
  ) +
  scale_y_continuous(labels = scales::percent) +  
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "right"
  )+ 
  scale_fill_manual(values = custom_colors) + 
  guides(fill = guide_legend(reverse = TRUE))

Discussion

We find that:

  • Younger respondents (<18 years old) have a strong preference for sweet, milk-based coffee drinks such as lattes (chosen as the favorite drink by 50% of this group). Older respondents (>65 years old) prefer traditional coffee types like regular drip coffee and espresso.

  • The 18-24 and 25-34 age groups show the most diverse range of preferences, with cold brew, cappuccino, and iced coffee appearing frequently.

  • Pourover coffee and mochas are moderately popular across all age groups, but more prevalent among younger drinkers.

  • Students are more likely to drink full-caffeine coffee, preferring lighter roasts.

  • Retirees and homemakers are more likely to consume decaf coffee, preferring darker roasts.

  • Full-time workers tend to favor medium and strong coffee, with a high preference for dark roasts. Part-time workers have similar preferences to full-time workers but show a slightly more diverse range in roast levels.

These findings suggest that each generation (age group) has distinct coffee preference trends, which could result from a combination of economic change, cultural marketing, and developments in the coffee industry. Employment status also appears to shape coffee consumption habits: people with demanding schedules (students, full-time workers) favor high-caffeine coffee, while those with more flexibility (retired individuals, homemakers) may opt for lower-caffeine alternatives.
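Whether differences like these are distinguishable from chance could be checked with a chi-squared test of independence. The sketch below uses a small made-up contingency table, not the survey data; with the real data, something like `table(cleaned_data$employment_status, cleaned_data$caffeine)` would take the place of `toy_counts` (a hypothetical name).

```r
# Hypothetical counts (NOT survey results):
# rows = employment status, columns = caffeine preference
toy_counts <- matrix(
  c( 5, 10, 85,
    15, 15, 70),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    employment_status = c("Student", "Retired"),
    caffeine = c("Decaf", "Half caff", "Full caffeine")
  )
)

# Test of independence between the row and column variables
test_result <- chisq.test(toy_counts)

test_result$statistic  # chi-squared statistic
test_result$parameter  # degrees of freedom: (2 - 1) * (3 - 1) = 2
test_result$p.value
```

A small p-value would indicate that caffeine preference and employment status are associated in the sample; it would not by itself say anything about causation.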

Question 2 <- How does employment modality affect coffee-based spending?

Introduction

In Question 1, we looked at age and employment status in relation to coffee preferences. For this question, we investigate how work modality (work from home vs. in-office) affects coffee spending, both on coffee itself and on at-home equipment. We also investigate how estimated spending per cup of coffee changes with work modality. These investigations could reveal which groups companies should target when marketing particular equipment or drinks.

We use the wfh column to define work modality, the total_spend column to define spending on coffee, and the spent_equipment column to define spending on coffee equipment. To estimate spending per cup of coffee, we divide total_spend (the total amount spent on coffee in a month) by the estimated number of cups per month, derived from cups (the number of cups drunk per day).

Approach

The first figure, which examines spending on coffee and on coffee equipment by work modality, is a pair of filled bar charts. This plot type suits the data because the survey reports both spending measures as categorical brackets. Filling each bar to 100% lets us compare the mix of work modalities across spending brackets even though the brackets contain very different numbers of respondents.

  • X: spending ($)

  • Y: proportion

  • color: modality

  • facet: coffee (monthly), equipment (5yrs)
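One point worth checking about a filled bar chart: `position = "fill"` rescales each bar (here, each spending bracket) to 100%, so it shows the mix of modalities within a bracket rather than the spread of brackets within a modality. If the two modalities differ a lot in size, the second view can be computed explicitly. The sketch below uses made-up data; `toy_spend` and its counts are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical): three times as many remote as in-person respondents
toy_spend <- tibble(
  wfh = c(rep("Work From Home", 6), rep("In-Person", 2)),
  total_spend = c("<$20", "<$20", "<$20", "$20-$40", "$20-$40", "$20-$40",
                  "<$20", "$20-$40")
)

# Share of each modality within a spending bracket
# (this is what position = "fill" displays)
within_bracket <- toy_spend |>
  count(total_spend, wfh) |>
  group_by(total_spend) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# Share of each spending bracket within a modality
# (not affected by how many respondents each modality has)
within_modality <- toy_spend |>
  count(wfh, total_spend) |>
  group_by(wfh) |>
  mutate(prop = n / sum(n)) |>
  ungroup()
```

In this toy example both modalities spend identically (half in each bracket), yet every filled bar is 75% Work From Home simply because that group is larger. Reading the filled chart therefore means comparing bars against each other rather than against a 50% line.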

Because most of the spending data is categorical, we estimate price per cup by taking the middle value of each spending bracket and treating it as a true numeric value. We restrict this analysis to respondents who said they drink 1-4 cups per day, since those answers are given as numbers and are easier to convert to cups per month than categories like <1 and >5. Once we divide overall spending by cups per month, we use violin plots with overlaid boxplots, which mark the median and quartiles, to show spending by work modality. This plot type suits the question because it shows the shape of the spending distribution within each work modality.

  • X: amount spent per cup ($)

  • Y: work modality

  • color: work modality (for continuity with graph 1)
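The midpoint conversion and per-cup arithmetic described in the approach can be worked through on a single respondent. The helper `estimate_spend_per_cup` below is a hypothetical name introduced for illustration, and the 30-day month is a simplifying assumption; the bracket values mirror the ones used in the analysis code.

```r
# Bracket midpoints; ">$100" has no upper bound, so 110 is an assumed stand-in
spend_midpoints <- c("<$20" = 10, "$20-$40" = 30, "$40-$60" = 50,
                     "$60-$80" = 70, "$80-$100" = 90, ">$100" = 110)

# Hypothetical helper: monthly spend bracket + cups per day -> estimated $ per cup
estimate_spend_per_cup <- function(total_spend, cups_per_day, days_per_month = 30) {
  unname(spend_midpoints[total_spend]) / (cups_per_day * days_per_month)
}

# A respondent in the "$40-$60" bracket who drinks 2 cups a day:
# 50 / (2 * 30), roughly $0.83 per cup
estimate_spend_per_cup("$40-$60", 2)
```

Because every respondent in a bracket receives the same midpoint, the resulting per-cup values take only a small number of distinct levels, which is worth keeping in mind when reading the violin plot.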

Analysis

q2a_data <- coffee_survey |>
  filter(!is.na(spent_equipment), !is.na(wfh), !is.na(total_spend),
         wfh != "I do a mix of both") |>
  mutate(spent_equipment = as.factor(spent_equipment) |> fct_relevel(c("Less than $20", "$20-$50", "$50-$100", "$100-$300", "$300-$500", "$500-$1000", "More than $1,000")),
         total_spend = as.factor(total_spend) |> fct_relevel(c("<$20", "$20-$40", "$40-$60", "$60-$80", "$80-$100", ">$100")),
         wfh = case_when(
           wfh == "I primarily work from home" ~ "Work From Home",
           wfh == "I primarily work in person" ~ "In-Person"
         )) |>
  select(spent_equipment, wfh, total_spend)
  
  
at_home <- q2a_data |>
  ggplot(aes(x = spent_equipment, fill = wfh)) +
  geom_bar(position = "fill") +
  labs(title = "Amount Spent on at-home equipment", 
       subtitle = "total over the last five years",
       x = "Amount Spent ($)",
       y = "Proportion of respondents") +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_manual(values = c("tan", "burlywood4"), guide = "none") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 20))

total_spend <- q2a_data |>
  ggplot(aes(x = total_spend, fill = wfh)) +
  geom_bar(position = "fill") +
  labs(title = "Amount Spent on coffee", 
       subtitle = "Average monthly amount",
       x = "Amount Spent ($)",
       y = "Proportion of respondents", 
       fill="Work Modality") +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_manual(values = c("tan", "burlywood4")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 20))

  
  
# place the two plots side by side (requires the gridExtra package)
gridExtra::grid.arrange(at_home, total_spend, ncol = 2, layout_matrix = rbind(c(1,1,1,2, 2, 2, 2), c(1,1,1,2,2,2,2)))

# Combine spending measures into one dataset
spend_combined <- coffee_survey |>
  filter(cups %in% c('1','2','3','4'),
         !is.na(total_spend),
         !is.na(spent_equipment),
         !is.na(wfh)) |>
  mutate(
    # Convert total_spend to a numeric value for spending per cup
    spend_num = case_when(
      total_spend == ">$100" ~ 110,
      total_spend == "$40-$60" ~ 50,
      total_spend == "$20-$40" ~ 30,
      total_spend == "$60-$80" ~ 70,
      total_spend == "<$20" ~ 10,
      total_spend == "$80-$100" ~ 90
    ),
    cups = as.numeric(cups),
    spend_cup = spend_num / (cups * 30),
    # Convert spent_equipment to numeric midpoints
    spent_equipment_num = case_when(
      spent_equipment == "Less than $20" ~ 10,
      spent_equipment == "$20-$50" ~ 35,
      spent_equipment == "$50-$100" ~ 75,
      spent_equipment == "$100-$300" ~ 200,
      spent_equipment == "$300-$500" ~ 400,
      spent_equipment == "$500-$1000" ~ 750,
      spent_equipment == "More than $1,000" ~ 1200
    ),
    # Recode work modality
    wfh = recode(wfh,
                 "I primarily work from home" = "Work From Home",
                 "I primarily work in person" = "In-Person")
  ) |>
  select(wfh, spend_cup, spent_equipment_num)

# Pivot data to long format so that we have a single spending value column
spend_long <- spend_combined |>
  pivot_longer(
    cols = c(spend_cup, spent_equipment_num),
    names_to = "spending_type",
    values_to = "spending_value"
  )

# Create faceted violin plots with overlaid boxplots
ggplot(spend_long, aes(x = spending_value, y = wfh, fill = spending_type)) +
  geom_violin(trim = FALSE, alpha = 0.7, scale = "width") +
  geom_boxplot(width = 0.1, outlier.shape = NA, position = position_dodge(width = 0.9), fill=NA) +
  facet_wrap(~ spending_type, scales = "free_x",
             labeller = as_labeller(c(spend_cup = "Spending per Cup", 
                                      spent_equipment_num = "At-Home Equipment Spending"))) +
  scale_fill_manual(
    values = c("spend_cup" = "tan", "spent_equipment_num" = "burlywood4"),
    guide = "none"
  ) +
  labs(
    title = "Spending Distributions by Work Modality",
    x = "Spending Value ($)",
    y = "Work Modality"
  ) +
  scale_x_continuous(labels = scales::label_dollar()) +
  theme_minimal()

Discussion

Our original prediction was that people who work from home would be more likely to spend heavily on at-home equipment, since they can use it more, and that people who work in person would spend more on coffee overall because they buy to-go coffee at coffee shops. The graph on the left supports the first part of this prediction: the higher equipment-spending brackets contain a higher proportion of people who work from home. However, people who work from home also appear to spend more on coffee in general. One possible explanation is that people who work in person use their office coffee maker instead of buying coffee; another is that people with more expensive coffee equipment also buy more expensive coffee for it.

The second graph explores spending per cup by work modality. In keeping with our earlier prediction, we expected people who work in person to spend more per cup because they buy more to-go coffee. Instead, the graph shows that people who work from home spend slightly more per cup than people who work in person. It also shows that most people spend very little per cup of coffee, which suggests that most respondents make coffee at home rather than buying to-go coffee or going to coffee shops.

The summary statistics we calculated and graphed for our survey sample give us a data-driven basis for assessing how representative our data is of various demographics, and thus how generalizable our findings are.

Presentation

Our presentation can be found here.

Data

This analysis utilizes the coffee_survey dataset from the TidyTuesday project. In October 2023, “world champion barista” James Hoffmann and coffee company Cometeer hosted a survey on YouTube, where viewers provided tasting responses about 4 types of coffees they ordered from Cometeer. Data blogger Robert McKeon Aloe analyzed the data, and compiled the dataset the following month.

TidyTuesday. (2024, May 14). Coffee for 1000 Coffee Lovers [Dataset]. Retrieved from https://github.com/rfordatascience/tidytuesday/tree/main/data/2024/2024-05-14 on Feb 11, 2025.

References

  1. TidyTuesday. (2024, May 14). Coffee for 1000 Coffee Lovers [Dataset]. Retrieved from https://github.com/rfordatascience/tidytuesday/tree/main/data/2024/2024-05-14 on Feb 11, 2025.
  2. Cornell University. (n.d.). Data citation. Retrieved from https://data.research.cornell.edu/data-management/storing-and-managing/data-citation/ on Mar 6, 2025.