library(tidyverse)
library(lubridate)
Project proposal
Dataset
cuisines_df <- read_csv("data/cuisines.csv")
Rows: 2218 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): name, country, url, author, ingredients
dbl (11): calories, fat, carbs, protein, avg_rating, total_ratings, reviews...
date (1): date_published
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Description
This project uses data scraped from https://www.allrecipes.com/. The dataset contains two complementary tables: allrecipes, with 14,426 general recipes, and cuisines, with 2,218 recipes categorized by country of origin. Both tables include recipe information such as ingredients, nutritional facts (calories, fat, carbs, protein), preparation and cooking times, ratings, and review metadata. For our purposes, only the cuisines table will be used; it has 2,218 rows and 17 columns. The collected data has entries from 2009 to 2025 and includes cuisines from 49 countries.
Structure
The important variables used for this project are:
date_published (date): When the recipe was published/updated
ingredients (character): The ingredients of the recipe
country (character): The country/region the cuisine is from
calories (numeric): Calories per serving
fat (numeric): Grams of fat per serving
protein (numeric): Grams of protein per serving
carbs (numeric): Grams of carbohydrates per serving
avg_rating (numeric): Average rating of the recipe
Why this dataset?
As avid food enthusiasts, we were immediately drawn to a cuisine-focused dataset because we are interested in exploring the diversity of foods prepared across different countries.
Notes
All usages of the term “cuisine” in this project refer to the ‘country’ column of the dataset.
Questions
How have the top 10 most-used ingredients changed over time across the 5 most frequent cuisines?
How do macronutrients (protein, carbs, fats) compare across the highest- and lowest-rated cuisines?
Analysis plan
Answering Question 1
How have the top 10 most-used ingredients changed over time across the 5 most frequent cuisines?
Variables to be used:
date_published, ingredients, country
Get the top 5 most frequent cuisines: Count the number of times each cuisine appears and keep the 5 cuisines with the highest counts, e.g. count(country, sort = TRUE) |> slice_head(n = 5).
Clean the ingredients column: In the original dataset, each cell in the ingredients column contains a comma-separated list of ingredients. To extract the individual ingredients for each recipe, we plan to split the values using a comma delimiter. Then, we will remove measurements (numbers), additional characters such as parentheses, dashes, and slashes, and any words indicating measurements or descriptions, such as “tablespoon” and “cubed”.
Extract top ingredients per country and date_published: Group by country, and get the frequency count for each ingredient across all recipes within that cuisine. Store the top 10 ingredients by frequency count in a new data frame called top_10, grouped by both date_published and country, where each top-10 ingredient is stored in a separate enumerated column, such as ingredient_1.
Extract year column: Extract the year from the date_published column using the lubridate package.
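The steps for Question 1 could be sketched roughly as follows. This is a minimal sketch on made-up rows, not our final implementation: toy_df, drop_words, and the intermediate names ingredients_long and top_10_long are hypothetical, the list of measurement words is deliberately incomplete, and we assume the grouping is done by year rather than by full date.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

# Toy stand-in for cuisines_df; these rows are made up for illustration.
toy_df <- tibble::tribble(
  ~country,  ~date_published, ~ingredients,
  "Italian", "2020-03-01",    "2 cups flour, 1 tablespoon olive oil, 3 cloves garlic (minced)",
  "Italian", "2020-07-15",    "1 cup flour, 2 tablespoons olive oil",
  "Mexican", "2021-01-10",    "1/2 cup corn, 2 cloves garlic, 1 teaspoon salt"
) |>
  mutate(date_published = as.Date(date_published))

# Assumed (incomplete) list of measurement and description words to drop.
drop_words <- c("cups?", "tablespoons?", "teaspoons?", "cloves?", "minced", "cubed")
drop_pattern <- str_c("\\b(", str_c(drop_words, collapse = "|"), ")\\b")

# Most frequent cuisines (n = 5 in the plan).
top_cuisines <- toy_df |>
  count(country, sort = TRUE) |>
  slice_head(n = 5)

# One row per cleaned ingredient, plus a year column.
ingredients_long <- toy_df |>
  semi_join(top_cuisines, by = "country") |>
  mutate(year = year(date_published)) |>
  separate_longer_delim(ingredients, delim = ",") |>
  mutate(
    ingredient = ingredients |>
      str_to_lower() |>
      str_remove_all("[0-9]") |>        # measurements (numbers)
      str_remove_all("[()/\\-]") |>     # parentheses, slashes, dashes
      str_remove_all(drop_pattern) |>   # measurement and description words
      str_squish()
  ) |>
  filter(ingredient != "")

# Top 10 ingredients per country and year, in long form.
top_10_long <- ingredients_long |>
  count(country, year, ingredient, name = "frequency") |>
  group_by(country, year) |>
  slice_max(frequency, n = 10, with_ties = FALSE) |>
  arrange(desc(frequency), ingredient, .by_group = TRUE) |>
  mutate(rank = row_number()) |>
  ungroup()

# Wide form with enumerated columns ingredient_1, ingredient_2, ...
top_10 <- top_10_long |>
  pivot_wider(
    id_cols = c(country, year),
    names_from = rank,
    values_from = ingredient,
    names_prefix = "ingredient_"
  )
```

Keeping a long-form table (one row per country, year, and ingredient) alongside the wide top_10 table is a design choice on our part: the long form is the shape ggplot2 expects, while the wide form matches the enumerated columns described above. Splitting on commas will also split ingredients whose description contains a comma, so the real cleaning step will likely need more care than this sketch.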
Potential visualizations and statistical analysis:
- Faceted stacked bar graphs
  - x: year
  - y: frequency
  - facet: country
  - fill: ingredient
- Heatmap
  - x: year
  - y: ingredient
  - fill: frequency
  - facet: country
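Both candidate plots could be drawn from a long-form table of yearly ingredient counts. The sketch below uses made-up numbers in a hypothetical toy_counts data frame, and it assumes the heatmap places the ingredient on the y axis and maps frequency to the fill colour.

```r
library(ggplot2)

# Made-up yearly ingredient counts, shaped like a long-form top-10 table.
toy_counts <- data.frame(
  country    = rep(c("Italian", "Mexican"), each = 4),
  year       = rep(c(2020, 2020, 2021, 2021), times = 2),
  ingredient = rep(c("garlic", "salt"), times = 4),
  frequency  = c(5, 3, 6, 2, 4, 4, 3, 7)
)

# Faceted stacked bars: x = year, y = frequency, fill = ingredient.
stacked_bars <- ggplot(toy_counts, aes(x = factor(year), y = frequency, fill = ingredient)) +
  geom_col() +
  facet_wrap(~country) +
  labs(x = "Year", y = "Frequency", fill = "Ingredient")

# Heatmap: x = year, y = ingredient, fill = frequency.
heat_map <- ggplot(toy_counts, aes(x = factor(year), y = ingredient, fill = frequency)) +
  geom_tile() +
  facet_wrap(~country) +
  labs(x = "Year", y = "Ingredient", fill = "Frequency")
```

Treating year as a factor keeps one bar or tile per year instead of a continuous axis, which seems a reasonable default for a handful of years.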
Answering Question 2
How do macronutrients (protein, carbs, fats) compare across the highest and lowest rated cuisines?
Variables to be used:
calories, fat, protein, carbs, country, avg_rating
Filter for cuisine extremes: We plan to select a subset of recipes from the highest- and lowest-rated cuisines and plot their macronutrient proportions. First, group the dataset by country. Then average the avg_rating values of all recipes within each country to get the average rating for that cuisine, and store this value in a new column called cuisine_avg_rating. Finally, select the 5 highest-rated and the 5 lowest-rated cuisines.
Calculate proportions: We calculate macronutrient proportions using calories rather than grams because fat, protein, and carbohydrates contribute different amounts of energy per gram. Fat provides 9 calories per gram, while protein and carbs provide 4. By converting grams to calories and dividing by the total calories, we quantify each macronutrient’s contribution to the recipe’s total energy, enabling more accurate comparisons across recipes and over time.
Create proportion_fat column: multiply the grams of fat by 9 and divide by the calories. For example, if recipe A has 300 calories and 10 grams of fat, then (10 * 9) / 300 = 0.3 = 30%, so fat comprises 30% of the total calories of recipe A.
Create proportion_protein column: multiply the grams of protein by 4 and divide by the calories.
Create proportion_carbs column: multiply the grams of carbs by 4 and divide by the calories.
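The three proportion columns could be computed in a single mutate() call. The helper name add_macro_proportions and the toy recipes below are hypothetical; recipe A reuses the worked example above (300 calories, 10 grams of fat).

```r
library(dplyr)

# Calories per gram as stated in the plan: fat 9, protein 4, carbs 4.
add_macro_proportions <- function(df) {
  df |>
    mutate(
      proportion_fat     = fat * 9 / calories,
      proportion_protein = protein * 4 / calories,
      proportion_carbs   = carbs * 4 / calories
    )
}

# Made-up recipes for illustration only.
toy_recipes <- tibble::tibble(
  name     = c("recipe A", "recipe B"),
  calories = c(300, 200),
  fat      = c(10, 0),
  protein  = c(15, 10),
  carbs    = c(37.5, 40)
)

with_props <- add_macro_proportions(toy_recipes)
```

In real recipes the three proportions will not necessarily sum to exactly 1, since reported calories are rounded and may include other energy sources, and a recipe with zero calories would produce NaN or Inf, so those rows may need to be filtered out first.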
Summarize: Create a new summarized data frame.
Sort the countries by cuisine_avg_rating. Create a new column called top_vs_bottom: assign “Top” to the 5 highest-rated cuisines and “Bottom” to the 5 lowest-rated cuisines.
For each cuisine, compute the average of proportion_fat, proportion_protein, and proportion_carbs across all recipes in that cuisine, and store the results as avg_prop_fat, avg_prop_protein, and avg_prop_carbs.
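The summary step might look like the sketch below. It assumes the recipe-level proportion columns already exist, uses a made-up toy_props table with three fictional cuisines, and introduces a hypothetical n_extreme parameter set to 1 (the plan uses 5) so the toy example stays small.

```r
library(dplyr)

# Made-up recipe rows that already carry the three proportion columns.
toy_props <- tibble::tribble(
  ~country, ~avg_rating, ~proportion_fat, ~proportion_protein, ~proportion_carbs,
  "A",      4.8,         0.30,            0.20,                0.50,
  "A",      4.6,         0.40,            0.20,                0.40,
  "B",      4.0,         0.20,            0.30,                0.50,
  "C",      3.0,         0.50,            0.10,                0.40,
  "C",      3.4,         0.30,            0.30,                0.40
)

n_extreme <- 1  # the plan uses 5; 1 keeps the toy example small

cuisine_summary <- toy_props |>
  group_by(country) |>
  summarize(
    cuisine_avg_rating = mean(avg_rating, na.rm = TRUE),
    avg_prop_fat       = mean(proportion_fat, na.rm = TRUE),
    avg_prop_protein   = mean(proportion_protein, na.rm = TRUE),
    avg_prop_carbs     = mean(proportion_carbs, na.rm = TRUE)
  ) |>
  arrange(desc(cuisine_avg_rating)) |>
  mutate(
    # Rows outside both extremes get NA and are dropped below.
    top_vs_bottom = case_when(
      row_number() <= n_extreme      ~ "Top",
      row_number() > n() - n_extreme ~ "Bottom"
    )
  ) |>
  filter(!is.na(top_vs_bottom))
```

With 49 countries and n_extreme = 5 the two groups cannot overlap; for much smaller inputs the "Top" label would win any overlap because it is checked first.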
Potential visualizations and statistical analysis:
- Faceted bar charts
  - x: country
  - y: avg_proportion
  - fill: top_vs_bottom
  - facet: macronutrient
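To facet by macronutrient, the three avg_prop_ columns first need to be stacked into a single avg_proportion column. A minimal sketch, assuming a summarized table shaped like the hypothetical toy_summary below (values are made up):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up summary rows, one per cuisine.
toy_summary <- tibble::tribble(
  ~country, ~top_vs_bottom, ~avg_prop_fat, ~avg_prop_protein, ~avg_prop_carbs,
  "A",      "Top",          0.35,          0.20,              0.45,
  "C",      "Bottom",       0.40,          0.20,              0.40
)

# One row per cuisine and macronutrient.
plot_df <- toy_summary |>
  pivot_longer(
    cols = starts_with("avg_prop_"),
    names_to = "macronutrient",
    names_prefix = "avg_prop_",
    values_to = "avg_proportion"
  )

# Faceted bars: x = country, y = avg_proportion, fill = top_vs_bottom.
macro_bars <- ggplot(plot_df, aes(x = country, y = avg_proportion, fill = top_vs_bottom)) +
  geom_col() +
  facet_wrap(~macronutrient) +
  labs(x = "Country", y = "Average proportion of calories", fill = "Rating group")
```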