Genomics of Cuisine
Access our Shiny app here
Introduction
Every friend group has the person who eyes the menu suspiciously and quietly suggests going somewhere more familiar instead. Picking a restaurant that satisfies everyone is a surprisingly hard social problem, and at its core it may be a knowledge problem: most people do not know enough about unfamiliar cuisines, from unfamiliar dish names to ingredients they’ve never heard of, to feel confident trying them. Our project starts with this problem. We wanted to build a tool that could answer a simple but underexplored question: how similar is the food you already love to the food you’re never willing to try?
Food is an expression of culture through which we can examine history, geography, trade, and identity together. And yet, despite the obvious regional character of cuisines, the quantitative structure of ingredient use across cultures remains surprisingly underexplored outside of academic food science. What makes Italian cuisine taste Italian? How similar are Korean and Japanese cooking, really? And why do some cuisines that developed thousands of miles apart share such surprisingly similar ingredient profiles?
These questions motivated our project: an interactive Shiny dashboard that lets users explore the ingredient landscapes of 20 global cuisines, compare them against each other, and find the hidden structure underneath the food we eat.
Research questions guiding the project:
- What are the most distinctive ingredients for each cuisine, and how do cuisines differ in their ingredient diversity?
- Which cuisines are most culinarily similar, and does geographic proximity predict culinary similarity?
- How are ingredients connected and what does the resulting network look like?
- How does any individual ingredient distribute across cuisines, and what are its most common culinary partners?
- Where in the world do culinarily similar cuisines live, and do unexpected cross-cultural cousins exist?
Justification of Approach
The Data
The project draws on three complementary datasets. The primary source is the Kaggle “What’s Cooking?” dataset from Yummly, which contains 39,774 recipes labeled by cuisine, each with a list of ingredients as they appear in the recipe text. After loading and parsing the data, we found 6,714 unique ingredients across 20 cuisines. The second source is the FlavorDB ingredient taxonomy (02_Ingredients.csv), which maps each ingredient name to a category such as spice, herb, dairy, meat, or seafood. We used it to label ingredients by category in the Cuisine Explorer. The third source is the flavor compound backbone (backbone.csv) from Ahn et al. (2011), “Flavor network and the principles of food pairing,” published in Scientific Reports. This dataset encodes edges between ingredients that share flavor compounds, with edge weights reflecting the number of shared compounds. It powers the interactive ingredient network visualization.
Our Final Product
The primary deliverable is a multi-tab R Shiny application with six main tabs. Cuisine Overview summarizes the dataset: recipe counts per cuisine, average ingredients per recipe, a heatmap of pairwise Jaccard similarity across all cuisines, and a grid of bar charts showing each cuisine’s signature ingredients, ranked by TF-IDF.
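The pairwise Jaccard similarity behind the heatmap can be illustrated with a minimal sketch. It is written in Python for brevity rather than the R used in the app, the recipes are made-up toy data (not rows from the Kaggle dataset), and the function names are hypothetical.

```python
from itertools import combinations

# Toy recipes (hypothetical): (cuisine, ingredient list)
recipes = [
    ("italian", ["olive oil", "garlic", "basil", "tomato"]),
    ("italian", ["olive oil", "parmesan", "garlic"]),
    ("greek", ["olive oil", "feta", "garlic", "oregano"]),
    ("japanese", ["soy sauce", "miso", "rice"]),
]

def ingredient_sets(recipes):
    """Collect the set of distinct ingredients used by each cuisine."""
    sets = {}
    for cuisine, ingredients in recipes:
        sets.setdefault(cuisine, set()).update(ingredients)
    return sets

def jaccard(a, b):
    """Intersection over union of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_jaccard(sets):
    """Similarity for every unordered cuisine pair, keyed by sorted names."""
    return {
        (x, y): jaccard(sets[x], sets[y])
        for x, y in combinations(sorted(sets), 2)
    }

similarity = pairwise_jaccard(ingredient_sets(recipes))
for pair, score in similarity.items():
    print(pair, round(score, 3))
```

In this toy example, Greek and Italian share two of seven distinct ingredients, while Japanese shares none with either, which is the kind of contrast the heatmap is meant to surface.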
For the Cuisine Family Tree, we applied hierarchical clustering (UPGMA / average linkage) to binary ingredient vectors, grouping cuisines by overall ingredient overlap. The resulting dendrogram color-codes each cuisine by geographic region. We also added a scatter plot of geographic distance against Jaccard similarity to find surprisingly similar cuisine pairs: pairs that share many ingredients even though they are not geographically close. A companion table lists the top 15 cuisine pairs that combine high similarity with large distance.
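To make the clustering step concrete, here is a small pure-Python sketch of average-linkage (UPGMA) clustering. It is not the app's R code: the distances are invented toy values standing in for 1 minus Jaccard similarity, and the `upgma` function and its return format are assumptions made for illustration.

```python
def upgma(labels, dist):
    """Average-linkage agglomerative clustering.

    labels: list of item names.
    dist: dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of merges as (left_members, right_members, height).
    """
    clusters = [(name,) for name in labels]

    def average_distance(c1, c2):
        # Mean distance over all cross-cluster pairs of original items.
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters (ties broken by position).
        height, i, j = min(
            (average_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Toy distances (hypothetical), interpreted as 1 - Jaccard similarity.
toy = {
    ("greek", "italian"): 0.30,
    ("japanese", "korean"): 0.40,
    ("italian", "japanese"): 0.80,
    ("italian", "korean"): 0.90,
    ("greek", "japanese"): 0.85,
    ("greek", "korean"): 0.95,
}
dist = {frozenset(pair): d for pair, d in toy.items()}
merges = upgma(["greek", "italian", "japanese", "korean"], dist)
for left, right, height in merges:
    print(left, "+", right, "at", round(height, 3))
```

Each merge corresponds to one branch point in a dendrogram, with the merge height giving how far up the tree the two groups join.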
For the Cuisine Explorer, we wanted to let users look into a specific cuisine and compare it to another. A radar chart shows the cuisine’s ingredient category proportions against the dataset average, and a donut chart breaks its ingredients down by how exclusive they are, compared with ingredients found in most or all cuisines. Bar charts show each cuisine’s signature ingredients, ranked by TF-IDF, and the ingredients the two cuisines share.
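As a rough illustration of the numbers a radar chart like this needs, the sketch below computes category proportions for a list of ingredient mentions. It is a Python sketch, not the app's R code; the category mapping is a made-up stand-in for the FlavorDB taxonomy, and counting every mention (rather than each distinct ingredient once) is an assumption.

```python
from collections import Counter

# Hypothetical ingredient -> category mapping (stand-in for the taxonomy file).
category_of = {
    "garlic": "vegetable",
    "basil": "herb",
    "olive oil": "oil",
}

def category_proportions(mentions, category_of):
    """Share of ingredient mentions per category; unmatched names become 'Other'."""
    counts = Counter(category_of.get(name, "Other") for name in mentions)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

# Toy ingredient mentions for one cuisine (hypothetical).
mentions = ["garlic", "basil", "olive oil", "garlic", "truffle paste"]
proportions = category_proportions(mentions, category_of)
print(proportions)
```

Running the same function over all recipes would give the dataset-wide baseline to plot alongside a single cuisine.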
For the How Ingredients Connect tab, we wanted to try something new and built a network graph with visNetwork, a package we learned for this project. The nodes are ingredients, and the edges connect ingredients that share flavor compounds. We added two filters: one that limits how many edges are shown at a time, and one that removes pairs that are only weakly related by shared flavor compounds.
For the Ingredient Explorer tab, we wanted to let users search for or select any ingredient from the dataset and view how frequently it appears in each cuisine’s recipes, an insight summary, and its top ingredient pairs.
Finally, for the Global Cuisine Map tab, we used Leaflet, which was new to us, to give users a visualization that is new for them to interact with. Leaflet renders a clickable interactive map. Clicking a country shades all other cuisines by their Jaccard similarity to the selected one, with a ranked sidebar for easy comparison.
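The two network filters can be sketched as a threshold followed by a cap on the number of edges. This is a Python illustration, not the app's R code; the edge tuples and compound counts are invented, the column layout of backbone.csv is not assumed, and applying the threshold before the cap is a guess at the ordering.

```python
# Toy edges (hypothetical): (ingredient_a, ingredient_b, shared_compound_count)
edges = [
    ("tomato", "basil", 12),
    ("tomato", "parmesan", 30),
    ("basil", "mint", 45),
    ("coffee", "beef", 5),
    ("strawberry", "tomato", 20),
]

def filter_edges(edges, min_shared, max_edges):
    """Drop weak edges, then keep only the max_edges strongest ones."""
    strong = [edge for edge in edges if edge[2] >= min_shared]
    strong.sort(key=lambda edge: edge[2], reverse=True)
    return strong[:max_edges]

visible = filter_edges(edges, min_shared=10, max_edges=2)
print(visible)
```

Capping the edge count keeps the graph readable, while the threshold removes pairings that share only a few compounds.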
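A minimal sketch of the two lookups behind a view like this is shown below: the share of each cuisine's recipes that contain an ingredient, and its most frequent co-occurring ingredients. It is Python rather than the app's R, the recipes are toy data, and `ingredient_profile` is a hypothetical helper.

```python
from collections import Counter

# Toy recipes (hypothetical): (cuisine, ingredient list)
recipes = [
    ("italian", ["garlic", "olive oil", "tomato"]),
    ("italian", ["basil", "tomato"]),
    ("chinese", ["garlic", "soy sauce", "ginger"]),
    ("chinese", ["garlic", "soy sauce"]),
]

def ingredient_profile(recipes, target, top_n=2):
    """Return (share of recipes per cuisine containing target, top partners)."""
    totals, hits, partners = Counter(), Counter(), Counter()
    for cuisine, ingredients in recipes:
        totals[cuisine] += 1
        unique = set(ingredients)
        if target in unique:
            hits[cuisine] += 1
            partners.update(unique - {target})
    share = {cuisine: hits[cuisine] / totals[cuisine] for cuisine in totals}
    # Sort by count (descending), then name, so ties are deterministic.
    ranked = sorted(partners.items(), key=lambda kv: (-kv[1], kv[0]))
    return share, ranked[:top_n]

share, top_partners = ingredient_profile(recipes, "garlic")
print(share)
print(top_partners)
```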
Design Decisions and Paths Not Taken
We also wanted to document our design process and our choices about how to display the data. The first was using TF-IDF over raw counts. When we initially built the dashboard, our visualizations used raw counts, but we noticed that common ingredients like garlic, salt, and onion appeared in basically every cuisine and gave us no insight. Switching to TF-IDF surfaced more distinctive ingredients (fish sauce for Vietnamese, garam masala for Indian, miso for Japanese), and we used it across many tabs, including the signature ingredients on the overview page and the top ingredients in the Cuisine Explorer. This made our panels more informative.
The second was using Jaccard similarity over cosine similarity. We considered cosine similarity for the cuisine comparison matrix but chose Jaccard (intersection over union of ingredient sets) because we expected cosine similarity to be skewed toward cuisines that use a large set of ingredients, since there would simply be more overlapping items in general. Jaccard similarity better captures whether a cuisine uses an ingredient at all rather than how often it is used.
For the family tree, we chose a dendrogram over a PCA or UMAP scatterplot. While a scatterplot would have been more intuitive, the dendrogram shows the branches and clusters of cuisines more clearly, and these are harder to make out in a 2D scatterplot.
In terms of design choices, we originally wanted a color key for the continent where each cuisine originated, but the sheer number of colors made our charts difficult to read. For example, we initially color-coded the bar charts in the overview by continent (Asia, Europe, etc.), but the colors were overwhelming and added no insight. We switched those charts to a single solid color and kept color coding only where a chart needed it: ingredient categories in the network graph and regions in the family tree (dendrogram). We also experimented with how to organize the tabs and where to place each piece of information. We ended up combining cuisine exploration and comparison into one tab because they presented similar information and we felt there was an overwhelming number of tabs.
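The effect of moving from raw counts to TF-IDF can be seen in a small sketch. It is Python rather than the app's R, it uses toy recipes, and it assumes one common TF-IDF variant (each cuisine treated as a single document, term frequency as a share of mentions, and idf = log(N / df)); the exact weighting used in the dashboard may differ.

```python
import math
from collections import Counter

# Toy recipes (hypothetical): (cuisine, ingredient list)
recipes = [
    ("vietnamese", ["fish sauce", "garlic", "salt"]),
    ("vietnamese", ["fish sauce", "salt", "lime"]),
    ("vietnamese", ["garlic", "salt"]),
    ("indian", ["garam masala", "garlic", "salt"]),
    ("indian", ["garam masala", "onion", "salt"]),
    ("japanese", ["miso", "garlic", "salt"]),
]

def tf_idf_by_cuisine(recipes):
    """Return (raw counts, TF-IDF scores), each keyed by cuisine then ingredient."""
    counts = {}
    for cuisine, ingredients in recipes:
        counts.setdefault(cuisine, Counter()).update(ingredients)
    n_cuisines = len(counts)
    # Document frequency: in how many cuisines each ingredient appears.
    df = Counter()
    for counter in counts.values():
        df.update(counter.keys())
    scores = {}
    for cuisine, counter in counts.items():
        total = sum(counter.values())
        scores[cuisine] = {
            name: (n / total) * math.log(n_cuisines / df[name])
            for name, n in counter.items()
        }
    return counts, scores

counts, scores = tf_idf_by_cuisine(recipes)
raw_top = max(counts["vietnamese"], key=counts["vietnamese"].get)
tfidf_top = max(scores["vietnamese"], key=scores["vietnamese"].get)
print("raw count top:", raw_top)
print("TF-IDF top:", tfidf_top)
```

Because salt and garlic appear in every toy cuisine, their idf term is log(1) = 0, so they drop out of the ranking even though salt has the highest raw count.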
Limitations
In terms of limitations, we noticed representational bias. The Kaggle “What’s Cooking?” dataset comes from Yummly, a U.S.-based recipe platform. This creates two forms of bias. First, the cuisine labels themselves reflect Western categorizations of what counts as a distinct cuisine: “Cajun/Creole” is treated as coordinate with “Chinese” or “Indian,” despite representing a regional subculture of a single country. Second, recipes written for a primarily Western audience skew toward dishes accessible to home cooks in the United States, which could underrepresent more regional recipes within each cuisine. A future improvement could compare this dataset against non-English or regionally sourced recipe databases to assess how much the ingredient profiles shift.
Additionally, we had issues with ingredient normalization. Recipe ingredient lists are raw strings (for example, “large eggs,” “fresh garlic cloves,” and “low-sodium soy sauce”) that need normalization before analysis. Our pipeline lowercases strings and trims whitespace but does not perform full entity resolution. “Chicken breast,” “boneless chicken,” and “chicken thighs” are treated as separate ingredients, potentially fragmenting what should be a single signal. The FlavorDB alias table (02_Ingredients.csv) partially addresses this, but its coverage is incomplete; many ingredient strings go unmatched and fall into the “Other” category in the radar chart.
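The fragmentation problem is easy to reproduce. The sketch below (Python, with a hypothetical `normalize` helper standing in for the lowercase-and-trim step) shows that this level of normalization merges case and whitespace variants but still leaves related chicken ingredients as separate strings.

```python
def normalize(raw):
    """Lowercase and trim surrounding whitespace; no entity resolution."""
    return raw.strip().lower()

# Toy raw ingredient strings (hypothetical).
raw_strings = [
    "  Chicken Breast",
    "chicken breast ",
    "Boneless Chicken",
    "chicken thighs",
]

normalized = sorted({normalize(s) for s in raw_strings})
print(normalized)
```

The two spellings of chicken breast collapse into one entry, but three distinct chicken ingredients remain, each counted separately in any downstream similarity or TF-IDF calculation.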