An Examination of International Cuisines through Unsupervised Learning

Ben Sturm
Towards Data Science
8 min read · Jul 12, 2018

Typical Tyrolean Käsespätzle (Image courtesy: https://austria-forum.org/)

Like a lot of people, I’m a big fan of food. I was very lucky to be raised in a home where meals made from scratch were the norm. My mom did all of the cooking and because she immigrated to the US from Germany, I was exposed to a lot of delicious German dishes. Some of my favorites included Käsespätzle, Semmelknödel, and Sauerbraten. Although I never could claim to have the cooking talents of my mom, I do very much enjoy the process of making a meal from scratch, and of course sharing that meal with my own family.

With this as my background story, I thought it would be really interesting to conduct a data science project involving recipes from around the world. I wanted to see if I could learn something about the relationships between different cuisines throughout the world. To explore this topic, I gathered data for over 12,000 recipes representing 25 different cuisine types. I then used natural language processing (NLP) to convert the text data into a format that could be fed into a machine learning algorithm. Finally, I used principal component analysis (PCA) and topic modeling to gain insight into the data.

Data Collection

The recipe data I used for this project came from Yummly. I was granted a student license to their API (Thanks, Yummly!), which allowed me to run queries and search for recipes directly from an IPython notebook. Yummly supports searching by cuisine type. The following is the list of supported cuisines:

  • American, Italian, Asian, Mexican, Southern & Soul Food, French, Southwestern, Barbecue, Indian, Chinese, Cajun & Creole, English, Mediterranean, Greek, Spanish, German, Thai, Moroccan, Irish, Japanese, Cuban, Hawaiian, Swedish, Hungarian, Portuguese

In total, I downloaded approximately 500 recipes for each of the 25 supported cuisines. This led to ~12,500 different recipes. For the data collection, I used the Requests library to read in the data and the built-in JSON decoder to convert the JSON data into a Python dictionary. Then, it was relatively straightforward to convert the data into a Pandas DataFrame. A screenshot of a few selected rows of the DataFrame is shown below.

Selected rows of the Yummly recipe DataFrame

For my analysis, I used the columns corresponding to cuisine and ingredients only. All other columns were ignored.
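As a rough illustration of the decoding and column selection described above, here is a minimal sketch. It is not the code used in the project: the payload and its field names (`recipeName`, `rating`, and so on) are made up for the example and should not be read as the actual Yummly response schema.

```python
import json

import pandas as pd

# Hypothetical payload: the field names are illustrative assumptions,
# not the real Yummly API schema.
raw = """
[
  {"recipeName": "Kaesespaetzle", "cuisine": "German",
   "ingredients": ["flour", "eggs", "cheese", "onions"], "rating": 5},
  {"recipeName": "Pad Thai", "cuisine": "Thai",
   "ingredients": ["rice noodles", "fish sauce", "peanuts"], "rating": 4}
]
"""

records = json.loads(raw)                 # JSON text -> list of Python dicts
df = pd.DataFrame(records)                # one row per recipe
recipes = df[["cuisine", "ingredients"]]  # keep only the two columns used later
print(recipes)
```

When the data comes from an HTTP call made with Requests, `response.json()` performs the same decoding step as `json.loads` here.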

Text Data Processing and Machine Learning Workflow

Since the data consisted of text only, it was necessary to implement a number of preprocessing steps using NLP techniques. These steps are as follows:

  • Hyphenating certain ingredients (e.g., olive-oil, corn-starch)
  • Tokenization to split the ingredients into a list of words
  • Removing stop words and other frequently occurring words (e.g., salt, pepper, water)
  • Stemming by dropping plural forms of words and other suffixes
  • Bag-of-words processing to create a sparse matrix consisting of all the words in the ingredients list and how often they appear

To implement these steps, I used tools including TfidfVectorizer and CountVectorizer from sklearn. Some of the steps, such as hyphenation and stop word removal, required me to write my own code, since they were specific to this use case. The reader is encouraged to check out my GitHub repo for this project to learn more.
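The preprocessing steps listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the compound list, the stop word list, and the `stem` helper are hypothetical stand-ins (the stemmer is a crude plural stripper, not a real stemming algorithm), and the actual project code lives in the repo.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Illustrative assumptions, not the lists used in the project.
COMPOUNDS = {"olive oil": "olive-oil", "corn starch": "corn-starch"}
STOP_WORDS = {"salt", "pepper", "water"}


def stem(word):
    """Crude plural stripper standing in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # cherries -> cherry
    if word.endswith("oes") and len(word) > 4:
        return word[:-2]                # tomatoes -> tomato
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                # onions -> onion
    return word


def preprocess(ingredients):
    """Hyphenate compounds, tokenize, drop stop words, then stem."""
    text = " ".join(ingredients).lower()
    for phrase, joined in COMPOUNDS.items():
        text = text.replace(phrase, joined)
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text)
    return [stem(t) for t in tokens if t not in STOP_WORDS]


recipes = [
    ["olive oil", "tomatoes", "garlic cloves", "salt"],
    ["corn starch", "soy sauce", "garlic", "water"],
]
docs = [" ".join(preprocess(r)) for r in recipes]

# Tokens are already clean, so split on whitespace instead of the default
# token pattern (which would break hyphenated ingredients apart).
vectorizer = CountVectorizer(analyzer=str.split)
X = vectorizer.fit_transform(docs)      # sparse recipe-by-ingredient counts
print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

TfidfVectorizer accepts the same `analyzer` argument, so swapping it in yields TF-IDF weights instead of raw counts.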

The machine learning algorithms I focused on were all unsupervised. I tried k-means clustering to see if I could cluster recipes together based on cuisine type. However, clustering wasn’t very helpful for my analysis, because it was unclear what the different clusters represented. Instead, I focused my attention on principal component analysis (PCA) and Latent Dirichlet Allocation (LDA), which I discuss further in the Results section.
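For readers unfamiliar with the clustering step, a minimal sketch of k-means on a toy count matrix is shown below, along with one way to check which cuisines land in each cluster. The data, the number of clusters, and the cuisine labels are invented for the example. On real recipe data the per-cluster breakdown is typically far more mixed than this toy case.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

# Toy recipe-by-ingredient counts (assumed data, two obvious groups).
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 0, 2, 2],
], dtype=float)
cuisines = ["Italian", "Italian", "Greek", "Thai", "Chinese", "Thai"]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Count how many recipes of each cuisine fall into each cluster.
composition = {}
for label, cuisine in zip(labels, cuisines):
    composition.setdefault(int(label), Counter())[cuisine] += 1

for label, counts in sorted(composition.items()):
    print(label, dict(counts))
```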

Results

To visualize my data, I used dimensionality reduction to shrink the feature space from 1982 dimensions, which corresponds to the number of different ingredients in my dataset, to 2 dimensions. This step was done using PCA, retaining the top two principal components. Then, I created a scatter plot of every recipe along the first and second principal components, which is displayed below.

A scatter plot consisting of all 12,492 recipes along the first and second principal components.
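The reduction to two components can be sketched as follows. The matrix here is synthetic (random Poisson counts standing in for the recipe-by-ingredient matrix), so the numbers it prints say nothing about the real data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the recipe-by-ingredient count matrix.
rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(200, 50)).astype(float)

pca = PCA(n_components=2)
coords = pca.fit_transform(X)   # one (pc1, pc2) pair per recipe

print(coords.shape)
print(pca.explained_variance_ratio_)
```

One practical note: a sparse matrix from CountVectorizer may need to be converted with `.toarray()` first, since older scikit-learn versions only accept dense input for PCA.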

When plotting all of the recipes in this 2-D principal component space, I didn’t learn a lot, because many of the data points were overlapping, so it was difficult to see any structure in the data. However, by grouping the recipes based on the cuisine and taking the centroid values along the first two principal components, I could see some interesting structure in the data. A plot of this is shown below.

A plot of the centroid values for each of the different cuisines along the first and second principal components. Group (A) is associated with Asian cuisines, (B) consists of Japanese and Hawaiian cuisines, (C) and (D) are European and American cuisines, respectively. Group (E) is a mixed bag of cuisines from all over the world including Cuban, Mexican, Indian, and Spanish.

The plot above provides some interesting insight into the relationships between different cuisines. We can observe that the centroids tend to group together for similar cuisine types. For instance, group (A) consists of Chinese, Thai, and Asian, which could all be classified as Asian foods. Group (B) consists of Japanese and Hawaiian cuisines. Both of these cuisines place a strong emphasis on fish, so it makes sense that they are closely grouped together. Group (C) consists exclusively of European cuisines such as Swedish, French, and German, and not very far away is group (D), which consists mostly of North American cuisines. These include Southern, Barbecue, and traditional American. Lastly, group (E) is a mixed bag of many different cuisines from all over the world. These include Cuban, Mexican, Indian, Spanish, and Southwestern. When I think of these cuisines, I think of big, bold flavors, so it makes perfect sense that these cuisines would be closely grouped together.
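The centroid calculation behind this plot amounts to averaging the principal component coordinates within each cuisine. A minimal sketch, using made-up coordinates and hypothetical column names `pc1` and `pc2`:

```python
import numpy as np
import pandas as pd

# Made-up principal component coordinates for four recipes.
coords = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [-1.0, 0.0],
    [-3.0, 2.0],
])
cuisines = ["Thai", "Thai", "French", "French"]

df = pd.DataFrame(coords, columns=["pc1", "pc2"])
df["cuisine"] = cuisines

# One centroid per cuisine: the mean position along each component.
centroids = df.groupby("cuisine")[["pc1", "pc2"]].mean()
print(centroids)
```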

A question the reader may ask is, which features (i.e., ingredients) are most strongly linked to the first and second principal components? This can be visualized in the following figure.

A plot demonstrating the ingredients most strongly associated with the first and second principal components.

This plot provides a visual representation of the dominant features for each of the two principal components. Ingredients such as chicken, garlic, onion, and tomato have strong associations along the positive direction of component one. These flavors have strong ties to cuisines such as Spanish or Indian. On the other hand, eggs, butter, flour, milk, and sugar have strong associations along the negative direction of component one. These ingredients are typically found in French or English dishes. Likewise, soy, sauce, and rice have strong associations with the positive direction of component two. These ingredients are common in Asian cuisines. Finally, cheese, lemon, olive oil, and tomato have strong associations with the negative direction of component two. These flavors are very common in Italian and Greek cuisines. This figure helps to explain the structure of the previous figure, namely why certain cuisines were clustered together in particular regions when plotted along the first and second principal components.
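The ingredient associations described above come from the component loadings, which scikit-learn exposes as `pca.components_`. The sketch below uses synthetic data with two invented recipe groups (baking-style and savory-style) to show how the most positive and most negative ingredients on a component can be read off. Note that the sign of a principal component is arbitrary, so which group ends up on the positive side carries no meaning by itself.

```python
import numpy as np
from sklearn.decomposition import PCA

features = np.array(["butter", "flour", "sugar", "garlic", "onion", "tomato"])

# Synthetic counts: 50 baking-style and 50 savory-style recipes (assumed data).
rng = np.random.default_rng(0)
baking = rng.poisson([2, 2, 2, 0.1, 0.1, 0.1], size=(50, 6))
savory = rng.poisson([0.1, 0.1, 0.1, 2, 2, 2], size=(50, 6))
X = np.vstack([baking, savory]).astype(float)

pca = PCA(n_components=2).fit(X)
loadings = pca.components_[0]      # weight of each ingredient on component one

order = np.argsort(loadings)       # most negative first, most positive last
print("most negative:", list(features[order[:3]]))
print("most positive:", list(features[order[-3:]]))
```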

Finally, I also ran an LDA model in order to do topic modeling (see below). I was curious to see if it was possible to separate out the different ingredients based on the cuisine that they typically belong to. I specified the number of topics to be 25, because I knew there were 25 different cuisines represented in my dataset. The results of LDA were a bit messy, however. In certain cases, the LDA topics were particular cuisines such as Italian or Thai. However, some of the topics were different categories of dishes such as desserts, sauces, or even cocktails. Although this result was not what I had intended, in retrospect it actually makes perfect sense. LDA is a machine learning technique that identifies groups of words that frequently appear together. So, in a corpus of over 12,000 recipes, groups of words might be more strongly associated with the type of dish (e.g., dessert, soup, salad, or sauce) than with the type of cuisine.

Results from LDA topic modeling.
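A minimal sketch of this kind of topic modeling with scikit-learn is shown below. The six toy "recipes" and the choice of two topics are assumptions made for the example (the project used 25 topics on the full dataset), and the topics found on such a tiny corpus should not be over-interpreted.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each string is one recipe's preprocessed ingredient list.
docs = [
    "sugar butter flour egg vanilla",
    "flour sugar egg milk butter",
    "sugar vanilla milk egg chocolate",
    "soy sauce rice ginger garlic",
    "rice soy ginger scallion sesame",
    "garlic ginger soy sesame rice",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-recipe topic proportions

# Map column indices back to ingredient names.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Print the highest-weighted ingredients for each topic.
for k, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}:", ", ".join(top))
```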

Closing Thoughts

I had a lot of fun exploring the recipe dataset, because I really enjoyed combining my love for food with the new skills I have obtained in data science. One could also make a strong business case for this type of analysis, since this information could be used to provide recipe recommendations to users of the Yummly platform. For instance, somebody who really enjoys barbecue food may also really enjoy Portuguese cuisine, since these two cuisine types overlap when plotting their centroids along the first and second principal components. A relationship such as this is one I wouldn’t have foreseen without exploring the data.

It’s important to disclose, however, that the data used in this blog post is not without flaws. For instance, Yummly is basically a website that aggregates recipes from many other recipe sites or blogs, all of which are in English. Hence, many of these recipes might be an American’s take on other cuisine types. I’m pretty sure that my Italian friends would say that a recipe combining chicken and pesto such as this is “not Italian!”, even though Yummly tags this recipe as Italian. A possibly better way of dealing with this issue would be to find recipes in their native language and use some advanced translation algorithms to translate them into English. However, since ingredients can be so specific to a particular geographic region, this may also cause unintended problems. For instance, certain ingredients may not have an English equivalent (e.g., Prosciutto di Parma, Parmigiano Reggiano) and therefore these ingredients will always be associated with Italian cuisine. This could result in finding fewer similarities between the different types of cuisines, but it is still something I believe is worth exploring.

Thanks so much for reading my blog post. If you’d like to explore my code for this project, you can find it at the following GitHub repo. Also, feedback would be very much appreciated, so please feel free to reach out to me or give me a “Clap” if you enjoyed this post.

Data Scientist with significant experience gained during the 12-week Metis Data Science immersive. Holds a Ph.D. in Nuclear Engineering.