Embedding “Contexts” Into Recipe Ingredients

Using Context Vectors Of Ingredients To Classify Cuisines

Nishant Mohan
Towards Data Science


Images: Pixabay

Looking at a set of ingredients, we can identify where the cuisine is from.

But, can machines do the same?

If I convert the ingredient list to a simple bag-of-words matrix, I get a one-hot-encoded matrix, because each ingredient appears at most once in a given recipe. Using simple logistic regression, I get a classification accuracy of 78% on the Yummly recipes dataset without even cleaning the text. Interestingly, tree-based methods and neural nets have also been applied successfully to this task, with good results.
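The bag-of-words baseline can be sketched as follows. This is a minimal illustration, not the project's exact code: a handful of toy recipes stand in for the Yummly data, and logistic regression is the assumed classifier.

```python
# Minimal sketch of the bag-of-words baseline; toy recipes stand in
# for the Yummly dataset (in the real task, load the recipes JSON).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

recipes = [
    ["tomato sauce", "garlic", "basil", "olive oil"],
    ["soy sauce", "ginger", "scallion", "rice"],
    ["tomato sauce", "oregano", "mozzarella"],
    ["soy sauce", "sesame oil", "rice", "ginger"],
]
cuisines = ["italian", "chinese", "italian", "chinese"]

# Join each recipe's ingredients into one string; binary=True gives the
# one-hot effect described above, since an ingredient appears at most once.
docs = [" ".join(r) for r in recipes]
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)

clf = LogisticRegression(max_iter=1000).fit(X, cuisines)
print(clf.predict(vec.transform(["garlic tomato sauce mozzarella"])))
```

With the full dataset, the same pipeline (fit on a train split, score on a held-out split) is what produces the 78% figure quoted above.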

Can we use Word-Embeddings to improve those results?

Since NLP is schway these days, I wondered whether word-embedding vectors could be used for the classification task. I worked on answering this question in one of the group projects in my master's program.

First, we cleaned the ingredient list by removing stop words, symbols, and quantity markers, and by keeping only nouns. This is how it looked:

Before cleaning: 
['(2 oz.) tomato sauce', 'ground black pepper', 'garlic', 'scallions', 'chipotles in adobo', 'dried thyme', 'instant white rice', 'cilantro leaves', 'coconut milk', 'water', 'red beans', 'chopped celery', 'skinless chicken thighs', 'onions', 'lime wedges', 'carrots']
After cleaning:
['tomato sauce', 'pepper', 'garlic', 'scallion', 'chipotle adobo', 'thyme', 'rice', 'cilantro leaf', 'coconut milk', 'bean', 'celery', 'skinless chicken thigh', 'onion', 'lime wedge', 'carrot']
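A rough sketch of this cleaning step is below. The actual project used POS tagging to keep only nouns (plus lemmatization for singulars like "scallions" → "scallion"); here a regex strips quantities and symbols, and a small illustrative stoplist (`STOP_MODIFIERS` is my own, not from the project) drops modifiers, which approximates the before/after example above.

```python
# Approximate cleaning step: regex for quantities/symbols, plus a tiny
# hand-made stoplist of modifiers (illustrative, not the project's list).
import re

STOP_MODIFIERS = {"ground", "black", "dried", "instant", "white", "chopped", "in"}

def clean_ingredient(text: str) -> str:
    text = re.sub(r"\([^)]*\)", " ", text)           # drop "(2 oz.)"-style quantities
    text = re.sub(r"[^a-zA-Z ]", " ", text).lower()  # drop symbols and digits
    words = [w for w in text.split() if w not in STOP_MODIFIERS]
    return " ".join(words)

print(clean_ingredient("(2 oz.) tomato sauce"))  # -> "tomato sauce"
print(clean_ingredient("ground black pepper"))   # -> "pepper"
```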

Next, we used Gensim's Word2Vec implementation to get a 300-dimensional vector for each unique ingredient in the data. We then formed a vector for each recipe by adding the vectors of its ingredients, giving one vector per recipe. Let's see how those vectors look when projected into 2-D using t-SNE:

Vectors Represented in 2-D
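A plot like the one above can be produced roughly as follows; `vecs` stands for the matrix of summed recipe vectors and `labels` for each recipe's cuisine (random stand-ins here), with each point colored by cuisine in the real figure.

```python
# Rough sketch of the 2-D projection with t-SNE; random vectors stand in
# for the real (n_recipes, 300) matrix of summed recipe vectors.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vecs = rng.normal(size=(50, 300))                    # stand-in recipe vectors
labels = rng.choice(["italian", "chinese"], size=50)  # stand-in cuisines

# perplexity must be smaller than the number of samples; 30 is sklearn's default.
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vecs)
print(xy.shape)  # (50, 2): one 2-D point per recipe, ready to scatter-plot
```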

While there is a lot of noise, we do see some clusters. For instance, Chinese, Thai, Japanese, and Korean cuisines cluster together in the bottom left. An interesting fact I discovered just now while writing this: Moroccan and Indian cuisine share some similarities! This is evident from the bottom-right portion of the image above.

Moving on, we used this vector representation, which supposedly captures the "context" of a recipe, for classification. With the cleaned recipe list and the context-vector representation, simple logistic regression gives a classification accuracy of only 65%, lower than the 78% baseline.

Sad, eh?

So I looked in detail at some cuisines and their top ingredients.

I observed that recipes that were actually French were misclassified as Italian most often. That seemed reasonable given the class imbalance: Italian cuisine has the highest presence in the data. A more notable observation, however, is that Italian recipes were misclassified as French most often. This was surprising, since French cuisine does not have a high presence in the data. Also, Italian was rarely misidentified as Irish cuisine, but French frequently was!

We, therefore, took a closer look at the top ingredients of these three cuisines.

Top Ingredients for French, Italian and Irish Cuisine

It can be observed that French cuisine shares two ingredients, oil and clove, with Italian cuisine. French cuisine also shares two main ingredients, cream and butter, with Irish cuisine, although a small fraction of butter also contributes to Italian. Interestingly, Italian and Irish cuisine share few ingredients beyond those common to all three cuisines: pepper, onion, egg, and others. This strengthens our observation that French cuisine has similarities with both Italian and Irish cuisine, while Italian and Irish cuisine are not as similar to each other. Misclassification between Italian and French is therefore natural, since the two closely resemble each other in the ingredients their recipes use.

So, since the classification accuracy fell from 78% to 65%, are word embeddings bad?

Probably.

Probably Not.

There are caveats.

For starters, the data has imbalanced classes: Italian and Mexican cuisine occur more frequently than others. However, this would have affected the baseline too!

Possibly, a more thorough (or maybe less thorough?) cleaning of the data is needed. Our initial approach was to keep only nouns. Maybe we need to keep other parts of speech too: perhaps adverbs and conjunctions, which did not seem worth keeping, actually played a part?

Maybe the vectors need more tuning: the hyperparameters used to build the word vectors may need optimization. We added the individual word vectors in a recipe to form the recipe vector; maybe we should normalize the recipe vector, considering that the number of ingredients per recipe varies considerably.
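One simple form of that normalization is to average instead of sum, which removes the dependence on recipe length. A minimal sketch, with constant stand-in vectors so the effect is easy to see:

```python
# Averaging instead of summing removes the dependence on recipe length:
# a 2-ingredient and a 10-ingredient recipe with the same ingredient
# vectors end up with the same recipe vector.
import numpy as np

def recipe_vector(ingredient_vectors, normalize=True):
    v = np.sum(ingredient_vectors, axis=0)
    if normalize:
        v = v / len(ingredient_vectors)  # mean of the ingredient vectors
    return v

short = [np.ones(3)] * 2    # 2-ingredient recipe
long_ = [np.ones(3)] * 10   # 10-ingredient recipe
print(recipe_vector(short), recipe_vector(long_))  # both [1. 1. 1.]
```

L2-normalizing the summed vector would be another reasonable choice; which works better is an empirical question.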

Finally, it should be noted that the analysis of cuisine composition above shows the vectors are indeed successful at clustering similar cuisines together, and it explains the misclassifications. This hints that word embeddings are well suited to clustering applications, while classification needs more work.

Anyway, getting poor results is also good research, right?

Find Code @: https://github.com/mohannishant6/Recipe-Ingredients-as-Word-Embeddings

Connect with me on LinkedIn!

Check out some of my interesting projects on GitHub!
