AI and Creating the World’s Ultimate Chocolate Chip Cookies

Dabbing for Data
Towards Data Science
6 min read · Aug 16, 2017

by JD

As a self-proclaimed chocoholic, I rely on a lot of chocolate to get me through my days, my favorite delivery system being chocolate chip cookies. I am always on the lookout for new ways to bake my favorite food, but there are so many recipes online, each one claiming to be “the best”. How do I choose? Instead of picking one, I decided to apply some data science and machine learning techniques to combine several of “the best” recipes into “the ultimate” recipe.

The Data

The hardest part of this project, as is usually the case, was obtaining and cleaning the data. Using my favorite language, Python (along with BeautifulSoup), I scraped allrecipes.com for as many chocolate chip cookie recipes as I could get my hands on. (All Recipes eventually blocked me because I sent too many requests :( but I still got 30 recipes! Suck it.) Since the procedure for baking chocolate chip cookies remains relatively constant regardless of the ingredients, I focused only on the composition of ingredients rather than the process. Just put the ingredients in a bowl and stir, duh. For each recipe, I parsed the HTML, then scrubbed and scrubbed and scrubbed the data to get a list of ingredients and their respective measurements. From this, I encoded each recipe as a normalized vector in n-dimensional space, where n is the number of unique ingredients across all recipes. For example, the recipe:

1 cup butter, softened
1 cup white sugar
1 cup packed brown sugar
2 eggs
1 teaspoon vanilla extract
2 cups all-purpose flour
2 1/2 cups rolled oats
1/2 teaspoon salt
1 teaspoon baking powder
1 teaspoon baking soda
2 cups semisweet chocolate chips
4 ounces milk chocolate, grated
1 1/2 cups chopped walnuts

Might look like the vector:

0.102564102564 ,
0.0512820512821 ,
0.0512820512821 ,
0.0512820512821 ,
0.0512820512821 ,
0.102564102564 ,
0.205128205128 ,
0.0512820512821 ,
0.102564102564 ,
0.025641025641 ,
0.102564102564 ,
0.0512820512821 ,
0.0512820512821 ,

In reality, our vectors will be sparser: they will have a zero for every ingredient in the corpus that does not appear in the given recipe. But you get the point.
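To make the encoding concrete, here is a minimal sketch of how ingredient lines could be turned into normalized vectors. It is not the code used for this project: the helper names `parse_ingredient` and `encode`, the short `UNITS` list, and the decision to skip unit conversion entirely are all assumptions made for illustration. The example vector above sums to 1, so the sketch normalizes each recipe by the sum of its quantities.

```python
from fractions import Fraction

# Unit words to strip from the front of an ingredient name (an assumed,
# incomplete list; units are NOT converted to a common measure in this sketch).
UNITS = {"cup", "cups", "teaspoon", "teaspoons",
         "tablespoon", "tablespoons", "ounce", "ounces"}


def parse_ingredient(line):
    """Split a line such as '2 1/2 cups rolled oats' into (2.5, 'rolled oats')."""
    tokens = line.split(",")[0].split()  # drop notes such as ', softened'
    quantity = Fraction(0)
    i = 0
    # Sum the leading numeric tokens, so '2 1/2' becomes 5/2.
    while i < len(tokens):
        try:
            quantity += Fraction(tokens[i])
        except ValueError:
            break
        i += 1
    if i < len(tokens) and tokens[i] in UNITS:
        i += 1  # skip the unit word
    return float(quantity), " ".join(tokens[i:])


def encode(recipes):
    """Return (vocabulary, vectors); each vector's entries sum to 1."""
    parsed = []
    for recipe in recipes:
        amounts = {}
        for line in recipe:
            quantity, name = parse_ingredient(line)
            amounts[name] = amounts.get(name, 0.0) + quantity
        parsed.append(amounts)
    # One dimension per unique ingredient name across all recipes.
    vocabulary = sorted({name for amounts in parsed for name in amounts})
    vectors = []
    for amounts in parsed:
        total = sum(amounts.values()) or 1.0  # guard against empty recipes
        # Ingredients missing from a recipe get 0, which makes vectors sparse.
        vectors.append([amounts.get(name, 0.0) / total for name in vocabulary])
    return vocabulary, vectors


recipes = [
    ["1 cup butter, softened", "2 eggs", "2 1/2 cups rolled oats"],
    ["1 cup butter", "4 eggs", "2 cups semisweet chocolate chips"],
]
vocabulary, vectors = encode(recipes)
for vector in vectors:
    print([round(x, 3) for x in vector])
```

Because units are ignored here, a teaspoon of vanilla counts as much as a cup of butter. A real pipeline would need some conversion to a common measure before normalizing.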

Finally! The data had gone from rough and dirty to digital and clean. Let the fun begin!

Exploratory Analysis

When trying to combine recipes to make a better one, a technique that comes to mind is clustering, which 99% of the time implies K-Means. I wanted to use the cluster centers, which combine information from several recipes, to create new recipes. K-Means is easy to understand and easy to implement, but the difficult step is choosing K, the number of clusters. If you look at a two-dimensional plot like the one below, it is often easy to see the number of clusters.

But how do you pick the best number of clusters when you have 4-dimensional data? 5? 68? 70? In the case of cookies, this part is just as much art as it is science. We want a diversified set of recipes, but we don’t want recipes that are too close to the original ones. To solve this problem, I turned to a slightly different type of clustering called Hierarchical Clustering. This method records the distance between clusters at every level of a hierarchy, from one big cluster containing every recipe down to clusters of a single recipe. The resulting visualization is pretty awesome:

From this we can immediately see that there are a few outliers at recipe indices 1 and 22 (and possibly 24, 0, and 7). Let’s see what these recipes look like:

Recipe Index 1:

3 egg whites
3/4 cups semisweet chocolate chips
4 and 1/4 tablespoons unsweetened cocoa powder
1 and 1/2 teaspoons vanilla extract
3/4 cups white sugar

Recipe Index 22:

1 cup butter
2 and 1/4 cups chocolate cake mix
4 eggs
2 cups semisweet chocolate chips

Who uses chocolate cake mix for cookies??!!??? It doesn’t take a chocolate chip cookie aficionado to see the problem here. These recipes are outliers, or should I say “out-liars”, and the clustering algorithm caught them in the act. If we ignore these posers, there seem to be three clusters of data, give or take.
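For readers curious how hierarchical clustering exposes outliers like these, here is a minimal, dependency-free sketch of the agglomerative (bottom-up) variant. The average-linkage criterion, the Euclidean distance, and the toy 2-D points are assumptions for illustration; the article's actual settings are not stated, and in practice a library such as SciPy (`scipy.cluster.hierarchy`) would do the merging and draw the dendrogram.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(vectors):
    """Average-linkage agglomerative clustering.

    Returns the merge history as (distance, members_a, members_b) tuples,
    in the order in which the merges happen.
    """
    clusters = [[i] for i in range(len(vectors))]  # start: one cluster per point
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(euclidean(vectors[p], vectors[q])
                        for p in clusters[i] for q in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((d, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


# Toy data: two tight groups plus one point far away from everything (index 5).
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [1.0, 1.0], [1.1, 1.0],
          [5.0, 5.0]]
history = agglomerate(points)
for distance, a, b in history:
    print(f"{distance:.2f}  {a} + {b}")
```

An outlier shows up as a single point that only joins the rest in one of the last merges, at a much larger distance than the merges before it. The number of clusters can then be read off by choosing a distance at which to cut the hierarchy.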

Three seems like a good number of clusters! Now we can go ahead and run K-Means. With the idea of using the cluster centers as the basis for recipes, we can apply a vector-to-text function to get a textual description of these new AI-generated recipes. This function takes a vector, scales it up so that the numbers are a reasonable size, and rounds each one to the nearest 1/4. This way we can have 2 1/2 cups of flour in our recipe rather than 2.469 cups. Now, let’s see what the AI chef cooked up today!

Recipe #1: “Humpty Dumpty”

1 and 1/2 cups all-purpose flour
1/4 teaspoons baking powder
3/4 teaspoons baking soda
1/4 cups butter
1 egg
1/4 egg yolk
1/4 teaspoons ground cinnamon
1/4 large egg
1/4 cups macadamia nuts
1/2 cups packed brown sugar
1/2 teaspoons salt
1 and 1/4 cups semisweet chocolate chips
1/4 cups sifted all-purpose flour
1/4 cups unsalted butter
1 tablespoons vanilla extract
1/2 cups white sugar

Hmm, this recipe calls for a lot of different types of butter, and eggs in strange proportions. The text cleaning didn’t account for small differences in ingredient names, so “butter” and “unsalted butter” got separated. Also, who uses 1/4 egg yolk? Overall, though, this recipe doesn’t seem half bad! The concoction of ingredients seems reasonable, and even includes a hint of cinnamon!

Recipe #2: “Sugary Oats”

1 and 1/2 cups all-purpose flour
3/4 teaspoons baking soda
1/2 cups butter
1 and 1/2 eggs
1/4 cups packed brown sugar
1/4 cups packed light brown sugar
1/4 cups quick-cooking oats
1/4 cups rolled oats
1/2 teaspoons salt
1 cups semisweet chocolate chips
1 tablespoons vanilla extract
1/4 cups water
1/4 cups white chocolate chips
1/2 cups white sugar

Ah yes, this has everything you need for a great recipe: sugar, chocolate, and more sugar. Again, the problem of differing ingredient names shows up here. Perhaps in a future version I could include a feature that groups together ingredients if their names are similar, maybe using a DP algorithm, but that’s another story.
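As a rough idea of what such a grouping feature might build on, here is a sketch that assumes “DP” means dynamic programming and uses one classic example of it: Levenshtein edit distance, computed over whole words rather than characters. The function name `token_edit_distance` and the choice of word-level comparison are assumptions, not something from this project.

```python
def token_edit_distance(a, b):
    """Levenshtein distance between two ingredient names, counted in words."""
    x, y = a.split(), b.split()
    # dist[i][j] = edits needed to turn the first i words of x
    # into the first j words of y.
    dist = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        dist[i][0] = i  # delete all i words
    for j in range(len(y) + 1):
        dist[0][j] = j  # insert all j words
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[-1][-1]


pairs = [
    ("butter", "unsalted butter"),
    ("packed brown sugar", "packed light brown sugar"),
    ("semisweet chocolate chips", "white chocolate chips"),
    ("butter", "white sugar"),
]
for a, b in pairs:
    print(a, "|", b, "->", token_edit_distance(a, b))
```

A distance of 1 catches “butter” versus “unsalted butter”, but it also puts “semisweet chocolate chips” and “white chocolate chips” one edit apart. Any merging rule built on this would need a threshold and probably a list of words that are safe to ignore.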

Recipe #3: “The Minimalist”

3/4 cups all-purpose flour
3/4 teaspoons baking powder
4 cups butter
1/4 cups confectioners' sugar
1 and 1/4 eggs
1/4 teaspoons salt
1 and 1/4 cups semisweet chocolate chips
3/4 tablespoons vanilla extract
3/4 cups white sugar

Of the three recipes created, this one is my favorite! It has a reasonable composition of ingredients that makes sure to include the essentials (though I would note that you can never have enough chocolate chips in your cookies).
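For anyone who wants to try the same trick, the step that produced these three recipes (K-Means centers, then scaling and rounding to the nearest 1/4) can be sketched roughly as follows. This is a plain-Python illustration, not the original code: the function names, the `scale` parameter, the seeded random initialization, and the toy vocabulary with units baked into the names are all assumptions. A library implementation such as scikit-learn's `KMeans` would be the more usual choice.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns the k cluster centers."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest center.
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: squared_distance(v, centers[c]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        new_centers = [
            [sum(column) / len(group) for column in zip(*group)] if group else centers[c]
            for c, group in enumerate(groups)
        ]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers


def format_quarters(quarters):
    """Render a count of quarters, e.g. 10 -> '2 and 1/2'."""
    whole, rest = divmod(quarters, 4)
    fraction = {0: "", 1: "1/4", 2: "1/2", 3: "3/4"}[rest]
    if whole and fraction:
        return f"{whole} and {fraction}"
    return str(whole) if whole else fraction


def vector_to_text(center, vocabulary, scale):
    """Scale a center up and round every amount to the nearest 1/4."""
    lines = []
    for amount, name in zip(center, vocabulary):
        quarters = round(amount * scale * 4)
        if quarters == 0:
            continue  # the ingredient rounds away to nothing
        lines.append(f"{format_quarters(quarters)} {name}")
    return lines


# Toy data: four "recipes" over three ingredients (hypothetical values).
vocabulary = ["cups flour", "cups butter", "teaspoons cinnamon"]
vectors = [[0.6, 0.4, 0.0], [0.7, 0.3, 0.0],
           [0.2, 0.7, 0.1], [0.3, 0.6, 0.1]]
for center in kmeans(vectors, k=2):
    print(vector_to_text(center, vocabulary, scale=4))
```

Averaging is also where the oddities come from: if only some recipes in a cluster use an ingredient, the center holds a small nonzero amount of it, and rounding turns that into things like 1/4 egg yolk.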

Final Thoughts

Using the K-Means algorithm for selecting ingredients and their proportions works well for foods and drinks in which the method for combining the ingredients is relatively simple. However, for foods in which the order of steps matters, a more sophisticated algorithm must be used. But that’s a story for another time.
