Fun with AutoDiff
I stumbled upon the puzzle of the Three Dutchmen. Being somewhat familiar with automatic differentiation and gradient descent, I immediately had an idea. It would not be easy, of course, but how hard can it be?
Thanks to Dr. VK for introducing me to this fun puzzle of the Three Dutchmen.

About
This post describes one way to solve the puzzle of the Three Dutchmen using gradient descent. It is not the most efficient method for solving it. It is just for fun 🙂 An application further down the post lets you play with the problem using different optimizers.
We will not delve into the details of automatic differentiation or gradient descent but instead play with them to solve the math puzzle. The post shows a silly way to use gradient descent outside of machine learning.
The Dutchmen and Their Hogs

The Three Dutchmen is a mathematical puzzle. Here is a short version:
There are three married couples. They are:
- The husbands: Hendrick, Claas, and Cornelius
- The wives: Geertruii, Catriin, and Anna
You do not know the pairing of the couples, but here are some hints:
1) Hendrick bought 23 hogs more than Catriin
2) Claas bought 11 hogs more than Geertruii
3) shillingsPaidByPerson(hogsBoughtByPerson) = hogsBoughtByPerson²
4) Each husband paid 21×3 shillings more than his wife
Q: Who is married to whom?
We will solve this by using automatic differentiation and gradient descent, resembling the training of a neural network. We will need the following:
- a way to measure, given a set of hog counts for each person, how far we are from the answer: an error function
- a way to figure out how each hog count affects the error: automatic differentiation
- a way to modify the input hog counts, knowing their effect on the error: a gradient descent algorithm, an optimizer
The application embedded in this post contains all of this. You may try it out further down the post.
The Golden Differentiation Hammer

Automatic differentiation, or simply autodiff, is an algorithm for calculating the derivative of a function[1]. Given a differentiable function with a set of input variables and the resulting output value, it can calculate the partial derivatives of the output with respect to each input, without needing any hand-written derivative code. I like to think of these derivative values as measurements of how each input influenced the output.
Autodiff is, in my view, a great hammer. There are many fields[2] where it plays an important role. You might be familiar with its use for calculating the derivative[3] when training neural networks. When there is more than one input, the collection of derivative values is also called the gradient[4].
In our case, given a set of hog counts for each person, we need to find out:
- How large is the error?
- Most importantly, how did the hog count of each person affect the error?
A sensible error function combined with autodiff can answer both of these questions. The skeleton of our error function can be defined like this:
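Here is a minimal sketch of what such a skeleton could look like, written with plain math operators as in the rest of the post's pseudocode (the function and variable names are placeholders of my own):

```javascript
// Takes one hog count per person and returns a single non-negative error value.
// Each hint contributes one squared-error term; zero error means all hints hold.
function error(hogs) {
  const { hendrick, claas, cornelius, geertruii, catriin, anna } = hogs;
  let err = 0;
  // ...one term per hint goes here, for example:
  // err += (hendrick - catriin - 23) ** 2;   // hint 1
  return err;
}
```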
The calculations in this error function can be an implementation of all of the hints. We can also add more rules that fit the problem, as we shall see.
Shiny Descending Nails

Autodiff calculates a derivative value for each input value, collectively called the gradient. It informs us of the effect each input variable had on the final output from the main calculation, in our case, the error value from our error function. On its own, it does nothing. We need to somehow apply it back on the input hog counts.
There are many different algorithms for applying the gradient back on the input variables[5]. One common theme for these algorithms, or optimizers, is that they work iteratively in small steps, changing the input variables by tiny amounts each time. When done right, the error will decrease over time. Here is an example of a simple optimization/training loop:
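As a rough sketch of such a loop (the gradientOf helper and errorFunction are stand-ins of my own for whatever autodiff machinery produces the partial derivatives, and the parameter values are illustrative):

```javascript
// Plain gradient descent: nudge each input a tiny step against its partial derivative.
const learningRate = 0.01;                 // illustrative value
let params = [/* initial input values */];

for (let step = 0; step < 10000; step++) {
  const grads = gradientOf(errorFunction, params); // hypothetical: one partial derivative per input
  for (let i = 0; i < params.length; i++) {
    params[i] -= learningRate * grads[i];          // step each input against its gradient
  }
}
```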
We will try three different optimizers: gradient descent with momentum, RMSProp, and Adam, each with its own characteristics. A detailed discussion of optimizers is beyond the scope of this post, but you can try them out in the application and get a feel for each, at least in relation to this puzzle.
Multi-dimensional Sea of Error

Our problem consists of six variables, one hog count for each person (ignoring matchmaking of the couples for now). Say that we have a six-dimensional space. Each count can then be mapped to its own axis in this hyperspace. A point in this space will represent a possible set of hog values. Applying the error function on each point gives us a hyperspace where the value at each position represents the amount of error for that given set of hog counts.
When using gradient descent, we can view this space as a multidimensional sea of error. We pick a random point where we place our imaginary ship. The gradient from autodiff is used for steering: it tells us which way lower-error waters lie from our current location. We surf up but mostly down multidimensional waves, hills, and valleys of error. Hopefully, we will end up at a spot with the lowest value within our tiny universe of error.
Autodiff and gradient descent can train neural networks with thousands and even millions of variables. These measly six variables should be no match for these ultimate techniques, right?
Putting A Nail in the First Idea

The relationships between the husbands and wives are a crucial part of the answer. I tried to model them using softmax values, such that each husband would end up with the one and only "correct" wife. It was challenging to find the right pairs during optimization, and I could not get it working. We will skip this failed attempt to keep this blog post shorter.
Instead, we will model the problem in a more straightforward fashion by hard-wiring all possible relationships.
Closing in on the Hogs
With three males and three females, there are six possible pairings in total:
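One way to hard-wire them is simply to list all assignments of wives to (Hendrick, Claas, Cornelius). The listing below is my own, ordered so that the pairing of the known solution is the fifth entry:

```javascript
// All six ways to assign the wives to (Hendrick, Claas, Cornelius).
const pairings = [
  ["Geertruii", "Catriin",   "Anna"],
  ["Geertruii", "Anna",      "Catriin"],
  ["Catriin",   "Geertruii", "Anna"],
  ["Catriin",   "Anna",      "Geertruii"],
  ["Anna",      "Catriin",   "Geertruii"], // Hendrick–Anna, Claas–Catriin, Cornelius–Geertruii
  ["Anna",      "Geertruii", "Catriin"],
];
```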
We can find the correct pairing by solving for all the alternatives above. All six pairing options will be optimized, but we will focus only on the 5th since it is the correct pairing from the known solution[7]. The other pairings will not converge to a solution, but the embedded application later in the post allows you to investigate them if you are interested.
Let us review the hints again:
1) Hendrick bought 23 hogs more than Catriin
2) Claas bought 11 hogs more than Geertruii
3) shillingsPaidByPerson(hogsBoughtByPerson) = hogsBoughtByPerson²
4) Each husband paid 21×3 shillings more than his wife
The first three hints work the same way, no matter the pairing of the couples. The relationships only impact the last hint "Each man paid 21×3 shillings more than his wife".
The error function is expressed as the sum of the squared errors from all the possible pairings. Each pairing calculates its error based on all four hints:
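A minimal sketch of the per-pairing error, shown here for the pairing of the known solution (Hendrick–Anna, Claas–Catriin, Cornelius–Geertruii), again with plain math operators and placeholder names of my own:

```javascript
// Error for one pairing: one squared term per hint, zero only when all hints hold.
function pairingError(hogs) {
  const { hendrick, claas, cornelius, geertruii, catriin, anna } = hogs;
  let err = 0;
  err += (hendrick - catriin - 23) ** 2;          // hint 1
  err += (claas - geertruii - 11) ** 2;           // hint 2
  // hints 3 + 4: each husband pays hogs² shillings, 21×3 = 63 more than his wife
  err += (hendrick ** 2 - anna ** 2 - 63) ** 2;
  err += (claas ** 2 - catriin ** 2 - 63) ** 2;
  err += (cornelius ** 2 - geertruii ** 2 - 63) ** 2;
  return err;
}
```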
The error function will output zero if our input hog counts and the wife-husband relations are correct, or a positive value otherwise.
The Dodging Hogs

Now that we have the error calculation, we can write the training/optimization loop:
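As a sketch, reusing the hypothetical gradientOf helper from earlier (here assumed to return one partial derivative per person; the starting values and learning rate are illustrative):

```javascript
// Start from arbitrary hog counts and let gradient descent pull them toward the answer.
let hogs = { hendrick: 50, claas: 50, cornelius: 50, geertruii: 50, catriin: 50, anna: 50 };
const learningRate = 0.001;

for (let step = 0; step < 50000; step++) {
  const grads = gradientOf(pairingError, hogs);   // hypothetical: partial derivative per person
  for (const name of Object.keys(hogs)) {
    hogs[name] -= learningRate * grads[name];
  }
}
```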
It is always exciting to train a model for the first time, even though the first attempt seldom works. I was cautiously optimistic. Here is the result:

The animation shows the error graph for the pairing of the real solution. The error is close to zero but never quite reaches it (the label only shows two decimal places). Each married couple is shown in its respective error map. The x and y axes represent the husband’s and wife’s hog count, respectively. Each map is colored by calculating the error value using the coordinate as the hog counts for the pair. The other couples are kept constant at their values for the current iteration. The color does not show the plain error but the minimum error in view. In other words, black is the minimum error within the shown region (we will move around in these maps later). The tiny green squares show the known correct answer for each couple.
Ideally, we should look at a six-dimensional error map. Nevertheless, I hope these three 2D maps can give us a somewhat helpful view of the situation.
Needless to say, it did not work. I tried with different optimizers and also by varying their parameters. I could not find the solution. You may try for yourself:
By zooming way in, we can see this:

As mentioned, the real problem lives in a six-dimensional space, so it is not entirely correct to look at three 2D maps, but we can try to understand the issue using them. It seems like all three couples have found themselves within a "local minimum"[6]. My hypothesis: the couples are in a pocket with a low error value, and there is no direction to move them from their current position that will reduce the error further. Both the solution and I were stuck.
Better Hog Targeting
Our hog counts stopped converging toward a solution, probably because our error function could not distinguish clearly enough between the real answer and this first result. We need additional requirements in the error function.
We know that the hog counts must be at least certain minimum values to be valid, and the same goes for the payments. With that in mind, we can add a new requirement that pushes the values back in line with the hints whenever they drop below the valid thresholds. The 4th hint, "husband pays 63 more than his wife", has minimum valid hog counts of (8, 1) for (husband, wife), which results in payments of (64, 1) for a couple, giving us the requested difference of 63.
The improved error function:
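A sketch of one possible way to express such a minimum-value requirement, penalizing hog counts that fall below the valid minimums (the actual term in the embedded code may be formulated differently, for example on the payments instead):

```javascript
// Extra penalty that grows when a couple's hog counts drop below the minimums
// allowed by hint 4: at least 8 hogs for a husband and 1 for a wife.
function belowMinimumPenalty(husbandHogs, wifeHogs) {
  const husbandShortfall = Math.max(0, 8 - husbandHogs);
  const wifeShortfall = Math.max(0, 1 - wifeHogs);
  return husbandShortfall ** 2 + wifeShortfall ** 2;
}

function improvedPairingError(hogs) {
  const { hendrick, claas, cornelius, geertruii, catriin, anna } = hogs;
  return pairingError(hogs)
    + belowMinimumPenalty(hendrick, anna)
    + belowMinimumPenalty(claas, catriin)
    + belowMinimumPenalty(cornelius, geertruii);
}
```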
Nailing the Hogs

Optimizing with the modified error function looks like this:

Finally, we got the solution:
- (32, 31) hogs for (Hendrick, Anna)
- (12, 9) hogs for (Claas, Catriin)
- (8, 1) hogs for (Cornelius, Geertruii)
All optimizers manage to find the answer, some faster than others. Unlike when training neural networks, where Adam is one of the faster-converging optimizers, here it is standard gradient descent with momentum that is the fastest. It may well be that I did not find suitable parameters for the optimizers (or maybe introduced bugs in my code). Even though gradient descent with momentum is faster here, it requires careful tuning of its parameters (learning rate and momentum). Both Adam and RMSProp are less sensitive to their parameters.
Try to find the solution yourself:
The Aftermath
When all the nailed hogs had been carried away and the market cleaned up, we finally got our answer. Using gradient descent to solve this puzzle may be unconventional and a bit convoluted, especially when modern tools can brute-force it, even on a mobile device. The preferable method is no doubt to use some analytical skills.
But convoluted or not, one can view this problem as a set of equations and an error function. That tells me there is a chance the answer can be found by optimization. It was not easy, at least for an amateur autodiff enthusiast like me 🙂 It was fun that it worked out in the end. Whenever you are just having fun, it is entirely acceptable to use your shiny golden hammer.
About the Implementation
Automatic differentiation requires extra data for each variable and value. In a calculation like "a + b", we need to mark both variables with additional information. The implementation does this by modeling these variables as objects.
All pseudocode in the main text shows calculations using regular math operators. The implementation is done in JavaScript, which does not support operator overloading, and hence "y = a + b" must be programmed like this:
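Something along these lines (the exact method name in the pens may differ):

```javascript
// With values wrapped in autodiff objects, "y = a + b" becomes a method call.
const y = a.add(b);
```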
Just keep that in mind if you examine the code in the (code) pens.
The code for automatic differentiation can be viewed from the resources of the pens. It is a simple implementation and only supports scalar values. You may read about it here:
Bonus: Solving with Brute-force
I mentioned that the answer can be brute-forced with today’s equipment.

A simple program can go through all possible hog count values for every person, and do so for all six possible pairing options. The application below does precisely that. One addition is early stopping: the code skips the rest of a branch as soon as the values it has picked so far, say for only two of the participants, already violate the requirements. Otherwise, it is a straightforward implementation.
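A sketch of such a search in my own formulation (not the embedded app's code); it uses hints 1 and 2 directly, so only four of the six counts need to be enumerated, and the bound on the counts is illustrative:

```javascript
// Brute force: enumerate hog counts, derive Hendrick and Claas from hints 1 and 2,
// then check hints 3 and 4 (husband pays hogs² shillings, 63 more than his wife)
// for every possible assignment of wives to (Hendrick, Claas, Cornelius).
const MAX = 100; // illustrative upper bound on hogs per person

const wifeOrders = [
  ["geertruii", "catriin", "anna"], ["geertruii", "anna", "catriin"],
  ["catriin", "geertruii", "anna"], ["catriin", "anna", "geertruii"],
  ["anna", "catriin", "geertruii"], ["anna", "geertruii", "catriin"],
];

for (let catriin = 1; catriin <= MAX; catriin++) {
  const hendrick = catriin + 23;                       // hint 1
  for (let geertruii = 1; geertruii <= MAX; geertruii++) {
    const claas = geertruii + 11;                      // hint 2
    for (let anna = 1; anna <= MAX; anna++) {
      for (let cornelius = 1; cornelius <= MAX; cornelius++) {
        const husbands = { hendrick, claas, cornelius };
        const wives = { geertruii, catriin, anna };
        for (const order of wifeOrders) {
          const ok = ["hendrick", "claas", "cornelius"].every(
            (h, i) => husbands[h] ** 2 - wives[order[i]] ** 2 === 63
          );
          if (ok) console.log(husbands, wives, order); // a consistent solution
        }
      }
    }
  }
}
```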
Try the thrill of clicking a button and watching a progress bar:
Sorry for the boring app. Maybe it can serve as a crude performance benchmark? 🙂
I wonder how the creator of the puzzle, from 1739[7], would react to the puzzle being brute-forced within a matter of seconds, on a device fetched from our pocket. We live in amazing times.
[1] https://en.wikipedia.org/wiki/Automatic_differentiation
[2] http://www.autodiff.org/?module=Applications
[3] https://www.jmlr.org/papers/volume18/17-468/17-468.pdf
[5] https://ruder.io/optimizing-gradient-descent
[6] https://en.wikipedia.org/wiki/Local_optimum
[7] https://www.cantorsparadise.com/three-dutchmen-f9f08ac19d73