Ruining Sudoku — A Data Science project (Part 3: Digits recognition and Sudoku solver)

Matteo Barbieri
Towards Data Science
7 min read · Oct 26, 2020


The number five. Or, according to every classifier trained on MNIST ever, a six. Or maybe an eight, or Batman’s symbol, because why not. Photo by Ryan Johns on Unsplash

All right, this is the one article that you really cared about, where GPUs heat up and loss curves flatten out: today we’ll discuss the two core parts of the project, namely how to identify which of the nine digits is present in each of the Sudoku cells we identified and separated beforehand (check the second article of this series for details), and from there how to solve the puzzle. Oh, and we’ll use the 2020 equivalent of magic: AI.

But first, a quick recap of how we got to this point.

Previously (and next week), in this series

Photo by Frank Vessia on Unsplash

We started with a relatively simple idea: designing and implementing a system that takes as an input a picture where a Sudoku grid is clearly visible (but not necessarily aligned/centered) and returns the solution to that puzzle.

We spent a decent amount of time organizing our work so that we had at least a rough idea of how difficult the various tasks would be (and therefore how much time we would need to allocate to each of them), and we tackled the first major work package: data preprocessing.

Now we’re at the point where we trust that our system properly processes the input images and returns the 81 cells of the grid neatly cleaned and separated: time for the fun stuff.

Step 2: Solving Sudoku

If you’re reading this article, you’re probably quite good at math and therefore noticed that I put step 2 before step 1, despite 2 being unanimously considered greater than 1. I swear there’s a reason for that.

If you remember, while planning I assigned an estimated difficulty level to each of the subtasks, with red marking those I knew little or nothing about. The solver is one of those. While I was almost sure a solution existed in the form of an algorithm, out there in some Github repo, I hadn’t yet properly searched for it, so I preferred not to take it for granted and worked on this part before focusing on the digit recognition task, which, on the other hand, I thought I knew how to tackle. Oh, the irony.

The final project organization diagram. Colors indicate the estimated difficulty.
Image by author

Anyway, a brief search was sufficient to find a snippet that does exactly that. The purpose of this whole section, then, is just to highlight the importance of focusing first on what you’re least confident about, so that you get rid of that uncertainty as early as possible, either by finding a solution (as in this case) or by finding a workaround.
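For the record, the snippet I found isn’t reproduced here, but snippets like it almost invariably implement the classic backtracking algorithm. A minimal sketch (my own code, not the exact one I used):

```python
def is_valid(grid, r, c, digit):
    """Check the row, column and 3x3 box constraints for `digit`."""
    if digit in grid[r]:
        return False
    if any(grid[i][c] == digit for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[br + i][bc + j] != digit
               for i in range(3) for j in range(3))

def solve(grid):
    """Solve a 9x9 Sudoku in place; `grid` is a list of 9 lists
    of 9 ints, with 0 marking an empty cell."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for digit in range(1, 10):
                    if is_valid(grid, r, c, digit):
                        grid[r][c] = digit
                        if solve(grid):
                            return True
                        grid[r][c] = 0  # undo and try the next digit
                return False  # no digit fits here: backtrack
    return True  # no empty cells left: solved
```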

Step 1: Identifying digits

The only missing block at this point is the digit classifier; once that is plugged into our system, the project is essentially done.

I expected this to be a relatively easy part: everyone who has approached Deep Learning, and CNNs specifically, has had to go through the initiation ritual of training a model on MNIST, the infamous handwritten digit recognition dataset, and can produce such a classifier in their sleep. So I did exactly that.
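For context, this is roughly what that initiation ritual looks like. The framework choice (PyTorch) and the exact architecture are mine for illustration; the article doesn’t pin down either:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class DigitCNN(nn.Module):
    """A small two-conv-layer CNN, the standard MNIST starter model."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)   # 28x28 -> 26x26
        self.conv2 = nn.Conv2d(32, 64, 3)  # 13x13 -> 11x11
        self.fc1 = nn.Linear(64 * 5 * 5, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # 26 -> 13
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # 11 -> 5
        return self.fc2(F.relu(self.fc1(torch.flatten(x, 1))))

train_set = datasets.MNIST(".", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=64, shuffle=True)
model = DigitCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for images, labels in loader:  # one epoch is typically enough on MNIST
    opt.zero_grad()
    F.cross_entropy(model(images), labels).backward()
    opt.step()
```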

Unfortunately, while it performed quite well on MNIST’s test set, the results on the actual images produced by the previous step of the pipeline (the isolated cells, each containing a single digit) were disappointing, to say the least. Sure, it still got it right most of the time, but keep in mind that for this system to work, even a single misclassified digit leads to an incorrect solution. I had hit the first real speed bump in this project and had to deal with it.

Two-step troubleshooting

It sucks, but these things happen. To address issues like this one, I try to answer these two questions:

  1. Why is it not working?
  2. What can I do about it?

To answer the first question and figure out why I was getting such poor performance (relatively speaking), I analyzed the images that were incorrectly classified and noticed that most of the mistakes happened on the digit 1, which was often (incorrectly) labeled as a 7. In other cases, the image had some kind of noise (small patches of non-black pixels) that had not been removed by the preprocessing step, and that was apparently enough to confuse the model into a wrong prediction, despite the digit itself appearing clearly in the image!

On the left, a sample of digit 1 from the MNIST dataset. On the right, a 1 extracted from an actual sudoku grid. Despite the two being quite similar, a classifier trained on MNIST will probably fail to classify the one on the right correctly, predicting a 7 instead. Image by author

To be honest, this did not really surprise me: I have severe issues with the MNIST dataset; I literally wrote a 1,000+ word rant about why it is garbage on fire. The short version is that while it’s a nice introduction to image classification problems, it does not transfer well to a problem that is even slightly different (digitally printed digits instead of handwritten ones) from the one presented in most tutorials available online.

The solution I devised is based on the holy principle that you should do your best to train a model on data that is as similar as possible to the data on which you want that model to work. In case you’re looking for a quote for your next forehead tattoo, please consider the previous sentence.

What I did was generate a synthetic dataset, using the Python PIL package to draw the individual digits on a square canvas in slightly different sizes and positions. For every image, a font was chosen from a list of those most similar to the ones used in magazines, to improve the generalization of the model. Moreover, these images are created on the fly during model training, so no disk space is required, as they are never saved.
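A sketch of what that generation looks like. The font file names below are placeholders (any print-like .ttf fonts will do), as are the exact size and offset ranges:

```python
import random
from PIL import Image, ImageDraw, ImageFont

# Placeholder list: substitute fonts resembling those used in magazines.
FONTS = ["DejaVuSans.ttf", "DejaVuSerif.ttf", "FreeSans.ttf"]

def make_digit_image(digit, size=28):
    """Render one digit, white on black, with the font, size and
    position jittered slightly for variety."""
    img = Image.new("L", (size, size), color=0)  # black square canvas
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(random.choice(FONTS),
                              size=random.randint(16, 22))
    # Small random offset so the digit isn't always perfectly centered
    xy = (random.randint(4, 8), random.randint(2, 6))
    draw.text(xy, str(digit), fill=255, font=font)
    return img
```

Since each image is cheap to render, a fresh batch can be drawn at every training step, which is what “created on the fly” means in practice.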

This way of generating data, coupled with some standard data augmentation (just rotation, actually, plus some custom salt-and-pepper noise to make the model robust to defects in the images), produced a model that, while not perfect, yielded significantly better results.
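The salt-and-pepper part is simple enough to write by hand; a sketch, with an illustrative noise fraction:

```python
import numpy as np

def salt_and_pepper(img, amount=0.02):
    """Flip a small random fraction of pixels to pure white or
    pure black; `img` is a 2D uint8 array."""
    noisy = img.copy()
    mask = np.random.rand(*img.shape)
    noisy[mask < amount / 2] = 255                     # salt
    noisy[(mask >= amount / 2) & (mask < amount)] = 0  # pepper
    return noisy
```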

There’s one minor detail left to discuss: a digit classifier recognizes, well, digits from 0 to 9, but a Sudoku cell either contains a digit from 1 to 9 or is empty. The simplest way I found to handle this is to first use a heuristic to check whether there is anything in the cell at all (by simply counting how many pixels differ from the background color and deciding on some reasonable threshold), and only if there is something proceed with the actual classification. This has the additional advantage of not wasting time and resources on useless computation: counting white pixels is much cheaper than running inference on a CNN.
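The heuristic fits in a couple of lines; the thresholds here are illustrative, not the ones I actually tuned:

```python
import numpy as np

def cell_is_empty(cell, min_pixels=20):
    """Treat the cell as empty if too few pixels stand out from
    the (dark) background; `cell` is a 2D uint8 array."""
    return np.count_nonzero(cell > 50) < min_pixels
```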

Boring (but useful!) tools/logistics blabla

Before I go back to my Factorio megabase, a few more words on the tools that I used in this project.

I really don’t want to get into the whole text editor/IDE gang war; too much blood has been shed already. I personally use Vim (with a huge amount of plugins, check my .vimrc in case you’re interested) because I’m super old, but I have no strong opinion on this topic: use whatever makes you happy.

On the other hand, I have recently been consistently using an experiment tracking/bookkeeping tool to keep track of the experiments that are running or have completed. Even for a super small project such as this one, I suggest using one: it will make your life much simpler at the cost of a few additional lines of code. There are several tools out there; I personally love and recommend Sacred (plus Omniboard for visualization).
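To give you an idea of how few lines we’re talking about, here is a minimal Sacred setup. The experiment name, config values and the `train_one_epoch` helper are hypothetical; Omniboard reads runs from MongoDB, so the observer assumes a local instance is running:

```python
from sacred import Experiment
from sacred.observers import MongoObserver

ex = Experiment("sudoku_digit_classifier")
ex.observers.append(MongoObserver(url="localhost:27017", db_name="sacred"))

@ex.config
def config():
    lr = 1e-3
    epochs = 20
    noise_amount = 0.02

@ex.automain
def main(lr, epochs, noise_amount, _run):
    for epoch in range(epochs):
        val_acc = train_one_epoch(lr, noise_amount)  # hypothetical helper
        _run.log_scalar("val_accuracy", val_acc, epoch)  # visible in Omniboard
```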

Finally, and this is an easy one, Jupyter notebooks are perfect for data exploration and quickly prototyping things. Use them!

Done!

That was it! For real, this time! The pipeline is now complete, we have a clear path from the original input image to the puzzle solution. Next week we’ll complete the project by packaging it in the form of a web app using Docker.

You can find the notebooks and code for this series of articles in this Github repo.

See you in seven days!

