
A Two-Stage Approach to Counting Dice Values with TensorFlow and PyTorch

Built a two-stage TensorFlow and PyTorch model to count dice values and found that it outperforms my previous single detection model

In a previous blog I discussed how I built a 12-class dice detector in TensorFlow using around 400 annotated images as training data. The goal of the model was to detect the presence of a dice face for a 6-, 8-, 10-, or 12-sided die and then determine its face value. Once this was done I could get the total value of the dice on the screen.

This model did decently at the task, but had the issue of either failing to identify dice faces or misclassifying the face values it did detect. So at the end of that previous post I mentioned the other approach I would take to this problem: build a first-stage object detector that specializes in just detecting dice faces, then a second-stage CNN that uses the outputs of the first model to determine the numbers. While this adds complexity to the training and implementation pipelines, I felt it could improve overall performance.

My reasoning for the potential performance improvement is based on the tradeoff between a one-size-fits-all generalized model and breaking the problem into smaller pieces with models that specialize at specific tasks. In this case the first-stage object detector can learn to identify dice faces generically rather than each type of die individually. This means it gets more exposure to dice seen from different angles, since identifying faces on the 8- and 10-sided dice poses similar issues depending on the direction. Then, for the second-stage CNN, I can apply a large number of rotations and flips to augment the data far more heavily than I was able to in the generic object detection model I built previously, helping it identify dice values no matter the orientation.

New object detection model has a single class "dice_top"; the boxes get passed to a backend ResNet model for classification.
GIF of the same video from the previous blog, rerun here. The two-stage pipeline also gets a value of 20.

You can find the code on GitHub here. The scripts need to be used as part of the TensorFlow object detection library, and I modified the detection scripts at various points for data preparation. The ones there are the ones I used to do the final labeling of images and video as seen in this post.

Training the New Object Detector

For this project I used the same dataset as in my previous blog. For the object detection model I used only 200 images, where I adjusted all the bounding boxes to have the label "dice_top" instead of the numbers 1–12. I did this by quickly going through the XMLs with labelImg and adjusting the labels manually. Originally I tried to adjust the labels automatically in the CSV generated from the XMLs, but ran into strange model behaviors while training, so I reverted to the manual method.
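For reference, the automated route I first attempted would amount to rewriting the `<name>` tag of every object in the Pascal VOC XMLs that labelImg produces. A minimal sketch of that idea (the function name and folder path are illustrative, not from my actual scripts):

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def relabel_voc_xml(xml_path, new_label="dice_top"):
    """Rewrite every object's <name> in a Pascal VOC annotation file."""
    tree = ET.parse(xml_path)
    for obj in tree.getroot().iter("object"):
        obj.find("name").text = new_label
    tree.write(xml_path)

# illustrative usage: relabel every annotation in a folder
# for xml_file in Path("annotations").glob("*.xml"):
#     relabel_voc_xml(xml_file)
```

In principle this collapses all 12 labels to one in a few seconds, though as noted above the manual labelImg pass is what actually trained cleanly for me.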

With the single class, the model only had to run for an hour or so before I stopped it at a good loss threshold, instead of the six or so hours required by the previous model.

At this stage I was able to evaluate how the new object detection model was performing in comparison to its predecessor. Something that came up quickly was that it was detecting dice that the first model missed. The image below on the left is an output from the first model, whereas the right image is the same image processed by the new single-class detector. The 6 on the left is detected by the new model but missed by the first.

Similar story in the following image pair. The d8 near the top was not detected by the first model, but is detected by the second.

Both models also show a weakness for 10-sided dice. In this case both do poorly at detecting the blue 10-sided die in the corner, though the first model at least fails more cleanly by simply not detecting it.

So while the first model fails a bit more gracefully on the blue 10-sided die, it is worth noting that the new single-class detector was trained on half as much data yet does a better job of detecting dice than its predecessor overall.

Now that the new dice detector was in place I could go ahead and start to build the second stage model for value classification.

Preparing Data for the Backend Model

The more interesting set of data preparations was getting the data for the backend CNN. I knew I needed to train a model to identify the numbers on dice faces regardless of their orientation, so I figured a CNN with heavy data augmentation, in the form of random vertical and horizontal flips along with random rotations, would be helpful. I decided to use a PyTorch backend model because it provides a nice, simple pipeline for training and deploying models (I have also just run too many Keras models and like the variety). My additional reasoning is mostly that I enjoy using PyTorch and have a good code base for this type of problem from my other blogs.

So to get the dice top images into my desired folder structure I had a few options.

  1. Crop a few hundred images of each class by hand: obviously slow and suboptimal, and too much manual work for my taste.
  2. Use the new dice top model to crop out all of the dice tops and then sort them into folders. This also requires a good bit of work to sort. At this point I am guessing you all know where I’m going with this.
  3. Use my previously trained 12 class object detector to sort images into the 12 folders based on predicted class and then look through the dataset to find misclassifications.

I went with option 3 because it required the least mechanical work on my end, and I have found it to be a clever solution to similar problems in the past.

For this, all I had to do was modify the script I used to draw bounding box outputs onto images so that it supplied the box coordinates, letting me slice the image arrays to crop out just the dice tops. Then, using the predicted category for each bounding box, I sorted the crops into the different folders. This let me generate 2,500 or so cropped images in a few minutes rather than however long it would have taken to crop that many by hand…
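The crop itself is just array slicing once the detector's normalized boxes are scaled to pixel coordinates. A minimal sketch, assuming the TF object detection API's convention of `[ymin, xmin, ymax, xmax]` boxes in 0–1 coordinates (the function name and score threshold are my own, not from the original scripts):

```python
import numpy as np

def crop_boxes(image, boxes, scores, min_score=0.5):
    """Slice detected dice tops out of an image array.

    image:  HxWx3 uint8 array
    boxes:  Nx4 normalized [ymin, xmin, ymax, xmax], as returned
            by the TF object detection API
    scores: N detection confidences
    """
    h, w = image.shape[:2]
    crops = []
    for box, score in zip(boxes, scores):
        if score < min_score:
            continue  # skip low-confidence detections
        ymin, xmin, ymax, xmax = box
        crops.append(image[int(ymin * h):int(ymax * h),
                           int(xmin * w):int(xmax * w)])
    return crops
```

Each crop then gets written into the folder named for its predicted class.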

Next came the cleaning. Below is a screenshot of the 6s folder, where a number of 9s are hidden in the mix. With these dice, 9s and 6s are typically marked with a line or dot near the bottom; if a die has no such marking, the face is a 6.

Another interesting category was looking at 1s vs 7s.

This also let me get a good feel for the types of classification errors my 12-class object detection model was making. Depending on the angle you would see 5s get marked as 2s or 6s, and 11s or 12s as 10s, 2s, or 1s, since the model had relatively little exposure to those numbers.

Once the dataset was sorted into the 12 folders I was ready to throw a few CNNs at it to see what worked.

Training the Pytorch CNN

With the dataset in place and broken out into folders by class, it was pretty straightforward to leverage a PyTorch CNN pipeline using the standard Dataset and DataLoader classes. One thing to note is that I made a validation set of 120 images (10 from each class). I made it balanced even though the data is heavily imbalanced towards values 1–6, since those show up on all of the dice while values 11 and 12 only show up on d12s and so surface the least.
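Carving out that balanced validation split is just sampling a fixed number of files per class folder. A sketch of the idea (the helper name is mine; 10 per class mirrors the split described above):

```python
import random

def balanced_val_split(files_by_class, per_class=10, seed=0):
    """Hold out `per_class` examples from each class; the rest train."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, val = {}, {}
    for label, files in files_by_class.items():
        shuffled = files[:]
        rng.shuffle(shuffled)
        val[label] = shuffled[:per_class]
        train[label] = shuffled[per_class:]
    return train, val
```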

What I found is that pre-trained ResNet models worked fairly well when I allowed fine-tuning of the weights over a few epochs. After around 15 epochs I had 95% validation set accuracy with a ResNet 50, and switching to a ResNet 101 increased that to 97%. I ended up sticking with the ResNet 101, since model footprint is not a huge concern of mine at the moment.

Putting them Together

Now that I had the two pieces, all I had to do was put them together into a single pipeline. The first-stage model would detect dice faces and feed those faces to the second model for classification. Then, based on the classifications from that backend model, I was able to add up the dice values on the screen and display the total.
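The totaling step at the end is a small amount of glue: map each crop's predicted class index back to its face value and sum. One wrinkle worth a sketch is that folder-based loaders like torchvision's ImageFolder sort class folders lexically ("1", "10", "11", "12", "2", …), so the index-to-value map has to follow that order (the names here are illustrative):

```python
# ImageFolder sorts class folders lexically, so the mapping from
# class index back to face value must use the same ordering
CLASS_FOLDERS = sorted(str(v) for v in range(1, 13))
IDX_TO_VALUE = {i: int(name) for i, name in enumerate(CLASS_FOLDERS)}

def total_value(predicted_indices):
    """Sum the face values for one frame's worth of predictions."""
    return sum(IDX_TO_VALUE[i] for i in predicted_indices)
```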

This was relatively straightforward: it required initializing the PyTorch model and adding functions to preprocess the images, map the results to their labels, and add up the results. The only annoying part is that I do not currently have a multi-GPU machine, and working with multiple models when some of them are TensorFlow-based can get annoying.

For other projects I have run multiple large PyTorch models on the same GPU, as in my Fate Grand Order bots. However, I have found that TensorFlow tends to allocate all available GPU resources to itself. So to run multiple models I have essentially juggled which one is active at any given point, but for this use case I just allocated the PyTorch CNN to run on my CPU. This slows down evaluation, but was a quick fix for the time being. My next tech purchase will probably be building out a GPU cluster to make things like this less annoying.
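On the PyTorch side the fix is a one-line device choice. A sketch of pinning the classifier to CPU while TensorFlow keeps the GPU (the tiny linear module here is a stand-in for the actual ResNet):

```python
import torch

device = torch.device("cpu")  # leave the GPU to TensorFlow

# stand-in for the ResNet classifier in the real pipeline
classifier = torch.nn.Linear(8, 12).to(device)
classifier.eval()

x = torch.randn(1, 8, device=device)
with torch.no_grad():
    logits = classifier(x)
```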

It gets the correct 20, but it shows some fluttering toward thinking the 6 is a 9, a problem I found with the backend classifier that I will discuss more below.

Comparing the two Approaches

When looking at these two approaches I think that the main tradeoffs are the ones between speed and accuracy.

The first approach, with a single model I can run on my one GPU, evaluates much faster than my GPU + CPU approach. However, running the second pipeline across two GPUs would likely minimize this difference.

Now for the fun one: comparing performance between the two. For this I held out 25 examples from both the training and validation sets (I guess you could call this the test set) and ran both models over them. I checked whether they got the correct total dice value, and when they did not I noted the types of errors they made.

The first model got 8 of the 25 completely correct (32%) whereas the two stage pipeline got 16 out of 25 completely correct (64%). So my general hypothesis in doing this blog seems to be correct, even though both still have room for improvement.

The errors the first model made were generally either not detecting certain dice or misclassifying certain numbers, as I noted above. Common confusions are 6s vs 9s, 1s vs 7s, and a number of 8s vs 3s vs 6s. The image below has 3 dice that were not detected and a 3 that was classified as a 6.

Image from the first single object detection model

The two stage pipeline outperforms the first model by identifying the 3 dice that were missed and correctly classifying all of the dice faces for a value of 41.

However, the second pipeline still has issues. In the 25 cases I looked at, its errors were failing to correctly detect the faces on the 10-sided dice (which both models did poorly at) and misclassifying 6s as 9s. To fix the latter I would check that there are no mislabeled examples in the 6 and 9 folders of the dataset, and if that does not fix the issue, augment the data with more images of 9s and 6s. Between those two fixes I think a lot of the issues would go away. For the misidentified faces on the 10-sided dice, adding more data for that particular die would be useful.

True value should be 41, but is off by 3. I think it comes down to the 6 sided die that does not have a marking on it. The fact that this is an issue makes me think I have some of those dice examples in both the 6 and 9 folders which poisons the training data.

Final Thoughts

So in this post I showed how I built the two-stage pipeline I mentioned in my first dice counting post. In addition, I am happy to report that the two-stage pipeline does indeed outperform my first model, doing a better job of both detecting dice faces and then categorizing them. For me this is another instance where building well-specialized models outperforms a generic approach. For some more intuition on why this works, something from my martial arts background comes to mind, summed up in a quote from Bruce Lee: "I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times."

My personal version of this was when my first sword master told me that I should "practice a single technique 1,000 times to understand it, and 10,000 times to master it".

Neural networks are similar in that they typically need thousands of examples to converge on a good solution and learn to identify a specific class. So streamlining what they have to optimize for lets them better fit a specific problem and learn its ins and outs. In technical terms: my 12-class detector may only see a specific die or dice face a few dozen times and must learn to both localize and categorize it from those few examples, while the single-class detector sees its one class 2,500 times. The second-stage model may likewise only see each class a few dozen to a few hundred times, but heavy augmentation vastly increases the effective training data at its disposal. This gives it much more classification experience than the basic object detection model.

This added specialization comes at the cost of training more models and implementing them in a pipeline; in this case, the two-stage pipeline requires two GPUs to run smoothly instead of the single model's one. Technically I could just avoid TensorFlow, since I have run 3–4 large ResNets at once before using PyTorch. The value of this speed vs accuracy trade-off all comes down to what level of accuracy versus speed you need for a specific task.

I had an extra gif, so here is some more dice rolling!

You can find the code on GitHub here.

