AI for good: Banknotes detection for blind people

A guide to making a state-of-the-art banknote detector using deep learning

Bruno Sánchez-A Nuño
Towards Data Science

--

This service recognizes which currency a banknote belongs to (euro or US dollar) and its denomination (5, 10, 20, …). The purpose is social impact, helping blind people, so I took care to make “real-life” training images: holding the banknotes in my hand, sometimes folded, sometimes covering part of the note.

This post hopefully helps encourage others to learn Deep Learning. I’m using the amazing fast.ai online free course, which I very much recommend. As a testament to their pragmatic, top-down approach, this side-project is based on lesson 3. On their online forum you can find many more amazing applications by fellow students.

My project is deployed on iris.brunosan.eu

Here’s an example inference:

Amazingly, the fast, fun and easy part was the deep learning itself (congrats fastai!); the production server took roughly 10x the time (I also had to learn some details about Docker and serverless applications).

The challenge

I found just a few efforts to identify banknotes for blind people. Some attempts use classical computer vision with “scale-invariant features” (around 70% accuracy) and some use machine learning (with much higher accuracy). On the machine learning side, it is worth mentioning one by Microsoft Research last year and one this year by a Nepali programmer, Kshitiz Rimal, with support from Intel:

  • Microsoft announced their version at an AI summit last year; it “has been downloaded more than 100,000 times and has helped users with over three million tasks.” Their code is available here (sans training data). Basically, they use Keras and transfer learning, as we do in our course, but they don’t unfreeze for fine-tuning, and they create a “background” negative class with non-relevant pictures (I find it odd to create a negative class… how can you learn “absence features”?). They used a mobile-friendly pre-trained net, “MobileNet”, to run the detection on-device, and 250 images per banknote (plus data augmentation). They get 85% accuracy.
  • The Nepali version from Kshitiz: 14,000 images in total (taken by him), and it gets 93% accuracy. He started with VGG19 and Keras for the neural net, and “React Native” for the app (a framework that can create both an iOS and an Android app from the same code), but then he switched to TensorFlow with MobileNetV2 and native apps on each platform. This was a 6-month effort. Kudos!! He has the code for the training, AND the code for the apps, AND the training data on GitHub.

My goal was to replicate a similar solution, but I will only make a functioning website, not the app, nor on-device detection (I’m leaving that for now). However, I am going to use a different architecture. Since I want to handle several currencies at once, I wanted to try multi-label classification. All the solutions I’ve seen use single-class detection, e.g. “`1 usd`”, and I wanted to break it into two classes, “`1`” and “`usd`”. The reason is that I think there are features to learn across currencies (all USD bills look similar) and also across denominations (e.g. the 5 usd and 5 eur notes have the number 5 in common). The commonalities should help the net reinforce those features for each class (e.g. a big digit “5”).
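
To make the idea concrete, here is a minimal sketch of how such a labels file could look (file and column names are hypothetical). fastai’s data block API can then read the space-separated tags directly, as in the training sketch further below:

import pandas as pd

# Hypothetical labels file: one row per training image, with two
# space-separated tags (denomination + currency) instead of one combined class.
df = pd.DataFrame({
    "image": ["img_001.jpg", "img_002.jpg", "img_003.jpg", "img_004.jpg"],
    "tags":  ["5 usd",       "5 euro",      "100 usd",     "500 euro"],
})
df.to_csv("labels.csv", index=False)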

The easy part, Deep learning

I basically followed the fast.ai course lesson on multi-label classification for satellite imagery, without many modifications:

The data

It is surprisingly hard to get images of single banknotes in real-life situations. After finishing this project I found that the academic paper on Jordanian currency I mentioned above and the Nepali project both link to their datasets.

I decided to lean on Google Image searches and images from the European Central Bank and the US Mint, which I knew would give me unrealistically good images of banknotes. I also took some pictures myself with money I had at home, for the low denominations (sadly I don’t have $100 or €500 bills lying around). In total I had between 14 and 30 images per banknote denomination. Not much at all. My image dataset is here.

Training images for identifying banknotes, a combination of stock photos and some I made.

Since I didn’t have many images, I used data augmentation with widened parameters (I wrongly added flips; it’s probably not a good idea for banknotes):

from fastai.vision import *  # fastai v1; provides get_transforms

tfms = get_transforms(do_flip=True, flip_vert=True,
                      max_rotate=90,
                      max_zoom=1.5,
                      max_lighting=0.5,
                      max_warp=0.5)

In the end, the dataset looked like this during training/validation:

Note the effect of data augmentation: rotating, flipping and cropping the images.

It’s amazing that one can get such good results with so few images.

The training

I used a 20% split for validation, 256-pixel images, and `resnet50` as the pre-trained model. With the resnet frozen, I did 15 epochs (2 minutes each) training the added top layers, and got an fbeta of `0.87`, pretty good already. I then unfroze the model and trained 20 more epochs with sliced learning rates (bigger rates on the last layers), to get `fbeta=0.98`. I was able to squeeze out some more accuracy by freezing the pre-trained layers again and running a few more epochs. The best was `fbeta=0.983`. There were no signs of over-fitting, and I used the default dropout parameters.
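
For reference, this is roughly what that recipe looks like in fastai v1 (lesson-3 style). The path, batch size and learning rates below are indicative sketch values, not necessarily the exact ones I ended up with:

from fastai.vision import *

# A sketch of the training recipe described above (fastai v1).
path = Path('data/banknotes')   # hypothetical location of images + labels.csv

np.random.seed(42)
src = (ImageList.from_csv(path, 'labels.csv', folder='images')
       .split_by_rand_pct(0.2)            # 20% validation split
       .label_from_df(label_delim=' '))   # two tags per image: denomination + currency

data = (src.transform(tfms, size=256)     # tfms defined in the snippet above
        .databunch(bs=16)
        .normalize(imagenet_stats))

f_score = partial(fbeta, thresh=0.2)
learn = cnn_learner(data, models.resnet50, metrics=[accuracy_thresh, f_score])

# Stage 1: train only the new head, with the pre-trained resnet frozen.
learn.fit_one_cycle(15, 1e-2)

# Stage 2: unfreeze and fine-tune with sliced (discriminative) learning rates.
learn.unfreeze()
learn.fit_one_cycle(20, slice(1e-5, 1e-3))

learn.export()  # saves export.pkl with the trained model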

Exporting the model and testing inference.

Exporting the model to a PyTorch TorchScript module for deployment takes just a few lines of code.
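
Something along these lines (the file name is arbitrary, and I trace with a dummy input of the 256-pixel training size):

import torch

# A minimal sketch of the TorchScript export; 'banknotes.pt' is an arbitrary name.
model = learn.model.eval().cpu()      # the trained PyTorch model inside the fastai Learner
example = torch.rand(1, 3, 256, 256)  # dummy input matching the training image size
traced = torch.jit.trace(model, example)
traced.save('banknotes.pt')           # this single file is what the server needs

# Sanity check: load it back and run a forward pass without fastai.
loaded = torch.jit.load('banknotes.pt')
with torch.no_grad():
    raw = loaded(example)             # one raw activation per class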

I did spend some time testing the exported model and looking at the outputs (both the raw activations and the softmax). I then realized that I could use them to infer confidence:

  • positive raw activations (which always translate to a high softmax) usually meant high confidence.
  • negative raw activations but non-zero softmax probabilities happened when there was no clear identification, so I could use them as “tentative alternatives”.

E.g., let’s look at this problematic image of a folded 5 usd bill covering most of the 5:

A difficult image, folded and covering most of the number 5, for which the model still correctly identifies the denomination.
{'probabilities': {
    'classes': ['1', '10', '100', '20', '200', '5', '50', '500', 'euro', 'usd'],
    'softmax': ['0.00', '0.00', '0.01', '0.04', '0.01', '0.20', '0.00', '0.00', '0.00', '99.73'],
    'output':  ['-544.18', '-616.93', '-347.05', '-246.08', '-430.36',
                '-83.76', '-550.20', '-655.22', '-535.67', '537.59']},
 'summary': ['usd'],
 'others': {'5': '0.20%', '20': '0.04%', '100': '0.01%', '200': '0.01%'}}

Only the activation for class “`usd`” is positive (last in the array), but the softmax also correctly brings the class “`5`” up, together with some doubt about the class `20`.
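
To make that logic concrete, here is a rough sketch of how the raw activations can be turned into the “summary” and “others” fields above. The threshold and formatting are my own illustrative choices, not necessarily the exact deployed code:

import torch
import torch.nn.functional as F

classes = ['1', '10', '100', '20', '200', '5', '50', '500', 'euro', 'usd']

def interpret(raw):
    """raw: 1D tensor of raw activations, one value per class."""
    probs = F.softmax(raw, dim=0)
    # Positive raw activations -> confident predictions (the 'summary').
    summary = [c for c, r in zip(classes, raw.tolist()) if r > 0]
    # Negative activations that still get visible softmax mass -> 'tentative alternatives'.
    ranked = sorted(zip(classes, raw.tolist(), probs.tolist()), key=lambda t: -t[2])
    others = {c: f'{p:.2%}' for c, r, p in ranked if r <= 0 and p >= 0.0001}
    return {'classes': classes,
            'softmax': [f'{p * 100:.2f}' for p in probs.tolist()],
            'output':  [f'{r:.2f}' for r in raw.tolist()],
            'summary': summary,
            'others':  others}

On the example above, this kind of logic recovers “usd” as the confident class and “5”/“20” as the main alternatives.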

Deployment

This was the hard part.

Basically you need two parts: the front-end and the back-end.

  • The front-end is what people see: it serves the page (I use Bootstrap for the UI), the code to select an image, and finally displays the result. I added some JavaScript to downsample the image on the client, since camera pictures are quite heavy nowadays and all the inference process needs is a 256-pixel image. These are the 11 lines of code to downsample on the client. Since this is all static code, I host it with GitHub Pages on the same repository. To send the image to the server I pass it directly as a data URI instead of uploading it somewhere and then pulling it from there.
  • The back-end is the part that receives the image, runs the inference code with our model, and returns the results. It’s the hard part of the hard part :), see below:

I first used Google Cloud Engine (GCE), as instructed here. My deployment code is here, and it includes code to upload and save a copy of the user images with the inferred class, so I can check false classifications and use them for further training. Quite neat.

Overall it was very easy to deploy. It basically creates a Docker container with whatever code you need, and spins up instances as needed. My problem was that the server is always running, with at least 2 instances. GCE is meant for very high scalability and responsiveness, which is great, but it also meant I was paying all the time, even if no one was using it. I think it would have been $5–10/month. If possible I wanted to deploy something that can remain online for a long time without costing much.

I decided to switch to AWS Lambda (course instructions here). The process looks more complicated, but it’s actually not that hard, and the huge benefit is that you only pay per use. Moreover, at this usage level, we will be well within the free tier (except the cost of keeping the model on S3, which is minimal). My code to deploy is here. Since you are deploying a TorchScript model, you just need the PyTorch dependencies, and AWS has a nice Docker image with all that you need. I had to add some Python libraries for formatting the output and logging, and they were all there. That means your actual Python code is minimal and you don’t need to bring fastai (on the course thread another student shared her deployment tricks IF you also need to bring fastai to the deployment).
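
For the curious, this is roughly the shape of the Lambda handler. Everything here (field names, the data-URI format, the warm-up “ping” discussed in the next section) is a simplified sketch of my setup rather than the exact deployed code:

import base64, io, json

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# Load the TorchScript model once, outside the handler, so warm invocations reuse it.
model = torch.jit.load('banknotes.pt').eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),  # ImageNet stats
])

def handler(event, context):
    body = json.loads(event.get('body') or '{}')

    # Warm-up ping from the client page (see the "UX, response time" section below).
    if body.get('action') == 'ping':
        return {'statusCode': 200, 'body': json.dumps('pong')}

    # The client sends the (already downscaled) image as a data URI.
    data_uri = body['image']
    img_bytes = base64.b64decode(data_uri.split(',', 1)[1])
    img = Image.open(io.BytesIO(img_bytes)).convert('RGB')

    with torch.no_grad():
        raw = model(preprocess(img).unsqueeze(0)).squeeze()
    probs = F.softmax(raw, dim=0)

    return {'statusCode': 200,
            'body': json.dumps({'softmax': [f'{p * 100:.2f}' for p in probs.tolist()],
                                'output':  [f'{r:.2f}' for r in raw.tolist()]})}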

UX, response time.

Inference itself takes roughly 0.2 seconds, which is really fast, but the overall time for the user, from selecting the image to getting the result, can be up to 30s, or the request can even fail. The extra time is partly spent downscaling the image on the client if needed and uploading it to the server.

In real-life tests, the median response time was 1s, which is acceptable… except for the first call, which sometimes took up to 30s. I think this is called a “cold start”, and corresponds to the time AWS needs to pull the Lambda function from storage. To minimize this impact I added some code that triggers a ping to the server as soon as you load the client page. That ping just returns “pong”, so it doesn’t consume much billing time, but it does the trick of getting AWS to have the Lambda function ready for the real inference call.

Advocacy

This summer I have a small weekly slot to talk about Impact Science on Spanish national radio, and I dedicated one episode to Artificial Intelligence and its impact on employment and society. I presented this tool as an example. You can listen to it (in Spanish, timestamp 2h31m): Julia en la Onda, Onda Cero.

Next steps

I’d love to get your feedback and ideas. And if you try to replicate it and have problems, let me know.

These are some points already on my mind for the next sprint:

  • Re-train the model using a mobile-friendly architecture like “MobileNetV2”.
  • Re-train the model using as many currencies (and coins) as possible. The benefits of multi-label classification for detecting the denomination should become more visible as more currencies are added.
  • Add server code to upload a copy of the user images, as I did with the GCE deployment.
  • Smartphone apps with on-device inference.


Impactscience.dev CEO. Former rocket scientist and World Banker. Working on the science of getting to radical breakthroughs.