Forays into Deep Learning: Character Recognition from Scratch

Analyzing Kuzushiji characters using a custom CNN.

Joraaver Chahal
Towards Data Science


Note: For all the articles in this series, I’m following fast.ai’s courses on Practical Deep Learning and heeding Jeremy Howard’s advice: to work on projects and write about them.

This is a write-up of my experience working towards an entry in the Kuzushiji Recognition Kaggle Competition. While the goal was to eventually submit entries to the competition, I used the competition data to build a character recognition network from scratch. The dataset looked like a nice bold twist on the regular MNIST dataset — MNIST (Modified National Institute of Standards and Technology) is a database of over 60,000 handwritten digits used for recognition tasks.

Here, I discuss the first part of the project: creating a Kuzushiji character recognition model. I’ve skipped the setup of the data, since that required work specific to how the data was initially presented. Nonetheless, the code is available in my GitHub repo, the link for which is at the end of the article.

Kuzushiji Characters

When I started the project, the first thing I wanted to know was “How many characters do I need to recognize?” A CSV of each character and its Unicode representation was provided, so I loaded that and took a peek:

import pandas as pd

unicodes = pd.read_csv(unicode_csv_path)  # provided CSV of characters and their Unicode code points
unicodes.shape
# (4781, 2)

There were 4781 characters! Further inspection revealed that a lot of those characters were symbols unrelated to the Kuzushiji character set. After filtering those symbols out while creating the dataset, I was left with 4213 subfolders. Each folder contained samples of a character that had appeared at least once in the dataset.

Before diving in, let’s take a look at some of the images I had to handle:

Kuzushiji characters with unicode labels extracted from the dataset (image by author)

That’s how I saved the images, but here is an example of the original image, unaltered:

Single Kuzushiji character (image by author)

The only difference between the before and after images is a grayscale transformation.

Loading the data

Loading the data was a challenge, but not in the way I expected. Here’s the full implementation of the databunch creation using the data block API:

db = ImageList.from_folder('./char_images') \
.filter_by_func(filter_too_many) \
.use_partial_data(0.3) \
.split_by_rand_pct(0.3) \
.label_from_folder() \
.transform([[binarize()], [binarize()]], size=(32,32)) \
.databunch(bs=16)

I’m going to dissect this one line at a time.

First, I created an ImageList from the char_images folder. Next, I filtered the data with a custom function, filter_by_func(filter_too_many), which has a short implementation:

def filter_too_many(filename):
    # The copy index sits between the last '_' and the file extension;
    # keep only the first 500 copies of each character.
    num = filename.parts[-1].split('_')[-1].split('.')[0]
    return int(num) < 500

To answer the question “Why did you do this?” we need to take a look at the data available. Here’s a histogram of the distribution of the counts of each character.

When I plotted this, the problem became evident. There were about 100 characters with more than 1,000 samples each. The other 3,212 characters had far fewer. In fact, a lot of them only had 1 or 2 samples. It became clear to me that this was a very imbalanced dataset.

To put this in perspective, here’s a link to the distribution of English characters in a 40,000 word dataset. Here’s a link to the frequency of characters in Chinese. In English, the E to Z ratio is 171 to 1. The link to the Chinese character set shows there are at least 9000 characters. Kuzushiji has half that character count. To sum up, Kuzushiji isn’t special, and when dealing with character samples, this distribution is to be expected.

Imbalanced Data

Imbalanced data is a well-studied problem in the data science ecosystem. High-quality, well-balanced data is essential to training and validating models.

In handling imbalanced datasets, you need a strategy of oversampling and/or undersampling. You oversample the minority classes to match the quantities of the majority. Undersampling is less common, since it equates to throwing away data. To simplify my problem, I oversampled the minority characters by creating 500 copies of them. Then, for the larger labels, I undersampled those character classes down to 500 as well. That’s the responsibility of the filter_too_many function: it filters out any copies above 500 by filename.
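As a rough illustration, the oversampling step amounts to copying each character image until its folder holds 500 files. Here’s a minimal sketch, assuming one subfolder per character under char_images, a .png extension, and the <folder name>_<index> naming that filter_too_many parses — none of which is spelled out above:

import shutil
from pathlib import Path

TARGET = 500  # desired number of samples per character

for char_dir in Path('./char_images').iterdir():
    if not char_dir.is_dir():
        continue
    originals = sorted(char_dir.glob('*.png'))
    if not originals:
        continue
    i = len(originals)
    while i < TARGET:
        src = originals[i % len(originals)]          # cycle through the real samples
        dst = char_dir / f'{char_dir.name}_{i}.png'  # hypothetical naming scheme
        shutil.copy(src, dst)
        i += 1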

A small caveat about the approach I’ve taken: for the characters with only one or two samples, I created 500 copies of each, with zero alterations. After this, I treated the data as if it were perfect. This wasn’t ideal, as I learned from this article on the subject. Those copies appeared in both the training and the validation data, which likely inflated the model’s measured accuracy, since the images in the training and validation sets for those characters were identical. Techniques such as data augmentation can greatly reduce this danger, because every copy becomes different in some subtle way.

Handling a Large Dataset

Compared to other datasets out there, mine wasn’t actually that large. Yet, on my own machine, with an NVIDIA GeForce GTX 1080, training was a bit of a chore. Thus, for my final iteration, I only used 30% of the data, implemented via use_partial_data(0.3). Just to get to this stage, I went through various versions of my databunch and model, which I will discuss later in this article. For now, it’s enough to know that my patience was wearing thin trying to train the model and waiting to see the results, so I stuck to 30%. That way, I could leave the model training overnight and see results on my desktop the next morning.

Transforms

The next two lines are straightforward — I split the data into 70% training and 30% validation, and I labeled the images based on the folder each file was in. The next line, transform([[binarize()],[binarize()]], size=(32,32)), was where I spent some time. I wanted to understand how fastai’s transform API worked, and see if I could create one for my own dataset.

The Kaggle competition site noted that some pages had faint outlines of the characters on the other side of the page showing through. The model would need to recognize the foremost characters only, and not the fainter markings of the characters below. I decided that there must be, for each character, a pixel threshold that could differentiate background from character. It turned out that the idea worked well, removing faint background outlines and leaving the character well defined. Sometimes there were artifacts, but that depended on the original quality of the character image. Here’s an example of a before and after:

Example image from before, binarized. Notice the artifacts in the top and bottom right

There were some intricacies I wasn’t expecting when I wrote the code for this. Look at the code for the function, and compare that again to the line where it was called:
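Here is a minimal sketch of that function, based on the description below — converting the tensor to a numpy array, calling an OpenCV function, and rebuilding the tensor. The use of Otsu’s thresholding is my assumption; the exact threshold logic isn’t reproduced here:

import cv2
import numpy as np
import torch
from fastai.vision import TfmPixel

def _binarize(x):
    # x: pixel tensor in [0, 1] with shape (channels, height, width)
    img = (x.numpy() * 255).astype(np.uint8)
    # Otsu's method (an assumption) picks a per-image threshold separating ink from paper
    channels = [cv2.threshold(c, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1] for c in img]
    # recreate the torch tensor in the shape and range fastai expects
    return torch.from_numpy(np.stack(channels).astype(np.float32) / 255)

binarize = TfmPixel(_binarize)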

I first created a transform function, _binarize. Its implementation consisted of converting the tensor into a numpy array, using an opencv function, and then recreating the torch tensor. After adequately testing the function, I instantiated a TfmPixel with it. Finally, I passed the invoked transform inside a list, and repeated the process twice. It appears that TfmPixel and the Transform class have a defined __call__ function, which allows the instance to act as a function as well — this is why binarize() was valid. On top of that, for custom transforms, fastai expects one list of transforms to apply on the training set and one list of transforms to apply on the validation set. That’s why, in the code, I put binarize() in both lists. This was one of the few parts of the fastai API that wasn’t immediately intuitive to me.

Batch size and Normalize

To wrap it up, I finished creating the databunch by passing a batch size of 16 and normalizing the data. As I stated above, these are also variables that I played with in an attempt to improve training speed, which I’ll touch on later.

Creating the CNN

It would have been fairly easy for me to take what I did in my previous article and apply transfer learning to this problem, but I didn’t. I built my model from scratch for two reasons: first, so I could learn how to properly write out the layers and interactions in the model; second, to understand how changes in the model affected the final results. This is what my final model looked like:
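The following is a minimal sketch reconstructed from the channel counts and activation shapes discussed below — the starting channel count, the input channel count, and the dropout probability are assumptions, since the exact code lives in the repo:

from fastai.vision import *   # brings in conv_layer, PoolFlatten, bn_drop_lin, nn

def _conv2d_layer(ni, nf, stride=1):
    # conv_layer bundles Conv2d, ReLU and BatchNorm with fastai's defaults
    return conv_layer(ni, nf, stride=stride)

model = nn.Sequential(
    _conv2d_layer(3, 64),               # 64 x 32 x 32
    _conv2d_layer(64, 128, stride=2),   # 128 x 16 x 16
    _conv2d_layer(128, 256, stride=2),  # 256 x 8 x 8
    _conv2d_layer(256, 512, stride=2),  # 512 x 4 x 4
    PoolFlatten(),                      # adaptive average pool + flatten -> 512
    *bn_drop_lin(512, 4212, p=0.5),     # batchnorm, dropout, linear head -> 4212 classes
)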

Convolution, ReLU, BatchNorm

These three layers contain most of the magic I used in my network. But my model didn’t start off looking like the code above. Instead, I started small, with just one or two layers, and worked my way up. In fact, I wrote the small network entirely in PyTorch first, and then rewrote it in fastai. This forced me to dig into the source code for simple things, such as “what’s the default stride size for the convolution filter” or “how does fastai’s conv_layer combine all these layers into one, and what defaults does it pass?”

I wrote the function _conv2d_layer to let me quickly decide how many channels I wanted to start with and not worry about the channel counts in the intermediate layers. After every stride 2 convolution, my network doubled in channels. I made this decision after looking at ResNet18, since I didn’t want to stray from what well-tested architectures used. The same applies to the number of channels I used. But, even as I wrote this article, I changed the model repeatedly, since new ideas kept popping up. In the final model, the architecture cuts off at 512 channels before going through the adaptive pooling layer. This was the final architecture, fully expanded, through the lens of the outputs:

In his fast.ai course, Jeremy discusses the specific order of using a convolution, followed by a rectified linear unit, or ReLU for short, followed by a batch normalization layer. A ReLU zeroes out negative activations; in math, it’s just max(0, x), where x is the activation. Batch normalization has a lot going on, and I won’t attempt the full math here, but according to Jeremy, the gist of it comes down to something akin to y = ax + b, where batch normalization adds the a and b terms to shift the mean and variance to where we want them to be. Order matters, and intuitively, a batch norm layer after the ReLU means the work of shifting the mean and variance doesn’t go to waste, since those activations won’t then be zeroed by the ReLU.

It’s very important to note that, if you are reading the fastai v1 documentation, it says that the custom conv_layer “returns a sequence of nn.Conv2D, BatchNorm and a ReLU”. This implies that the batch normalization is done before the ReLU. But, the source code tells a different story, one of a ReLU before batch norm. I discovered this because I was very bothered that the sequence implied by the docs was different than what my gut was telling me. Intuition and a little “trust but verify” go a long way.
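In plain PyTorch, the order the source code actually builds looks roughly like this (a sketch of the sequence, not fastai’s exact conv_layer source):

import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False),  # convolution
    nn.ReLU(inplace=True),                                               # ReLU
    nn.BatchNorm2d(128),                                                 # batch norm last
)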

The Head

After customizing the convolution layers to what I felt worked the best, I was left with the simple task of mapping those activations to 4212 outputs for the 1-hot vector that my network would use to predict the outcome. This is referred to as the “head” of the network. As an example, when using a pre-trained model, it is often desirable to swap out the head of the model for a set of layers more specific to the task. The weights are still all there for the convolution layers of the network, but the final layers need a little training.

I kept my final layers straightforward, applying a pooling layer followed by batch normalization, dropout, and fully connected linear layers. The code above uses PoolFlatten and bn_drop_lin, both of which are fastai functions.

PoolFlatten creates an AdaptiveAvgPool2d layer first. Then, it tacks on a flatten to create a rank one tensor of size 512. Remember, before this, the architecture demonstrates how the convolutions are reducing the size of the input in width and height while increasing the number of channels. The activations go from 128x16x16 to 256x8x8 to 512x4x4. Then the pooling occurs, creating a rank one tensor of size 512. I’m still learning the particulars of adaptive average pooling, including when to use it versus max pooling or both. Regardless, at the end of a CNN, some kind of pooling is advised to turn the activations into the proper output shape for the last layer.
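Roughly, that combination behaves like the following sketch in plain PyTorch (a paraphrase of the idea, not fastai’s source):

import torch
import torch.nn as nn

pool_flatten = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())

x = torch.randn(16, 512, 4, 4)   # a batch of 512x4x4 activations
print(pool_flatten(x).shape)     # torch.Size([16, 512])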

bn_drop_lin is another fastai treat that is better explained by the source code than words. The only thing I’ll add is that having the dropout layer reduces overfitting, which I know my model probably suffers from due to the characters that only had one sample in the dataset.
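For reference, bn_drop_lin(512, 4212, p=0.5) returns roughly the following list of layers (a paraphrase; the dropout probability of 0.5 is an assumption):

import torch.nn as nn

layers = [
    nn.BatchNorm1d(512),    # normalize the pooled activations
    nn.Dropout(p=0.5),      # dropout to fight overfitting
    nn.Linear(512, 4212),   # map to one output per character class
]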

While you don’t see it, a final softmax layer exists in the network too. Fastai is smart enough to know that the model wants a softmax layer in the end to create a proper probability distribution. It knows this due to the nature of the output, a one-hot encoded vector, and the loss function, cross entropy, which is used for classification tasks.
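A quick illustration of that relationship in plain PyTorch (a sketch): cross entropy applies a log-softmax internally during training, and softmax is what turns the raw outputs into probabilities at prediction time.

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(16, 4212)                  # raw outputs from the head
probs = F.softmax(logits, dim=1)                # probability distribution over the characters
targets = torch.randint(0, 4212, (16,))         # ground-truth class indices
loss = nn.CrossEntropyLoss()(logits, targets)   # log-softmax + negative log likelihood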

Training the CNN

This was a very time-consuming process. I’ll walk through everything I experimented with, but unfortunately, this is a log from my memory, not from a notepad or journal. Next time I prepare to do this, I need to write it down in a separate place so I can look over all the variables I’ve tried.

Optimizing GPU Usage

Initially, my training was taking two hours or more per epoch, and with an NVIDIA GeForce GTX 1080, I knew I was leaving something on the table. Sure enough, I was seeing only 3–10% GPU usage when looking at gpustat -cp -i. Originally, I was using a batch size of 128 or 256, because I wasn’t running out of memory. After reading plenty of forum posts, I realized that maybe memory wasn’t the bottleneck; the images going through my binarize transform were. Plenty of authors suggested using optimized image libraries, but I thought that was too heavy-handed for my simple task. Instead, I first took a much smaller slice of my data. Then, I drastically reduced my batch size. Finally, I repeatedly reran fit_one_cycle, not caring about the end result — I just ran it, observed speed, memory usage, and GPU usage, then stopped it and repeated.

GPU at max usage (yes I’ve named my desktop smallrig)

For a 64x64 image, I started seeing good usage of my GPU with a batch size of 4. With a 32x32 image, a batch size of 16 seemed to do the trick, and it went a lot faster.
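Put together, those throwaway timing runs looked something like this sketch (the partial-data fraction and the Learner setup are assumptions; the size and batch-size pairs are the ones mentioned above):

from fastai.vision import *

for size, bs in [(64, 4), (32, 16)]:
    db = (ImageList.from_folder('./char_images')
            .use_partial_data(0.05)                 # tiny slice, just for timing
            .split_by_rand_pct(0.3)
            .label_from_folder()
            .transform([[binarize()], [binarize()]], size=(size, size))
            .databunch(bs=bs))
    learn = Learner(db, model, metrics=error_rate)  # model: the CNN sketched earlier
    learn.fit_one_cycle(1)                          # watch speed, memory, and gpustat, then stop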

At first, I thought that I needed a lot more final activations before mapping them to the 4212-way one-hot encoded tensor, so I forced the ending channel count up to 1024. But without an adequate starting number of channels, the image was so small that the model started doing stride 1 convolutions on a 1x1 pixel. That didn’t make any sense. I had to bump up the initial number of channels so that the input activations weren’t reduced to 1x1 until the final stride 2 convolution. But another issue cropped up. After training, the error rate was around 3%, with a training loss of 0.02 and a validation loss of 0.15. Plotting the losses wasn’t pleasant; the validation loss had not been decreasing in any kind of steady fashion even though the training loss had. Obviously I was overfitting, even with the dropout layer. So I backed off the channels.

Thus, I stuck with 32x32 images, a batch size of 16, 512 final channels for the model, and a final 512x4x4 tensor of activations before feeding it through the head to get the character probabilities. This model, when trained, did well.

Actual Training

Not too much to describe here, just a little bit of code and pictures. First, I tried to find the best learning rate:

learn.lr_find()
learn.recorder.plot(skip_end=20)

A great graph. Originally, I was discouraged, because my initial graph looked very flat, with only a small trough at the end before a very large spike. However, as I learned from the forums, I needed to cut out the tail end of the graph. Without that, the loss explosion makes the y-axis balloon, which in turn makes the rest of the graph insignificant. Retrying it with the skip_end=20 param worked really well. So well, in fact, that I couldn’t resist making this meme:

I could do this for weight initialization too. Or when a hard concept finally clicks. Don’t get me started.

Then, I fit the model for some time:

learn.fit_one_cycle(
    7, slice(1e-3),
    callbacks=[SaveModelCallback(learn, every='epoch', name='30pct32sz16bsN_model')]
)

I was very excited to see an error rate of 0.017. Taking a look at the loss chart:

learn.recorder.plot_losses()

The final losses were 0.003 for training and 0.084 for validation. Remember, a lower loss on the training set compared to the validation set is expected and healthy, as long as the difference isn’t vast and both are still trending downward. When I looked at the table of epochs, I was surprised to see that the error rate, validation loss, and training loss were all still decreasing, so there may have been room for more epochs. At this point, though, I was satisfied with the results.

I tried to look at the top losses too, but whenever I ran

interp = ClassificationInterpretation.from_learner(learn)

I ended up with this:

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 3870777456 bytes. Error code 12 (Cannot allocate memory)

And sure enough, my RAM always looked like this, post attempt:

Running out of RAM (Memory near 100%)

I have yet to investigate what I needed to do to get this to work. Maybe I needed to delete objects, stop running Discord, etc. Anything to claw back extra memory. Or just do the simple thing, and buy more RAM.

Heatmap and Hooks

Jeremy does a cool thing in one of his lessons where he takes a look at the heatmap of the final convolution and overlays it on the image of a cat his model has predicted is a Maine Coon. I decided to try this as well, since it’s a very gentle introduction to the power of callbacks that fastai provides. This article is getting pretty long, so I’ll spare you the code (it’s always available on my GitHub) and provide a picture instead.

Heatmap of the final convolution over the character it was tasked with recognizing

The heatmap did a pretty good job of outlining the details of the character. It missed the top right aspect of the character, but that’s all. Remember that this heatmap was generated from the final convolution. Thus, it’s only a 4x4 image produced by averaging over the 512 channels of the 512x4x4 tensor. That image was then interpolated to fit over a 32x32 image.
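For the curious, the gist of that computation in plain PyTorch is something like this sketch — not the notebook’s code, which lives in the repo. Here, final_conv_block and img are placeholders for the model’s last convolutional block and a single transformed image:

import torch
import torch.nn.functional as F

activations = {}
def hook_fn(module, inp, out):
    activations['final_conv'] = out.detach()            # stash the 512x4x4 feature map

handle = final_conv_block.register_forward_hook(hook_fn)  # placeholder: last conv block of the model
model.eval()
with torch.no_grad():
    model(img[None])                                    # placeholder: a single 3x32x32 tensor

acts = activations['final_conv'][0]                     # 512 x 4 x 4
heatmap = acts.mean(dim=0)                              # average over the channels -> 4 x 4
heatmap = F.interpolate(heatmap[None, None], size=(32, 32),
                        mode='bilinear', align_corners=False)[0, 0]
handle.remove()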

End: Lessons Learned

Creating a CNN was actually the easiest part of this character recognition exercise. PyTorch’s and fastai’s APIs were both very easy to work with. In the cases where I wanted to learn more, viewing the fastai source code was usually all I needed to do. I loved being able to write the code in PyTorch, then rewrite it in fastai, all the while understanding everything I was writing. Of course, for some layers like PoolFlatten, I need to go back and make sure I understand the finer details, but I still had an intuitive understanding of what it was and why it fit.

Experiments are hard work. I should have been disciplined about tracking every variable I changed and understanding its impact on the performance of the network. I wish I had written down in the README each combination I tried before settling on the approach that maximized speed. And that doesn’t begin to account for the changes to the network to see how the model’s error rate was affected, since I needed several epochs to see that. The one thing that helped was the ability to quickly spin up a notebook, test a function or two, and copy it to my main notebook for application.

It was fun and rewarding. MNIST gets boring. Kuzushiji characters were what I was expecting them to be — a bit more challenging. I spent about three weeks flipping through the videos, reading documentation, looking at the data, etc. In my previous article, I mentioned the 80/20 split between data prep and training. Because of the time it took to train, this project was much more of a 40/60 split. Manipulating the model’s parameters to see what worked and what didn’t occupied the majority of my time.

As always, my code is available in my GitHub repo. If you do look at the code, I implore you, please glance at the README so you aren’t completely confused by the several notebooks I have.

Thanks for making it to the end! Comments and questions are welcome.


By day, a software engineer. By night, an artificial intelligence enthusiast, a reader, a gamer, an amateur saxophonist, and an even more fledgling writer.