Word Predictor from Handwritten Text
An implementation of an image-to-word predictor in R.
It’s been a long time since I contributed to the community. I am back to give what was due. But before that, let me tell you what I was up to all this time. Professionally, the highlights of these months have been two things. One, I spoke at a data science conference in March (the Mumbai edition of WiDS). Two, I became an open source contributor and got a pull request merged into the NumPy package. Yayyy!
Okay, let’s get started with some machine learning. I will briefly talk about the modelling process and point you to the GitHub repo which hosts all the files and code for implementing it.
The task is to create a model that identifies 5-letter English words from handwritten text images. These words are created using letters from the EMNIST dataset, a set of handwritten characters and digits converted to a 28x28 pixel image format and a dataset structure that directly matches the MNIST dataset. You can get details about it from here.
To solve this problem, we will build two models.
- Image Model: A word predictor which takes in an image and guesses what is written in it by carrying out image recognition with a Multi-Layer Perceptron (MLP).
- Text Model: This supplements the image model. Its task is to give the probability of the next character based on the previous character, using a Hidden Markov Model. For example, what is the probability of an ‘e’ after the letter ‘n’, ‘a’ or ‘x’?
Please refer to this GitHub repo to follow the steps below:
Character_Recognition - Using MLP for identifying words from images (github.com)
1. Download the dataset from EMNIST website:
- Scroll down and click on ‘Matlab format dataset’
- A ‘Matlab.zip’ will get downloaded. Unzip it and you will find 6 MATLAB (.mat) files, which Windows may label as MS Access files because of the extension. We are using the ‘emnist-byclass’ dataset for this task because it has all the characters
- Read this dataset into R the first time using the ‘R.matlab’ library
- Save the dataset in R format (Rdata) after you load it, so that later runs can load it directly
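The read-once-then-cache steps above can be sketched roughly as follows. This is only an illustration, not the code from the repo: the function name `cache_emnist` and both file paths are my own placeholders, and it assumes the ‘R.matlab’ package is installed.

```r
# Sketch: parse the MATLAB file once, then reuse a cached Rdata copy.
# cache_emnist() and both default paths are hypothetical placeholders.
cache_emnist <- function(mat_path = "matlab/emnist-byclass.mat",
                         rdata_path = "emnist-byclass.Rdata") {
  if (file.exists(rdata_path)) {
    load(rdata_path)                       # restores the object `emnist`
  } else {
    emnist <- R.matlab::readMat(mat_path)  # slow: parses the full .mat file
    save(emnist, file = rdata_path)        # later runs take the fast branch
  }
  emnist
}
```

Calling `emnist <- cache_emnist()` a second time skips the slow MATLAB parsing and loads the Rdata file instead.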
The ‘emnist-byclass’ dataset contains 697,932 labeled images of size 28 x 28 in the training set and 116,323 labeled images in the test set. The images are represented in gray-scale, where each pixel value, from 0 to 255, represents its darkness level.
Testing Set: Download the file ‘emnist words.Rdata’ from the GitHub repo. The file contains 1000 example words of length 5. This test set was created from letter-by-letter images in the EMNIST dataset. Variable X contains the images of the words, variable y contains the true words (labels), and variable y_hat contains predictions by a 2-layer neural network combined with a 2nd-order Markov model.
Dataset Format for Prediction: We need to get the raw input data into a particular format before calling the prediction function. The input X that the prediction function takes is of size 28n x 28L, where n is the number of input samples and L is the length of the words (for example, for 100 words of length 5, X will be of size 2800 x 140). Each word therefore occupies a band of 28 rows, with its L character images placed side by side. The predictor function is defined in the ‘Word_Predictor_Champ.R’ file.
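To make the 28n x 28L layout concrete, here is a small sketch that cuts such a matrix into one row of 784 pixels per character. The function name `split_into_chars` is hypothetical, and flattening each tile column by column is an assumption on my part; whatever order you use must match the order the image model was trained on.

```r
# Split a word-image matrix X (28n x 28L) into one 784-pixel row per character.
# split_into_chars() is a hypothetical helper, not the repo's function.
split_into_chars <- function(X, L = 5, px = 28) {
  n <- nrow(X) / px
  stopifnot(n == round(n), ncol(X) == px * L)
  chars <- matrix(0, nrow = n * L, ncol = px * px)
  for (i in seq_len(n)) {
    rows <- ((i - 1) * px + 1):(i * px)      # the 28-row band of word i
    for (j in seq_len(L)) {
      cols <- ((j - 1) * px + 1):(j * px)    # the 28 columns of character j
      # as.vector() flattens the 28 x 28 tile column by column (assumed order)
      chars[(i - 1) * L + j, ] <- as.vector(X[rows, cols])
    }
  }
  chars
}

X <- matrix(0, nrow = 28 * 100, ncol = 28 * 5)  # 100 blank words of length 5
dim(X)                    # 2800 140
dim(split_into_chars(X))  # 500 784
```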
2. Download support files from github:
‘EMNIST_general_functions.R’- This file has helper functions for accessing the EMNIST dataset, calculating the error, and creating new words from the EMNIST dataset.
‘EMNIST_run.R’- This is the code execution file. Put all the files from the GitHub repo in a directory of your choice and run the ‘EMNIST_run.R’ file. It reports the character accuracy and the word accuracy so you can evaluate the model.
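The two evaluation metrics are easy to confuse, so here is a minimal sketch of how they differ. The function names are my own, not the ones in the repo: character accuracy counts matching letters across all words, while word accuracy only counts a word when every letter is right.

```r
# Hypothetical helpers illustrating the two metrics on vectors of words.
char_accuracy <- function(y, y_hat) {
  truth <- unlist(strsplit(y, ""))      # all true characters, in order
  guess <- unlist(strsplit(y_hat, ""))  # all predicted characters, in order
  stopifnot(length(truth) == length(guess))
  mean(truth == guess)
}
word_accuracy <- function(y, y_hat) mean(y == y_hat)

y     <- c("never", "house", "apple")
y_hat <- c("nevcr", "house", "apple")
char_accuracy(y, y_hat)  # 14 of 15 characters are correct
word_accuracy(y, y_hat)  # 2 of 3 words are correct
```

A single wrong letter costs little in character accuracy but a whole word in word accuracy, which is why the second number is always the lower (or equal) one.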
3. Build the Models:
The ‘Code_image_text.R’ file has the code for training the image and text models. The files ‘EMNIST_Model_Champ.h5’ and ‘TEXT_Model_Champ.Rdata’ are created using the code in this file; they are the image model and the text model respectively. Let’s talk about the code in here:
Installing the requirements:
We will be using tensorflow to build the multilayer neural network for image recognition. You only need to install it once; after that, just call the library() functions from the code to load the requirements into your environment.
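A one-time setup could look something like the sketch below. It assumes the CRAN ‘keras’ package is used as the R front end to TensorFlow; the wrapper name `setup_keras` is hypothetical, and the repo may install things differently.

```r
# One-time setup sketch; assumes the CRAN 'keras' package as the front end.
# setup_keras() is a hypothetical wrapper so that nothing installs by accident.
setup_keras <- function() {
  if (!requireNamespace("keras", quietly = TRUE)) {
    install.packages("keras")
  }
  keras::install_keras()  # sets up TensorFlow in a Python environment
}

# setup_keras()    # run once per machine
# library(keras)   # run at the top of every later session
```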
Preparing the Data for Image Model:
The EMNIST data we downloaded already has separate train and test datasets. Read in the data using the EMNIST_read() function defined in the file with general functions. Since every letter or digit in the dataset is a label and not a continuous value, let us also convert y into a categorical variable with 62 classes (the emnist_byclass dataset): 10 digits, 26 uppercase letters and 26 lowercase letters.
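Converting y into a categorical variable means one-hot encoding it. The sketch below does this in base R so you can see what happens; it assumes the labels are integers from 0 to 61, and `one_hot` is a hypothetical helper. If you use keras, its `to_categorical()` function does the equivalent job.

```r
# One-hot encode integer labels 0..61 into a matrix with 62 columns.
# one_hot() is a hypothetical helper; labels starting at 0 is an assumption.
one_hot <- function(y, num_classes = 62) {
  stopifnot(all(y >= 0), all(y < num_classes))
  out <- matrix(0, nrow = length(y), ncol = num_classes)
  out[cbind(seq_along(y), y + 1)] <- 1  # R is 1-indexed, labels start at 0
  out
}

y <- c(0, 10, 61)  # three made-up labels
Y <- one_hot(y)
dim(Y)             # 3 62
rowSums(Y)         # 1 1 1
```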
Building the Image Model:
We will build a simple multi-layered neural network by alternating dense layers and dropout layers. A dropout layer helps the model generalize by randomly dropping a specified fraction of the units feeding the next dense layer during training.
We have used the Adam optimizer to fit the model, with accuracy as the metric. Finally, we save the model in h5 format. The ‘h5’ format is used because models trained with keras are not saved correctly by R’s ‘save’ command.
The idea here was to build a simple image recognition model using a neural network which takes the image pixel values as features and the respective labels as the target variable.
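For orientation, a network of this kind might be defined as in the sketch below. The layer sizes, dropout rate, loss function and the name `build_image_model` are my assumptions for illustration, not the values from the repo, and the calls assume the 2.x series of the R ‘keras’ package (newer versions changed how models are saved).

```r
# Sketch of an MLP with alternating dense and dropout layers.
# Layer sizes and dropout rates are illustrative assumptions.
build_image_model <- function(n_pixels = 28 * 28, n_classes = 62) {
  model <- keras::keras_model_sequential()
  model <- keras::layer_dense(model, units = 512, activation = "relu",
                              input_shape = n_pixels)
  model <- keras::layer_dropout(model, rate = 0.2)
  model <- keras::layer_dense(model, units = 256, activation = "relu")
  model <- keras::layer_dropout(model, rate = 0.2)
  # One softmax output per class gives a probability for every character
  model <- keras::layer_dense(model, units = n_classes, activation = "softmax")
  keras::compile(model,
                 optimizer = keras::optimizer_adam(),
                 loss = "categorical_crossentropy",
                 metrics = "accuracy")
  model
}

# model <- build_image_model()
# keras::fit(model, x_train, y_train, epochs = 10, batch_size = 128)
# keras::save_model_hdf5(model, "EMNIST_Model_Champ.h5")
```

The commented lines show where training and saving would go; `x_train` and `y_train` stand for the flattened pixel matrix and the one-hot labels.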
Preparing the Data for Text Model:
Just like we prepared data to build the image model, we need to prepare a dataset for text analytics. The attempt here is to train a model which knows English, so that it can supplement the predictions from images. Take the word ‘never’, for example. If the image model predicts an ‘e’ as a ‘c’ because of unclear handwriting, the text model can correct it since it knows the language. The text model will know that, for a 5-letter word, the probability of an ‘e’ after ‘nev’ is higher than the probability of a ‘c’.
To accomplish this, we will take an English corpus (a collection of written texts), build a vocabulary out of it and then have the model learn features from it: specifically, the probabilities of the next character given the previous character.
We have written a function here which does the data cleaning: removing punctuation from the text, filtering 5-letter words, turning all words to lower case, keeping words with only letters (and not digits) and finally, choosing only the unique words from this refined corpus.
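A cleaning function along these lines could be sketched as below. It is a simplified stand-in rather than the repo's function; the name `clean_corpus` is hypothetical and the steps are applied in a slightly different order than listed above, with the same end result.

```r
# Reduce raw text to a vocabulary of unique, lower-case, letters-only words.
# clean_corpus() is a hypothetical helper.
clean_corpus <- function(text, word_length = 5) {
  text  <- tolower(paste(text, collapse = " "))
  text  <- gsub("[[:punct:]]", "", text)          # strip punctuation
  words <- unlist(strsplit(text, "[[:space:]]+"))
  pattern <- sprintf("^[a-z]{%d}$", word_length)  # letters only, exact length
  unique(words[grepl(pattern, words)])
}

clean_corpus("Never say NEVER: the house, the 12345 and the mouse's tail.")
# returns "never" "house"
```

Note how ‘12345’ is dropped for containing digits and ‘mouses’ (after the apostrophe is removed) for having six letters.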
Building the Text Model:
Now that we have a pre-processed corpus, we can use it to learn language features. For this task, we have written a Hidden Markov Model which estimates the probability of X(i+1) given X(i) from frequency tables of adjacent characters. Finally, let’s save this model as the ‘Text_Model_Champ.Rdata’ file.
Note: You can use any text corpus to build this model. As usual, the performance of the model depends on the quality and quantity of the training data provided. On the GitHub repo, I have put an example corpus (War_and_Peace.txt) for your reference/use.
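The frequency-table idea can be sketched in a few lines: count how often each letter follows each other letter, then normalise every row so it sums to 1. The function name `build_text_model` and the add-one smoothing are my assumptions, not necessarily what the repo does, and the input is assumed to be already cleaned lower-case words.

```r
# Estimate P(next letter | previous letter) from adjacent-letter counts.
# build_text_model() is hypothetical; add-one smoothing is an assumption.
build_text_model <- function(words, alphabet = letters) {
  counts <- matrix(0, nrow = length(alphabet), ncol = length(alphabet),
                   dimnames = list(prev = alphabet, nxt = alphabet))
  for (w in words) {
    ch <- strsplit(w, "")[[1]]
    for (i in seq_len(max(length(ch) - 1, 0))) {
      counts[ch[i], ch[i + 1]] <- counts[ch[i], ch[i + 1]] + 1
    }
  }
  counts <- counts + 1      # smoothing, so unseen pairs are not impossible
  counts / rowSums(counts)  # each row becomes a probability distribution
}

m_text <- build_text_model(c("never", "seven", "event"))
m_text["v", "e"]  # the most likely successor of "v" in this toy corpus
# save(m_text, file = "Text_Model_Champ.Rdata")
```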
4. Predict the words:
The ‘Word_Predictor_Champ.R’ file contains the prediction function, which takes as input a 2D array X with pixel values, a learned model for images m:image, and a learned text model m:text. The function returns a 1D array of string labels y, where each element of the array is a string of length L.
This function first converts the input images into the required matrix of shape (n x L, 28 x 28), that is, one row of 784 pixel values per character. It then uses the image model to get the probability of every character, and the text model to get the probability of each character given the previous one (except for the first character, which has no predecessor). Finally, it combines the probabilities from both models to make the final prediction.
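How the two sets of probabilities are combined is not spelled out here, so the sketch below shows just one plausible way: multiply the image probability of each candidate by its transition probability from the previously chosen character, and pick the best one, left to right. The function name `predict_word`, the product rule and the greedy decoding are all assumptions for illustration.

```r
# p_image: L x K matrix of image-model probabilities for one word
# m_text:  K x K matrix with m_text[a, b] = P(next = b | previous = a)
# Both must list the characters in the same column order.
predict_word <- function(p_image, m_text, alphabet = colnames(p_image)) {
  L <- nrow(p_image)
  idx <- integer(L)
  idx[1] <- which.max(p_image[1, ])  # first character: image model only
  for (i in seq_len(L)[-1]) {
    score <- p_image[i, ] * m_text[idx[i - 1], ]  # image evidence x text prior
    idx[i] <- which.max(score)
  }
  paste(alphabet[idx], collapse = "")
}

# Toy example with a 3-letter alphabet: the image model alone would read "nc"
abc <- c("c", "e", "n")
p_image <- matrix(c(0.05, 0.05, 0.90,
                    0.50, 0.45, 0.05),
                  nrow = 2, byrow = TRUE, dimnames = list(NULL, abc))
m_toy <- matrix(c(0.2, 0.6, 0.2,
                  0.3, 0.2, 0.5,
                  0.1, 0.8, 0.1),
                nrow = 3, byrow = TRUE, dimnames = list(abc, abc))
predict_word(p_image, m_toy)  # "ne"
```

In the toy example the image model slightly prefers ‘c’ in the second position, but ‘e’ is far more likely after ‘n’, so the combined score flips the decision.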
5. Model Improvements:
This is a basic model which gives ~95% accuracy at character level. Below are a few ideas you can try to improve this model:
- Use a much larger text corpus to improve the text model. You can use a Wikipedia corpus to train a better text model, or use a pre-trained model from the web
- Improve the MLP model by adding more hidden layers/hidden units, or use more sophisticated image models like CNNs
- Generalize the model by editing the Code_image_text file. Currently, the code has the restrictions listed below. By editing some hard-coded parts in this file, the model can be made more general purpose.
a. Only predicts on 28 x 28 pixel images from the EMNIST dataset
b. Only predicts and gets trained on 5-letter words
The EMNIST_general_functions file contains a function that helps you create word images from the EMNIST dataset. This function can be used to create different training/testing sets, eliminating the 5-letter word length restriction.
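To give a feel for how word images of other lengths could be generated, here is a minimal sketch that glues character tiles side by side. The function name `make_word_image` is hypothetical and the random tiles merely stand in for real EMNIST characters; the helper in the repo may work differently.

```r
# Build a word image by placing 28 x 28 character tiles side by side.
# make_word_image() is a hypothetical helper.
make_word_image <- function(char_images) {
  ok <- vapply(char_images,
               function(m) is.matrix(m) && all(dim(m) == c(28, 28)),
               logical(1))
  stopifnot(all(ok))
  do.call(cbind, char_images)  # 28 rows, 28 columns per character
}

# Seven random tiles stand in for the letters of a 7-letter word
tiles <- replicate(7, matrix(runif(28 * 28), 28, 28), simplify = FALSE)
dim(make_word_image(tiles))  # 28 196
```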