Train a custom Tesseract OCR model as an alternative to Google Vision for reading children’s handwriting

A Story Squad project

Sylvia Burris
Towards Data Science


The co-founders of Story Squad, Graig Peterson and Darwin Johnson, are pioneering a platform that gives kids an opportunity to be creative and improve their reading and writing while reducing the time they spend on screens. Story Squad keeps the process fun by gamifying it.

Image provided by Story Squad

So, how does the game keep children engaged but not on their screens?

Here is a video explaining Story Squad:

Video by Graig Peterson (co-founder of Story Squad)

What is OCR?
Optical character recognition (OCR) automates the extraction of printed or handwritten text from a scanned document or image file. The text is converted into a form that can be manipulated through data processing. OCR has many applications: data entry for business documents such as cheques, passports, bank statements and receipts; automatic number plate recognition; and traffic sign recognition.
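To make this concrete, here is a minimal example of OCR in Python using the pytesseract wrapper around Tesseract (the file name is illustrative, and Tesseract itself must already be installed):

```python
# Minimal OCR example using the pytesseract wrapper around Tesseract.
# Assumes the Tesseract binary and the Pillow library are installed.
from PIL import Image
import pytesseract

# Load a scanned document or photo (file name is illustrative)
image = Image.open("scanned_document.png")

# Convert the image into a plain string that can be processed further
text = pytesseract.image_to_string(image)
print(text)
```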

Why do we need OCR for this project?

We need to transcribe kids’ uploaded stories and process the transcriptions. In the game, each kid needs to be paired with another kid at a similar writing level, and each team should match up with another team in the same category so that the game stays competitive and fun. This is where the data science team comes in.

  • Using NLP tools to measure the complexity level of each child’s writing, we can group players appropriately (see the sketch after this list).
  • We’ll need OCR to create a word cloud on the parent dashboard, so parents can see what their kids have been writing.
  • We’ll need to monitor each child’s progress (Has their complexity level improved over time? How many words on average are they writing?)
  • We also want to flag inappropriate words. Although a human will review each story, we’ll take every precaution to make the platform a safe space for kids by adding a custom profanity filter.
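As a sketch of the first bullet, a standard readability metric such as the Flesch-Kincaid grade level could be used to group players. The textstat package is one option among many, and the grouping thresholds below are invented for illustration; this is not the project’s actual grouping logic:

```python
# Hypothetical sketch of grouping players by writing complexity.
# Uses textstat's Flesch-Kincaid grade level; the thresholds
# below are made up for illustration.
import textstat

def complexity_group(story_text: str) -> str:
    grade = textstat.flesch_kincaid_grade(story_text)
    if grade < 3:
        return "beginner"
    elif grade < 6:
        return "intermediate"
    return "advanced"

print(complexity_group("The dragon flew over the quiet village at night."))
```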

Tesseract OCR vs Google Vision OCR
Google Vision transcribes handwritten images with an accuracy score of about 80–95%. The problem we face right now is that this service comes with a price tag, and the stakeholders want to reduce these costs.

The DS team is tasked with training a Tesseract OCR model (Tesseract is an open-source OCR engine) as an alternative to Google Vision.

Tesseract OCR model training cycle

Image provided by the author
  1. Data preparation: Data cleaning and labelling

Tesseract OCR takes in segmented handwritten images and their corresponding transcribed texts (the ground truth). Each pair must share the same base name: <name>.tif for the image and <name>.gt.txt for the transcribed text file. The model can also be trained on .png images.

Image provided by the author

“Celina,” she would say, “when the world

If there is a file-naming mismatch, training will be interrupted, and you’ll have to correct the mismatch by either deleting or renaming the affected files. There is a script in the repo (utils/check_groundtruth_files_pair.py) that checks for mismatches, so those doing the data cleanup can correct these mistakes before pushing the data to the repo. Also, due to COPPA, full images of children’s writing are not allowed on GitHub; a second script checks for full images in the data (utils/essaycheck_w_image_height.py).
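The repo script isn’t reproduced here, but a minimal version of such a pairing check might look like the following (the data directory path is an assumption):

```python
# Sketch of a ground-truth pairing check, in the spirit of
# utils/check_groundtruth_files_pair.py (not the actual repo script).
from pathlib import Path

DATA_DIR = Path("data/story-squad-ground-truth")  # assumed location

# .tif stems are the base names (the repo also allows .png)
images = {p.stem for p in DATA_DIR.glob("*.tif")}
# .gt.txt files have names like "name.gt.txt", so strip that suffix
texts = {p.name[:-len(".gt.txt")] for p in DATA_DIR.glob("*.gt.txt")}

for name in sorted(images - texts):
    print(f"Image without ground truth: {name}.tif")
for name in sorted(texts - images):
    print(f"Ground truth without image: {name}.gt.txt")
```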

In the past, data preparation was done in two different ways. Images in folders 32- and 52- were preprocessed by binarizing them and then segmenting them with a command-line script. Folders 31- and 51- (only the last 5 stories) were cleaned and labelled manually using Paint and a manual Python script (not included in the repo); folder 31 was not binarized prior to the clipping. You can find the cleaned and labelled data in the DS repo at data/story-squad-ground-truth.

A data pipeline was developed to semi-automate the data preparation process. You can find it in the data_management folder.
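The pipeline itself lives in the repo; as a rough sketch of its two core steps, binarizing a page and clipping it into line snippets, a common OpenCV approach looks like this (the file names, kernel size and height filter are illustrative, not the pipeline’s actual values):

```python
# Illustrative sketch of binarizing a page and segmenting it into
# line snippets with OpenCV; not the repo's actual pipeline code.
import cv2

page = cv2.imread("story_page.png", cv2.IMREAD_GRAYSCALE)

# Binarize with Otsu's threshold (ink becomes white on black)
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Smear characters together horizontally so each line forms one blob
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
lines = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

# Each connected blob becomes one .tif snippet for labelling
contours, _ = cv2.findContours(lines, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for i, c in enumerate(sorted(contours, key=lambda c: cv2.boundingRect(c)[1])):
    x, y, w, h = cv2.boundingRect(c)
    if h > 10:  # skip specks; the threshold is arbitrary
        cv2.imwrite(f"snippet_{i:03d}.tif", page[y:y + h, x:x + w])
```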

2. Tesseract model training and tuning

  • A GPU box will be set up by the instructor.
  • Download the .pem key, which grants permission to access the EC2 machine.
  • In the terminal, cd into the dir containing the .pem key and run the ssh command to get into the EC2 machine.
  • git clone the DS repo. (If there is new data in the repo, move it to the tesstrain/data/storysquad-ground-truth dir.)
  • In the root directory, run `source training_setup.sh`.
  • cd into tesstrain and activate the virtual environment by running `source ocr/bin/activate`. If it worked, you should have switched to the (ocr) environment.
  • In the tesstrain dir, run this command to train the model: `make training MODEL_NAME=storysquad START_MODEL=eng TESSDATA=/home/ubuntu/tesseract/tessdata`
  • The above command trains the model with the default hyperparameters. To tune a hyperparameter, modify the command line to set it; for example, run `make training MODEL_NAME=storysquad START_MODEL=eng PSM=7 TESSDATA=/home/ubuntu/tesseract/tessdata` to set PSM to 7 instead of using the default value.
  • Note: training takes about an hour to run. If there are any mismatches between ground-truth and snippet files, an error will be thrown; go to the story-squad-ground-truth dir and fix the problem by either deleting or uploading the corresponding file. Once the problem is fixed, re-run the training command and the model will pick up from where the error occurred.
  • Once the model has finished training, the new model will be stored in tesstrain/data; the default name right now is storysquad.traineddata. New models will overwrite old ones unless the old ones are moved to another directory or renamed. (A sketch of using the trained model follows this list.)
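Once training produces storysquad.traineddata, the new model can be exercised from Python via pytesseract. This is a sketch; the tessdata directory below is an assumed path for wherever the .traineddata file ends up:

```python
# Sketch of transcribing a snippet with the newly trained model.
# The tessdata directory is an assumption; point it at wherever
# storysquad.traineddata actually lives.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(
    Image.open("snippet.png"),
    lang="storysquad",  # selects storysquad.traineddata
    config="--tessdata-dir /home/ubuntu/tesstrain/data",
)
print(text)
```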

Because the EC2 instance does not have enough space for Tesseract to train on more data, a Dockerfile (`Dockerfile_tesseract_training` in the DS repo) was set up for future Tesseract training.

3 & 4. Data preprocessing and transcription

Data preprocessing is done before using the new model to transcribe images, and each image requires different preprocessing methods. A few methods are close to universal: you always want to make sure the image is not skewed and that it is binarized in one form or another. Printed text needs little to no preprocessing; Tesseract reads printed text and neatly written words accurately, but it might struggle to transcribe unusual fonts and poor-quality images.
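As an illustration of those near-universal steps, here is a common deskew-and-binarize routine in OpenCV. This is a generic sketch, not the team’s preprocessing code, and the angle convention may need adjusting across OpenCV versions:

```python
# Sketch of the two near-universal preprocessing steps (deskew and
# binarize) using OpenCV; not the team's exact preprocessing code.
import cv2
import numpy as np

img = cv2.imread("snippet.png", cv2.IMREAD_GRAYSCALE)

# Estimate the skew angle from the minimum-area rectangle around the ink
_, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
coords = np.column_stack(np.where(mask > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
if angle > 45:  # recent OpenCV versions report angles in [0, 90)
    angle -= 90

# Rotate the image about its centre to undo the skew
h, w = img.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

# Binarize the deskewed image before handing it to Tesseract
_, binary = cv2.threshold(deskewed, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("snippet_clean.png", binary)
```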

Images provided by the author

The model’s performance was tested using ten stories and their corresponding ground truth (human-transcribed). We also used one neat-handwriting image as a control (this image requires little to no preprocessing).

This stage involves using NLP techniques to improve the accuracy of the transcribed text, and for most OCR applications it is essential. However, the stakeholders want children’s words preserved exactly as written: if a child misspelt a word, NLP post-correction might fix it.
We focussed on evaluating the model using accuracy scores and ran into a problem: the model gave high accuracies for some images it did not transcribe at all. Future cohorts might consider other evaluation metrics, such as cosine similarity, CER (character error rate) and WER (word error rate).
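For reference, both CER and WER reduce to an edit distance divided by the reference length; a self-contained sketch (with a plain Levenshtein implementation, so no extra dependencies) might look like this:

```python
# Sketch of character error rate (CER) and word error rate (WER),
# both ratios of Levenshtein edit distance to reference length.
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

print(cer("when the world", "when th worid"))  # character-level errors
print(wer("when the world", "when th worid"))  # word-level errors
```

Unlike a raw accuracy score, CER climbs towards 1.0 when the model outputs nothing at all, so an empty transcription can never look like a good one.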

Moment of Truth

We made some progress with the model on the neat handwritten image (full_sample2.png): the accuracy score increased by 13%. Unfortunately, the model is not performing well on children’s handwritten images, where we got accuracies between 0 and 20%. Preprocessing the images improved some of the scores by about 10%.

Image provided by the author

What’s next?

We still have a long way to go with the model. The DS team needs to discuss a way forward with managers and stakeholders.

Here are some of the tasks we think will help continue the work we have done.

  • Clean more data
  • Build ML infrastructure to automatically track the model structure, hyperparameters and performance metrics
  • Optimize the data pipeline
  • Use the Docker container for OCR training
  • Build a preprocessing pipeline

References

The Lambda School Scribble Stadium DS repo: https://github.com/Lambda-School-Labs/scribble-stadium-ds

Children’s Online Privacy Protection Rule (“COPPA”): https://www.ftc.gov/enforcement/rules/rulemaking-regulatory-reform-proceedings/childrens-online-privacy-protection-rule

Story Squad (Scribble Stadium): https://www.storysquad.education/
