OCR with Akka, Tesseract, and JavaCV

Duane Bester
Towards Data Science
5 min read · Jun 1, 2018

I recently had a use case where I needed to extract names and dates from PDF documents. I thought that spinning up a quick program leveraging Google's Tesseract to perform basic OCR would be easy enough. With a few lines of code, you can get node-tesseract running OCR on an image. However, if the image is skewed, noisy, or contains a bunch of embedded images, the text Tesseract returns becomes unusable.

The Tesseract documentation lists a bunch of ways to pre-process an image to improve OCR quality:

  • Re-scaling
  • Binarization
  • Noise Removal
  • Rotation (de-skewing)
  • Border Removal

That's a lot of steps just to extract the text from an image, and I still needed to perform date extraction and named-entity extraction. The problem was that I was unfamiliar with these pre-processing and extraction techniques, so I thought it would be a good idea to build a system with a pluggable architecture, where I could add steps as I learned how to implement them. Wouldn't it be nice if images could flow through these different transformation stages? 😉

Akka Stream

The Journey

Let’s build a REST server that accepts an image as an upload where various endpoints will return various results. The Tess4j library has a lot of convenience methods, some of which we will use for image de-skew and binarization. We will use JavaCV, an OpenCV wrapper for image noise removal and general enhancement.

We will eventually build in date and name extraction with OpenNLP and Natty as well as using a spellchecker to enhance OCR output — Part 2.

I took a picture of a page in a book. This will serve as the input to the system. Notice the slight angle of the picture.

input.jpg

Required Software Libraries

Let’s create a trait to hold tesseract; we will be able to mix this in later.
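The original gist is not shown here, so this is a minimal sketch of such a trait using Tess4j's `Tesseract` class; the data path and language are assumptions, so point them at your local tessdata installation.

```scala
import net.sourceforge.tess4j.{ITesseract, Tesseract}

// Mix-in trait that owns a single Tesseract instance.
trait TesseractOCR {
  lazy val tesseract: ITesseract = {
    val t = new Tesseract()
    t.setDatapath("/usr/local/share/tessdata") // assumed tessdata location
    t.setLanguage("eng")
    t
  }
}
```

Any class or object that mixes in `TesseractOCR` gets a lazily-created, shared `tesseract` instance.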

Now we will create our base App, using Akka HTTP, which will bind to port 8080. We create a REST endpoint that reads an image into memory and, for now, does nothing with it. This endpoint will eventually return a pre-processed image.
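A sketch of that server, under a few assumptions: the object name `WebServer` is mine, the multipart field name `fileUpload` matches the curl commands later in this post, and `bindAndHandle` is the classic Akka HTTP binding API (current when this was written; newer releases prefer `Http().newServerAt`).

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.util.ByteString
import java.awt.image.BufferedImage
import java.io.ByteArrayInputStream
import javax.imageio.ImageIO

object WebServer extends App {
  implicit val system: ActorSystem = ActorSystem("ocr-system")
  import system.dispatcher

  val route =
    path("image" / "process") {
      post {
        // Field name matches the curl examples below
        fileUpload("fileUpload") {
          case (_, byteSource) =>
            val imageF = byteSource
              .runFold(ByteString.empty)(_ ++ _) // pull the upload into memory
              .map(bytes => ImageIO.read(new ByteArrayInputStream(bytes.toArray)))
            onSuccess(imageF) { (_: BufferedImage) =>
              complete("image received") // placeholder until the flows are added
            }
        }
      }
    }

  Http().bindAndHandle(route, "localhost", 8080)
}
```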

As we can see, our image is of type BufferedImage, and we can use Tess4j's helper methods to create a binary image. We will create a Flow of BufferedImage that maps the helper function onto incoming BufferedImages. Binarization converts the image to black and white. This won't work well for images that have shadows over the text, but I'll elaborate more on this later.
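The binarization flow is a one-liner over Tess4j's `ImageHelper.convertImageToBinary`; the flow name is my own.

```scala
import java.awt.image.BufferedImage
import akka.NotUsed
import akka.stream.scaladsl.Flow
import net.sourceforge.tess4j.util.ImageHelper

// Map Tess4j's binarization helper over each incoming image.
val binarizationFlow: Flow[BufferedImage, BufferedImage, NotUsed] =
  Flow[BufferedImage].map(img => ImageHelper.convertImageToBinary(img))
```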

We can then use Tess4j again to de-skew the image. Let’s create two more flows: one that de-skews a BufferedImage with a minimum de-skew angle threshold and one that gets the Bytes from a BufferedImage so that we can send the bytes back to the client.
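Here is a sketch of those two flows. It uses Tess4j's `ImageDeskew` to measure the skew angle and `ImageHelper.rotateImage` to correct it; the 0.05-degree threshold is an assumption (a common default in Tess4j examples).

```scala
import java.awt.image.BufferedImage
import java.io.ByteArrayOutputStream
import javax.imageio.ImageIO
import akka.NotUsed
import akka.stream.scaladsl.Flow
import akka.util.ByteString
import net.sourceforge.tess4j.util.{ImageDeskew, ImageHelper}

val minDeskewThreshold = 0.05 // degrees; assumed threshold

// Rotate by the detected skew angle when it exceeds the threshold.
val deskewFlow: Flow[BufferedImage, BufferedImage, NotUsed] =
  Flow[BufferedImage].map { img =>
    val angle = new ImageDeskew(img).getSkewAngle
    if (math.abs(angle) > minDeskewThreshold) ImageHelper.rotateImage(img, -angle)
    else img
  }

// Encode a BufferedImage as PNG bytes for the HTTP response.
val imageToBytesFlow: Flow[BufferedImage, ByteString, NotUsed] =
  Flow[BufferedImage].map { img =>
    val out = new ByteArrayOutputStream()
    ImageIO.write(img, "png", out)
    ByteString(out.toByteArray)
  }
```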

Now we can combine all the above flows and update our server. We will make a Source out of our in-memory image, chain the flows together and pass this to Akka’s complete() method.
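One way to sketch the wiring. The function form here is only to keep the snippet self-contained; inside the route you would use the flows from the earlier snippets directly.

```scala
import java.awt.image.BufferedImage
import akka.stream.scaladsl.{Flow, Source}
import akka.util.ByteString

// Chain binarize ~> deskew ~> toBytes over a single in-memory image.
def processedSource(
    image: BufferedImage,
    binarize: Flow[BufferedImage, BufferedImage, _],
    deskew: Flow[BufferedImage, BufferedImage, _],
    toBytes: Flow[BufferedImage, ByteString, _]
): Source[ByteString, _] =
  Source.single(image).via(binarize).via(deskew).via(toBytes)
```

The resulting `Source[ByteString, _]` can be handed straight to `complete(HttpEntity(ContentType(MediaTypes.`image/png`), ...))`, so Akka HTTP streams the PNG back to the client.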

Running a simple curl against our server will give us back our pre-processed image.

curl -X POST -F 'fileUpload=@/Users/duanebester/Desktop/blog/input.jpg' 'http://localhost:8080/image/process' --output output.png

Here is our binary, de-skewed image:

Awesome, right!? Now that we have everything flowing, we can simply add more pieces to further enhance our image. At this point we could pass the image to Tesseract and have it perform OCR to give us a string, but there's so much image-processing magic in OpenCV that we should put it to use.

JavaCV and OpenCV

JavaCV and OpenCV use an object called a Mat to perform their image processing. The challenge is getting a Java BufferedImage to a JavaCV Mat and back again, so here are the flows to make this happen:
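A sketch of the conversion flows. JavaCV converts through its intermediate `Frame` type in both directions, via `Java2DFrameConverter` and `OpenCVFrameConverter.ToMat`; the package names assume JavaCV 1.5+ (earlier releases used `org.bytedeco.javacpp.opencv_core.Mat`).

```scala
import java.awt.image.BufferedImage
import akka.NotUsed
import akka.stream.scaladsl.Flow
import org.bytedeco.javacv.{Java2DFrameConverter, OpenCVFrameConverter}
import org.bytedeco.opencv.opencv_core.Mat

// The converters are not thread-safe, so create them per element.
val imageToMatFlow: Flow[BufferedImage, Mat, NotUsed] =
  Flow[BufferedImage].map { img =>
    val toFrame = new Java2DFrameConverter()
    val toMat   = new OpenCVFrameConverter.ToMat()
    toMat.convert(toFrame.convert(img)) // BufferedImage -> Frame -> Mat
  }

val matToImageFlow: Flow[Mat, BufferedImage, NotUsed] =
  Flow[Mat].map { mat =>
    val toFrame = new OpenCVFrameConverter.ToMat()
    val toImage = new Java2DFrameConverter()
    toImage.convert(toFrame.convert(mat)) // Mat -> Frame -> BufferedImage
  }
```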

We can now flow a BufferedImage ~> Mat ~> BufferedImage 😎

OpenCV has fastNlMeansDenoising and detailEnhance methods that we can use with a Mat — so let’s wrap these methods in a Flow[Mat]
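A sketch of those two wrappers, using the default parameters of each OpenCV call (note that `detailEnhance` expects an 8-bit 3-channel image, so the Mat coming out of the converter needs to be in that format).

```scala
import akka.NotUsed
import akka.stream.scaladsl.Flow
import org.bytedeco.opencv.global.opencv_photo.{detailEnhance, fastNlMeansDenoising}
import org.bytedeco.opencv.opencv_core.Mat

val denoiseFlow: Flow[Mat, Mat, NotUsed] =
  Flow[Mat].map { src =>
    val dst = new Mat()
    fastNlMeansDenoising(src, dst) // non-local-means denoising, default params
    dst
  }

val enhanceFlow: Flow[Mat, Mat, NotUsed] =
  Flow[Mat].map { src =>
    val dst = new Mat()
    detailEnhance(src, dst) // edge-preserving detail enhancement
    dst
  }
```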

We can now:

  1. Create a binary BufferedImage
  2. Convert it to a Mat and enhance it
  3. Convert it back to a BufferedImage
  4. De-skew the BufferedImage and then send it back to the client
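Those four steps are just flow composition. As a self-contained sketch (written generically so it compiles on its own; in the app `I` is `BufferedImage` and `M` is `Mat`, with the flows from the earlier snippets plugged in):

```scala
import akka.stream.scaladsl.Flow
import akka.util.ByteString

// Compose the whole pre-processing pipeline from its pieces.
def enhancedPipeline[I, M](
    binarize: Flow[I, I, _],
    toMat: Flow[I, M, _],
    denoise: Flow[M, M, _],
    enhance: Flow[M, M, _],
    toImage: Flow[M, I, _],
    deskew: Flow[I, I, _],
    toBytes: Flow[I, ByteString, _]
): Flow[I, ByteString, _] =
  binarize.via(toMat).via(denoise).via(enhance)
    .via(toImage).via(deskew).via(toBytes)
```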
A smoother & enhanced image

We could add in scaling here so that we send a smaller image, essentially cropping the image above down to the text for Tesseract to work on, but I haven't gotten that far yet 😅

The last piece is to perform the OCR on the BufferedImage and send the resulting String back to the client. We create a Flow[BufferedImage] that returns a String, and we update our web server to add in these additional flows.
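The OCR flow itself is a one-line wrapper around Tess4j's `doOCR`; it is written here as a function taking the `ITesseract` instance (from the trait earlier) so the snippet stands alone.

```scala
import java.awt.image.BufferedImage
import akka.NotUsed
import akka.stream.scaladsl.Flow
import net.sourceforge.tess4j.ITesseract

// Run Tesseract over the pre-processed image; doOCR returns the text.
def ocrFlow(tesseract: ITesseract): Flow[BufferedImage, String, NotUsed] =
  Flow[BufferedImage].map(img => tesseract.doOCR(img))
```

In the server this hangs off a second route, `path("image" / "ocr")`, which completes with the extracted String instead of the PNG bytes.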

When we run our new curl to just get the JSON response:

curl -X POST -F 'fileUpload=@/Users/duanebester/Desktop/blog/input.jpg' 'http://localhost:8080/image/ocr'

We get our text result:

CHAPTER 1
THE COMPOUND EFFECT IN ACTION
You know that expression, ”Slow and steady Wins the
race”? Ever heard the story of the tortoise and the hare?
Ladies and gentlemen, I’m the tortoise. Give me enough
time, and I will beat Virtually anybody, anytime, in any
competition. Why? Not because I’m the best or the smartest
or the fastest. I’ll win because of the positive habits I’ve
developed, and because of the consistency I use in applying
those habits. I’m the world’s biggest believer in consistency.
I’m living proof that it’s the ultimate key to success, yet it’s
one of the biggest pitfalls for people struggling to achieve.
Most people don’t know how to sustain it. I do. I have my
father to thank for that. In essence, he was my first coach
for igniting the power of the Compound Effect.
My parents divorced when I was eighteen mohths 01d,
and my dad raised me as a single father. He wasn t exactly

The page does get wavy towards the bottom of the picture, and Tesseract misread "months old" as "mohths 01d" and dropped the apostrophe in "wasn't". Nevertheless, we achieved 99.67% accuracy in the above conversion!

As you can see, with this system, it’s super easy to add in the pieces we need. We can look at taking the String result, and passing it to more stages that can perform certain spell checks and name/date extractions.

Thanks and continue to Part 2!

~ Duane
