A Better Way to Process Images for OCR

I wish I had discovered ScanTailor sooner

Ang Li-Lian
Towards Data Science


In my last article, I walked you through processing images for OCR using Python code. OCR requires a clear view of text that can be read from left to right. The more the text looks like it was just printed, the more accurate the results. I have since written a more intelligent algorithm and found an open-source program, ScanTailor, which does a much better job of processing images, especially those with geometric warping (i.e. when the text is curled)! I hope that this method will help others develop more accurate page-cropping algorithms and revive interest in improving ScanTailor. Check out the previous article for more details on each step of image processing.

The previous code worked well at splitting relatively clean images into single columns of text, but performed poorly when images were very distorted or ‘dirty’. The algorithm relied on finding a clean white space between the columns, but there were a lot of black areas on all sides which could not be cut out easily.

The new algorithmic approach uses OpenCV’s EAST text detector to find the bounding boxes of all the text. Then, I created a histogram of the leftmost vertical edges of the bounding boxes. (The right edge of a bounding box is less accurate than the left: the ends of words are usually clipped.)

“OpenCV’s EAST text detector is a deep learning model, based on a novel architecture and training pattern. It is capable of (1) running at near real-time at 13 FPS on 720p images and (2) obtains state-of-the-art text detection accuracy.”

Bounding boxes identified on a scanned page by OpenCV’s EAST text detector. [Graphic by author. Pages by Quién es quién en Colombia, 2nd ed. (Bogotá: Oliverio Perry, 1948).]
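To make the histogram step concrete, here is a minimal sketch in plain Python. It assumes the detector's output has already been converted to `(x_left, y_top, x_right, y_bottom)` pixel tuples; the function name `left_edge_histogram` and the `bin_width` parameter are illustrative choices, not part of OpenCV or of the original code.

```python
def left_edge_histogram(boxes, image_width, bin_width=1):
    """Count how many bounding boxes have their left edge in each horizontal bin.

    boxes: iterable of (x_left, y_top, x_right, y_bottom) tuples in pixels
           (an assumed format; adapt it to whatever your detector returns).
    Returns a list with one count per bin, from left to right.
    """
    n_bins = -(-image_width // bin_width)  # ceiling division
    counts = [0] * n_bins
    for x_left, _, _, _ in boxes:
        # Clamp so boxes that spill slightly outside the image still get counted.
        x = min(max(int(x_left), 0), image_width - 1)
        counts[x // bin_width] += 1
    return counts


# Toy page, 100 px wide: three boxes start near x=10, two start at x=60.
boxes = [(10, 0, 40, 8), (11, 10, 45, 18), (10, 20, 38, 28),
         (60, 0, 90, 8), (60, 10, 95, 18)]
hist = left_edge_histogram(boxes, image_width=100, bin_width=10)
print(hist)
```

Only the left edges are counted, since the right edges are the less reliable ones. A wider `bin_width` smooths out small jitter in the detected edges at the cost of a coarser crop position.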

Next, I located the white centre column as the longest consecutive run of ‘empty’ space in the histogram, and the start of the second text column as the highest spike in bounding boxes after that run. The start of the second text column is the crop location. Ideally, we would crop somewhere between where the white centre column starts and where the second text column’s edge starts, but averaging those two positions tends to put the crop too far left. (We are working with a narrow margin of error here!)

Histogram of the number of bounding boxes whose left edge falls at each horizontal position on the image. [Graphic by author]
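The gap-and-spike search can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the input is a list of left-edge counts per horizontal bin, and empty runs that touch the page margins are ignored so that only the gap between columns is considered (that margin rule is my assumption). The names `longest_interior_gap` and `find_crop_bin` are hypothetical.

```python
def longest_interior_gap(hist):
    """Return (start, end) of the longest run of empty bins that has text on
    both sides; end is exclusive. Returns (0, 0) if there is no such run."""
    best = (0, 0)
    run_start = None
    seen_text = False
    for i, count in enumerate(hist):
        if count == 0:
            # Only open a run once some text has been seen (skips the left margin).
            if seen_text and run_start is None:
                run_start = i
        else:
            # A non-empty bin closes the current run, if any.
            if run_start is not None and i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
            seen_text = True
    # A run still open here touches the right margin and is ignored.
    return best


def find_crop_bin(hist):
    """Return (gap_start, column_start): where the centre gap begins and where
    the highest spike after it sits. Returns None if no interior gap exists."""
    gap_start, gap_end = longest_interior_gap(hist)
    if gap_start == gap_end:
        return None
    after = hist[gap_end:]
    column_start = gap_end + after.index(max(after))
    return gap_start, column_start


# Toy histogram: first column at bins 1-2, a stray box at bin 6, second column at bin 7.
toy_hist = [0, 5, 1, 0, 0, 0, 1, 6, 2, 0]
print(find_crop_bin(toy_hist))
```

Using the highest spike rather than the first non-empty bin after the gap keeps a stray box (bin 6 in the toy data) from being mistaken for the column edge. Cropping at `column_start` rather than at the midpoint of the two returned positions follows the observation that the average lands too far left.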

But some pages were skewed, so there was no clean vertical gap until the image was rotated. I solved this by rotating the image in 0.5-degree increments clockwise and anti-clockwise. Each rotation was scored by its longest consecutive ‘empty’ space. Getting the bounding boxes took a fair amount of time, so to speed up the code, I settled on the first rotation angle whose score was ≥8. That threshold was chosen by observation and trial and error.
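A rough sketch of that rotation search with early stopping is below. The scoring step is passed in as a callable, `score_at`, which is a hypothetical stand-in for "rotate the image, detect boxes, and measure the longest empty run". The `max_angle` search limit and the convention that positive angles mean one direction and negative the other are my assumptions; only the 0.5-degree step and the threshold of 8 come from the text.

```python
def find_best_rotation(score_at, max_angle=5.0, step=0.5, good_enough=8):
    """Try angles 0, +step, -step, +2*step, -2*step, ... up to max_angle.

    score_at: callable(angle_in_degrees) -> score (hypothetical; it would wrap
              the expensive rotate + detect + histogram pipeline).
    Stops at the first angle whose score reaches good_enough; otherwise
    returns the best angle seen. Returns (angle, score).
    """
    best_angle, best_score = 0.0, score_at(0.0)
    if best_score >= good_enough:
        return best_angle, best_score
    n_steps = int(max_angle / step)
    for k in range(1, n_steps + 1):
        for angle in (k * step, -k * step):
            score = score_at(angle)
            if score >= good_enough:
                return angle, score  # early stop: skip the remaining angles
            if score > best_score:
                best_angle, best_score = angle, score
    return best_angle, best_score


# Toy scoring function that peaks at 1.5 degrees, recording which angles were tried.
tried = []

def toy_score(angle):
    tried.append(angle)
    return 10 - abs(angle - 1.5) * 4

result = find_best_rotation(toy_score)
print(result, tried)
```

Searching outwards from zero means that lightly skewed pages, which are the common case, need only a few detector runs. The trade-off of stopping early is that the accepted angle may not be the best one, only the first that clears the threshold.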

I was initially really happy that the images were cropped almost perfectly with a 0.4% error rate — until I ran the OCR. The results were terrible. None of the text close to the bookbinding or on pages where the text was warped was read. OCR assumes that the text it is looking for can fit into a rectangular box.

Poor approximation of where bounding boxes should be when the page is warped. [Graphics by author; Pages by Quién es quién en Colombia, 2nd ed. (Bogotá: Oliverio Perry, 1948).]

While searching for a way to unwarp the text in Python, I stumbled on ScanTailor, an open-source program that was developed to a production-ready release but has not been maintained since 2014. It performs page splitting, skew correction, content selection, binarisation and margin fixes, and the experimental version additionally offers correction of geometric distortions and despeckling.

The software is not perfect, and I had to correct many pages manually, but its interface makes it incredibly easy to do so with simple dragging and clicking. It took me about an hour to correct mistakes, but it was a good trade-off compared to the many more hours spent writing the actual code. Also, note that pages with very little text might end up scrubbed!

Screenshot of ScanTailor program after processing the previous image to fix geometric warping, splitting the page, selecting content and despeckling. [Graphics by author; Pages by Quién es quién en Colombia, 2nd ed. (Bogotá: Oliverio Perry, 1948).]

ScanTailor is also much faster, taking 1 hour to process my image batch compared to the 8 hours it took my code. However, the software is slightly unstable, crashing when too much RAM is used, so I suggest breaking down each run into batches and saving each time.

The forum offered several other options, but this was the most straightforward and best solution I found. (I use Windows!)

This no-code solution is definitely a big plus for those without technical skills, so I hope this article helps you prepare your images for OCR!

I write to help fellow Digital Humanities enthusiasts navigate the complex and beautiful forest of methods. Talk to me about social impact & fiction novels!