Deploy Python Tesseract OCR on Heroku

Step-by-step approach (including screenshots & code) on how to deploy an OpenCV + Tesseract OCR app on Heroku

Published in

Towards Data Science

5 min read · May 4, 2020

Having worked in Data Science for almost half a decade now, I have realised that many of us concentrate on improving an algorithm's accuracy rather than on its usability in real life. This generally leads to mismatched expectations between the business and the data science teams. Even the most sophisticated algorithm is of no use until it is deployed effectively where it is needed. In this blog, we will learn how to deploy an OCR service using Tesseract & OpenCV, written in Python, on the Heroku platform.

But first things first — What is Heroku?

Heroku is a cloud platform as a service (PaaS). It supports multiple programming languages, such as Java, Node.js, Scala, Clojure, Python, PHP, and Go, and lets a developer build, run and scale applications in a similar manner across most of them.
In short, we can use Heroku to deploy our Python model into production so that anyone can use it, with no dependence on my desktop or operating system.

So what are we actually gonna do?

Index

Create a Tesseract OCR + OpenCV code on Python
Freeze Dependencies
Create Procfile
Create Aptfile
Configure a Heroku account
Copy the codes to Heroku server
Add Buildpacks & Config File
Test our OCR app

Create a Tesseract OCR + OpenCV code on Python

The code below does the following:
→ Input: Image file (.jpg, .png, etc.)
→ OpenCV: Read the image
→ Tesseract: Perform OCR on the image & return the text
→ FastAPI: Wrap up the above code to create a deployable API

########### pythoncode.py ###########
import io
import os
import re
import sys

import cv2
import numpy as np
import pytesseract
from fastapi import FastAPI, UploadFile, File
from pydantic import BaseModel
from starlette.requests import Request


def read_img(img):
    # Point pytesseract at the Tesseract binary installed on the Heroku dyno
    pytesseract.pytesseract.tesseract_cmd = '/app/.apt/usr/bin/tesseract'
    text = pytesseract.image_to_string(img)
    return text


app = FastAPI()


class ImageType(BaseModel):
    url: str


@app.post("/predict/")
def prediction(request: Request, file: bytes = File(...)):
    if request.method == "POST":
        # Turn the uploaded bytes into an OpenCV image
        image_stream = io.BytesIO(file)
        image_stream.seek(0)
        file_bytes = np.asarray(bytearray(image_stream.read()), dtype=np.uint8)
        frame = cv2.imdecode(file_bytes, cv2.IMREAD_COLOR)
        # Run OCR on the decoded image and return the text
        label = read_img(frame)
        return label
    return "No post request found"

The line pytesseract.pytesseract.tesseract_cmd = '/app/.apt/usr/bin/tesseract' is very important: it tells pytesseract where the Tesseract binary lives on the Heroku dyno, so don't forget to add it to your code.
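Because that path only exists on the Heroku dyno, the same code will not find Tesseract on a laptop. A minimal sketch of one way around this, assuming a hypothetical helper named resolve_tesseract_cmd (not part of pytesseract), is to prefer the Heroku path and fall back to whatever is on the local PATH:

```python
import os
import shutil

# Path used in this post for the Tesseract binary on the Heroku dyno.
HEROKU_TESSERACT = "/app/.apt/usr/bin/tesseract"


def resolve_tesseract_cmd(heroku_path=HEROKU_TESSERACT):
    """Return the Heroku binary if it exists, else the one on PATH (or None)."""
    if os.path.isfile(heroku_path):
        return heroku_path
    # Local development: fall back to a system-wide install, if there is one.
    return shutil.which("tesseract")
```

If the helper returns a path, it can be assigned to pytesseract.pytesseract.tesseract_cmd in place of the hard-coded string; if it returns None, Tesseract is not installed in that environment.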

Freeze Dependencies

We need to save the project's dependencies (the libraries used) in requirements.txt. A quick shortcut:

pip freeze > requirements.txt
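pip freeze only records what is installed in the current environment, so it is worth checking that the packages the app and the Procfile rely on actually made it into the file. The sketch below is one way to do that; the NEEDED set is an assumption based on the imports in this post (python-multipart is what FastAPI uses to parse File uploads), and OpenCV is left out because its package name varies between installs:

```python
import re

# Assumed package list for this post's app; adjust it to your own imports.
NEEDED = {"fastapi", "gunicorn", "uvicorn", "numpy", "pytesseract", "python-multipart"}


def normalise(name):
    """Lower-case a package name and treat '-', '_' and '.' as equivalent."""
    return re.sub(r"[-_.]+", "-", name).lower()


def missing_requirements(requirements_text, needed=NEEDED):
    """Return the needed packages that are absent from a requirements.txt body."""
    found = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # The package name is everything before a version specifier or extras.
        name = re.split(r"[=<>!~\[;@ ]", line, maxsplit=1)[0]
        found.add(normalise(name))
    return sorted(normalise(n) for n in needed if normalise(n) not in found)
```

Calling missing_requirements(open("requirements.txt").read()) returns an empty list when everything in NEEDED is present.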

Create Procfile

Heroku apps include a Procfile that specifies the commands the app executes on startup. You can use a Procfile to declare a variety of process types, including:

Your app’s web server
Multiple types of worker processes
A singleton process, such as a clock
Tasks to run before a new release is deployed

web: gunicorn -w 4 -k uvicorn.workers.UvicornWorker pythoncode:app

Note that the Procfile has NO file extension.
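To make the format concrete, here is a small sketch that reads a Procfile body the way the line above is written: a process type, a colon, then the command. The parse_procfile function is a hypothetical helper for illustration, not something Heroku provides:

```python
def parse_procfile(text):
    """Map each process type in a Procfile body to its command string."""
    processes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Each entry has the form "<process type>: <command>".
        name, sep, command = line.partition(":")
        if not sep or not command.strip():
            raise ValueError(f"not a valid Procfile line: {line!r}")
        processes[name.strip()] = command.strip()
    return processes


procfile = "web: gunicorn -w 4 -k uvicorn.workers.UvicornWorker pythoncode:app\n"
processes = parse_procfile(procfile)
# The last token is "<module>:<variable>", i.e. the `app` object in pythoncode.py.
module_name, app_name = processes["web"].split()[-1].split(":")
```

The last two names show why the file must be called pythoncode.py and the FastAPI object must be called app: gunicorn looks them up from that final token.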

Create Aptfile

A couple of the things we need, such as Tesseract itself, are system packages rather than Python packages, so they cannot be installed through requirements.txt. To make them work on Heroku, we need a buildpack (we will add one for Tesseract). We then list any APT package we want in an Aptfile in the application, and the buildpack installs these packages on the dyno when we deploy.

For our purpose we will add the following in our Aptfile

tesseract-ocr
tesseract-ocr-eng

Note that the Aptfile has NO file extension.

Note: We will add the buildpack for Tesseract in the Heroku console.

Lastly, save all these files in one folder.
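Before pushing, a quick check that the folder really contains all four files can save a failed deploy. This is a minimal sketch; the file names come from this post, and missing_deploy_files is a hypothetical helper:

```python
from pathlib import Path

# File names used in this post; Procfile and Aptfile must have no extension.
EXPECTED_FILES = ("pythoncode.py", "requirements.txt", "Procfile", "Aptfile")


def missing_deploy_files(folder, expected=EXPECTED_FILES):
    """Return the expected files that are not present directly inside folder."""
    present = {p.name for p in Path(folder).iterdir() if p.is_file()}
    return [name for name in expected if name not in present]
```

A file saved by mistake as Procfile.txt shows up here as a missing Procfile, which is the most common way this step goes wrong.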

Configure a Heroku account

→ Create a new account or log in at https://id.heroku.com/login

→ Create a new app and give it a name

→ You should now be logged in to the console

Copy the codes to Heroku server

I am using the Heroku CLI (command line interface) to push the code to Heroku, but you can use GitHub too. GitHub has a daily limit on how many times the code can be pushed, so I prefer the CLI.

Heroku lays out the steps very neatly for the user.

They translate to the following on the command line:

$ heroku login
$ cd /Users/shirishgupta/Desktop/Heroku/heroku_sample/
$ git init
$ heroku git:remote -a tesseractsample

$ git add .
$ git commit -am "make it better"
$ git push heroku master
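If you redeploy often, the same sequence can be scripted. The sketch below only wraps the commands shown above in Python's subprocess module; deploy_commands and deploy are hypothetical helper names, and heroku login is left out because it is interactive and has to be done once beforehand:

```python
import subprocess


def deploy_commands(app_name, message="make it better", branch="master"):
    """Return the git/Heroku CLI calls from the steps above as argument lists."""
    return [
        ["git", "init"],
        ["heroku", "git:remote", "-a", app_name],
        ["git", "add", "."],
        ["git", "commit", "-am", message],
        ["git", "push", "heroku", branch],
    ]


def deploy(project_dir, app_name):
    """Run each command inside project_dir, stopping at the first failure."""
    for command in deploy_commands(app_name):
        subprocess.run(command, cwd=project_dir, check=True)
```

For the app in this post the call would be deploy("/Users/shirishgupta/Desktop/Heroku/heroku_sample/", "tesseractsample").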

The output shows that the app has been deployed, but we still need to do one more thing before it actually works.

Add Buildpacks & Config File

→ In your Heroku console go to Settings
→ Add the following Buildpack

https://github.com/heroku/heroku-buildpack-apt

→ Add the Tesseract config variable (under Config Vars in Settings)

TESSDATA_PREFIX = ./.apt/usr/share/tesseract-ocr/4.00/tessdata

Hint: find the correct tessdata path by entering the following in the Heroku terminal

$ heroku run bash
$ find -iname tessdata
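The same lookup can be done from Python, for example to log the path at startup. This is a rough equivalent of the find command above using only the standard library; find_tessdata is a hypothetical helper name:

```python
import os


def find_tessdata(root="."):
    """Return every directory under root whose name is 'tessdata' (any case)."""
    matches = []
    for current, dirnames, _files in os.walk(root):
        for name in dirnames:
            if name.lower() == "tessdata":
                matches.append(os.path.join(current, name))
    return sorted(matches)
```

Whatever path it reports is the value to use for TESSDATA_PREFIX; the 4.00 segment in the example above depends on the Tesseract version that was installed.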

Test our OCR app

Having completed all the steps, we are all set to test our OCR. Go to the Heroku console and click “Open App”, or enter ‘https://tesseractsample.herokuapp.com/docs’ in your browser.
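The /docs page is the easiest way to try the API by hand, but the endpoint can also be called from a script. The following is a minimal sketch using only the standard library; build_multipart and ocr_image are hypothetical helpers, the URL assumes the app name used in this post, and the form field is named file to match the parameter of the prediction function:

```python
import os
import urllib.request
import uuid


def build_multipart(field, filename, data, content_type="application/octet-stream"):
    """Encode one file as a multipart/form-data body; return (body, header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def ocr_image(image_path, url="https://tesseractsample.herokuapp.com/predict/"):
    """POST an image to the /predict/ endpoint and return the raw response text."""
    with open(image_path, "rb") as handle:
        body, content_type = build_multipart(
            "file", os.path.basename(image_path), handle.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling ocr_image("sample.png") with any local image returns the response body as text; since FastAPI serialises the returned string as JSON, the recognised text arrives wrapped in quotes.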

Input Image

Output Image

The end for now. Have any ideas to improve this or want me to try any new ideas? Please give your suggestions in the comments. Adios.

Check out the related blog: Deploy Python + Tesseract + OpenCV Fast API using AWS EC2 Instance