
Build An Image & PDF Text Extraction Tool with Tesseract OCR Using Client-side JavaScript

PDF.js + Tesseract.js – A Fusion of OCR & Web Technologies. Full code implementation included.

About a decade ago, Optical Character Recognition (OCR) tools such as the Tesseract OCR engine could only be run as native binaries compiled from C/C++, or through wrappers such as Windows executables (.exe), Python packages or Java Software Development Kits (SDKs).

Following the emergence of WebAssembly (WASM) compilers, Tesseract OCR has since been compiled into the JavaScript plugin Tesseract.js (with sincere thanks to fellow Medium writer Jerome Wu). This in turn enables a fully client-side JavaScript implementation of a PDF-to-Text application by combining it with another JavaScript plugin, PDF.js.

Illustration by Author

In total, I have implemented 2 OCR-related side projects as part of my self-exploration of OCR implementations. They are listed as follows:

Part 1: Image-to-Text

Build A Text-To-Speech App Using Client-Side JavaScript

Part 2: PDF-to-Text✶

✶ As in Part 1: Build A Text-To-Speech App Using Client-Side JavaScript, the main principle of text extraction remains the same. The only additional intermediate step is to convert each page of the uploaded PDF document into an image, as detailed in the implementation steps below.


Building a PDF-To-Text Application with Tesseract OCR

For this application, a self-hosted copy of Tesseract.js v2 is used to enable offline usage and portability.

Step 1. Retrieve the following 4 files of Tesseract.js v2

<script src='js/tesseract/tesseract.min.js'></script>
  • Proceed to assign the respective worker attributes as constants
  • Encapsulate the worker instantiation into an async function
const tesseractWorkerPath='js/tesseract/worker.min.js';
const tesseractLangPath='js/tesseract/lang-data/4.0.0_best';
const tesseractCorePath='js/tesseract/tesseract-core.wasm.js';
var worker;
async function initTesseractWorker() {
  worker = Tesseract.createWorker({
    workerPath: tesseractWorkerPath,
    langPath:  tesseractLangPath,
    corePath: tesseractCorePath
  });    
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  return new Promise((resolve) => resolve('worker initialised.'));
}

Note: Since the app is self-hosted, the worker, language and core paths must be redefined to point at the local copies, using paths relative to the page.
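As a small illustration of the note above, the three paths could be derived from a single base folder, so that relocating the self-hosted files means changing one string. This is only a sketch: the helper name buildTesseractPaths is hypothetical and not part of Tesseract.js, and the folder layout is assumed to match the constants shown in Step 1. Only the option keys workerPath, langPath and corePath come from the snippet above.

```javascript
// Hypothetical helper (not part of Tesseract.js): derives the three
// self-hosted paths used in Step 1 from one base folder.
function buildTesseractPaths(baseDir) {
  // Drop any trailing slashes so the joined paths never contain '//'
  const base = baseDir.replace(/\/+$/, '');
  return {
    workerPath: `${base}/worker.min.js`,
    langPath: `${base}/lang-data/4.0.0_best`,
    corePath: `${base}/tesseract-core.wasm.js`
  };
}

// Assumed layout: all Tesseract.js files live under js/tesseract/
const tesseractPaths = buildTesseractPaths('js/tesseract/');
console.log(tesseractPaths.workerPath);
```

The returned object has the same shape as the options passed to Tesseract.createWorker() in initTesseractWorker(), so it could be handed over directly.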

Step 2. Retrieve the following 2 files of PDF.js

Note: PDF.js was originally developed by Mozilla to render PDF documents with JavaScript. The original files can be found here.

Import plugin in browser:

<script src='js/pdf/pdf.min.js'></script>

Step 3. Create User Interface for PDF Upload

  • HTML File Input and PDF page no. display
<input id='uploadPDF' type='file' />
<hr>
Pg <span id='currentPageNo'></span> of <span id='totalPages'></span>
  • JavaScript Code Snippet
const pdfWorkerPath='js/pdf/pdf.worker.min.js';
const pixelRatio=window.devicePixelRatio*2;
var uploadPDF=document.getElementById('uploadPDF');
var currentPageNo=document.getElementById('currentPageNo');
var totalPages=document.getElementById('totalPages');
var _PDF_DOC, _PAGE, noOfPages, currentPage=1;
var _CANVAS=document.createElement('canvas');
function readFileAsDataURL(file) {
  return new Promise((resolve,reject) => {
    let fileredr = new FileReader();
    fileredr.onload = () => resolve(fileredr.result);
    fileredr.onerror = () => reject(fileredr);
    fileredr.readAsDataURL(file);
  });
}
const loadImage = (url) => new Promise((resolve, reject) => {
  const img = new Image();
  img.addEventListener('load', () => resolve(img));
  img.addEventListener('error', (err) => reject(err));
  img.src = url;
});
uploadPDF.addEventListener('change', function(evt) {
  let file = evt.currentTarget.files[0];
  if(!file) return;
  readFileAsDataURL(file).then((pdf_url) => {
    pdfjsLib.GlobalWorkerOptions.workerSrc=pdfWorkerPath;
    (async () => {
      // getDocument() returns a loading task; await its promise to get the document
      _PDF_DOC = await pdfjsLib.getDocument({ url: pdf_url }).promise;
      noOfPages = _PDF_DOC.numPages;
      totalPages.innerHTML = noOfPages;
      while(currentPage<=noOfPages) {
        await initTesseractWorker(); // function defined in Step 1
        currentPageNo.innerHTML=currentPage;
        _PAGE = await _PDF_DOC.getPage(currentPage);
        // Older PDF.js releases accept a numeric scale; newer ones expect getViewport({ scale: 1 })
        let viewport = _PAGE.getViewport(1);
        let pdfOriginalWidth = viewport.width;
        let viewpointHeight=viewport.height;
        _CANVAS.width=pdfOriginalWidth*pixelRatio;
        _CANVAS.height=viewpointHeight*pixelRatio;
        _CANVAS['style']['width'] = `${pdfOriginalWidth}px`;
        _CANVAS['style']['height'] = `${viewpointHeight}px`;
        _CANVAS.getContext('2d').scale(pixelRatio, pixelRatio);
        var renderContext = {
          canvasContext: _CANVAS.getContext('2d'),
          viewport: viewport
        };
        // render() returns a render task; await its promise so the page is fully drawn
        await _PAGE.render(renderContext).promise;
        let b64str=_CANVAS.toDataURL();
        let loadedImg = await loadImage(b64str);
        let result=await worker.recognize(loadedImg);
        let extractedData=result.data;

        let wordsArr=extractedData.words;
        let combinedText='';
        for(let w of wordsArr) {
          combinedText+=(w.text)+' ';
        }
        // 'beforeend' appends inside the element, after its existing content
        inputTxt.insertAdjacentText('beforeend', combinedText);
        await worker.terminate();
        currentPage++;
      }
    })();
  });
}, false);
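The canvas sizing inside the loop can also be read in isolation. The sketch below uses a hypothetical helper, computeCanvasSize (not part of PDF.js), that reproduces the same arithmetic: the backing store is enlarged by pixelRatio so the rendered page is sharper for OCR, while the CSS size stays at the original page dimensions. Values are rounded down here because a canvas width and height are stored as integers.

```javascript
// Hypothetical helper: given the page viewport size and a pixel ratio,
// return the canvas backing-store size and the CSS display size.
function computeCanvasSize(viewportWidth, viewportHeight, pixelRatio) {
  return {
    width: Math.floor(viewportWidth * pixelRatio),   // backing-store pixels
    height: Math.floor(viewportHeight * pixelRatio),
    styleWidth: `${viewportWidth}px`,                // on-screen size
    styleHeight: `${viewportHeight}px`
  };
}

// Example: a page whose viewport is 612 x 792, rendered with pixelRatio = 2
const size = computeCanvasSize(612, 792, 2);
console.log(size.width, size.height); // prints 1224 1584
```

A larger pixelRatio generally gives Tesseract more pixels per character at the cost of memory and processing time, so the factor of 2 used in the snippet is a trade-off rather than a fixed requirement.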

Explanation:

  • pdfjsLib.GlobalWorkerOptions.workerSrc=pdfWorkerPath; assigns the PDF plugin’s worker path to its global namespace
  • The variable _CANVAS is created programmatically because the PDF.js plugin renders each page onto a HTML Canvas Element
  • Upon upload of a PDF document, the file is read as a base64 data URL into the variable pdf_url, which is used to retrieve the _PDF_DOC object
  • A while-loop processes the individual pages of the uploaded PDF document. Each page is rendered onto the Canvas Element, and the image data is extracted as the variable b64str, which is then passed to the utility function loadImage(). This returns an Image() element from which Tesseract's worker extracts the embedded text.
  • For each page image processed, inputTxt.insertAdjacentText() with the 'beforeend' position appends the extracted text to the text field inputTxt, until all pages of the PDF are processed.
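The word-joining step described in the last bullet can be sketched as a small pure function. The name combineWords is hypothetical, and the sample data is made up; it only mimics the shape of result.data.words used in the snippet above, where each entry carries a text field.

```javascript
// Hypothetical helper mirroring the for-loop in the snippet above:
// joins the `text` field of every recognised word with single spaces.
function combineWords(words) {
  return words
    .map((w) => w.text)
    .filter((t) => typeof t === 'string' && t.length > 0) // skip empty entries
    .join(' ');
}

// Stand-in data shaped like `result.data.words` (values are made up)
const sampleWords = [{ text: 'Hello' }, { text: 'PDF' }, { text: 'world' }];
console.log(combineWords(sampleWords)); // prints "Hello PDF world"
```

Unlike the original loop, this version does not leave a trailing space, so a separator such as a space or a newline would need to be added between pages when appending to the text field.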

Important Point: In each iteration of the while-loop, a single page image is processed by a single worker, which is terminated afterwards. Hence, a new worker must be instantiated for every subsequent page before its embedded text can be extracted.
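The one-worker-per-page lifecycle can be sketched with the browser-specific parts injected as callbacks, which makes the flow easy to try outside a browser. This is an illustration under stated assumptions, not the code from the repo: extractAllPages, createWorker and renderPage are hypothetical names, and the mock worker below only imitates the recognize() and terminate() calls used in the snippet above.

```javascript
// Sketch of the per-page loop. `createWorker` and `renderPage` are
// hypothetical callbacks; in the app they would wrap the worker setup
// from Step 1 and the canvas rendering from Step 3.
async function extractAllPages(numPages, createWorker, renderPage) {
  const pageTexts = [];
  for (let pageNo = 1; pageNo <= numPages; pageNo++) {
    const worker = await createWorker();      // fresh worker for this page
    try {
      const image = await renderPage(pageNo);
      const result = await worker.recognize(image);
      pageTexts.push(result.data.words.map((w) => w.text).join(' '));
    } finally {
      await worker.terminate();               // released even if OCR fails
    }
  }
  return pageTexts;
}

// Usage with stand-in (mock) implementations instead of Tesseract/PDF.js
const mockCreateWorker = async () => ({
  recognize: async (image) => ({ data: { words: [{ text: 'from' }, { text: image }] } }),
  terminate: async () => {}
});
const mockRenderPage = async (pageNo) => `page-${pageNo}`;

extractAllPages(3, mockCreateWorker, mockRenderPage)
  .then((texts) => console.log(texts));
```

Wrapping the recognition in try/finally is a design choice of this sketch: it ensures that a page which fails to render or recognise does not leave its worker running.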

Preview of Implementation

Screencapture by Author | Upon upload of sample.pdf file, text extraction for each page occurs and is appended to the text field below accordingly.

Full source code is available at my GitHub repo: Text-To-Speech-App, or try it out at the demo!

  • Note that additional functionalities have been added from Part I. They are:
Image by Author | Buttons with *(👆 )** are selectable by users for additional implementation details
Image by Author | Selection of [┏🠋┓ Download Text] enables users to download all extracted text content present within the text field

Many thanks for persisting to the end of this article! ❤ Hope you have found this implementation helpful.

If you are interested or perhaps keen on more GIS, Data Analytics & Web application-related content, feel free to follow me on Medium. Would really appreciate it – 😀

— 🌮 Please buy me a Taco ξ(🎀 ˶❛◡❛)

