Approximately a decade ago, Optical Character Recognition (OCR) tools such as the Tesseract OCR engine could only be run as compiled C/C++ binaries or through wrappers such as Windows executables (.exe), Python packages or Java Software Development Kits (SDKs).
Following the emergence of WebAssembly (WASM) compilers, Tesseract OCR has since been compiled into the JavaScript plugin Tesseract.js (with sincere thanks to fellow Medium writer Jerome Wu). This has in turn enabled a fully client-side JavaScript implementation of a PDF-to-Text application by combining it with another JavaScript plugin, PDF.js.

In total, 2 OCR-related side projects were implemented as part of my self-exploration journey on OCR implementations. They are listed as follows:
- Part 1: Image-to-Text
- Part 2: PDF-to-Text✶
✶ Similar to Part I: Build A Text-To-Speech App Using Client-Side JavaScript, the main principle of text extraction remains the same. The only additional intermediate step is to convert the pages of an uploaded PDF document into images, which is detailed in the implementation steps below.
Building a PDF-To-Text Application with Tesseract OCR
For this application, a self-hosted version of Tesseract.js v2 shall be implemented to enable offline usage and portability.
Step 1. Retrieve the following 4 files of Tesseract.js v2
- tesseract.min.js
- worker.min.js
- tesseract-core.wasm.js
- eng.traineddata.gz*
* For simplicity, all text to be extracted is assumed to be in English.
- Import the plugin
<script src='js/tesseract/tesseract.min.js'></script>
- Assign the respective worker attributes as constants
- Encapsulate the worker instantiation in an async function
// Local (self-hosted) paths to the Tesseract.js v2 files
const tesseractWorkerPath='js/tesseract/worker.min.js';
const tesseractLangPath='js/tesseract/lang-data/4.0.0_best';
const tesseractCorePath='js/tesseract/tesseract-core.wasm.js';
var worker;
async function initTesseractWorker() {
  // Create the worker, then load the core and the English language data
  worker = Tesseract.createWorker({
    workerPath: tesseractWorkerPath,
    langPath: tesseractLangPath,
    corePath: tesseractCorePath
  });
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  // An async function already returns a Promise, so a plain value suffices
  return 'worker initialised.';
}
Note: Since the app is self-hosted, the paths above need to be re-defined to point to the local copies of the files.
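Since the three paths share one base folder, one way to keep them in sync when the app is moved is to derive them from a single base path. The following is a minimal sketch, assuming the folder layout used in the constants above; `buildTesseractPaths` is a hypothetical helper and not part of Tesseract.js:

```javascript
// Hypothetical helper (not part of Tesseract.js): derives the three
// self-hosted paths from one base folder. The file layout is assumed
// to mirror the constants defined in Step 1.
function buildTesseractPaths(baseDir) {
  // Drop any trailing slashes so the joined paths contain no '//'
  const base = baseDir.replace(/\/+$/, '');
  return {
    workerPath: `${base}/worker.min.js`,
    langPath: `${base}/lang-data/4.0.0_best`,
    corePath: `${base}/tesseract-core.wasm.js`
  };
}

// In the browser, the object could be passed to Tesseract.createWorker():
// worker = Tesseract.createWorker(buildTesseractPaths('js/tesseract'));
const paths = buildTesseractPaths('js/tesseract/');
console.log(paths.workerPath); // js/tesseract/worker.min.js
```

With this approach, relocating the Tesseract files only requires changing the single base folder argument.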
Step 2. Retrieve the following 2 files of PDF.js
- pdf.min.js
- pdf.worker.min.js
Note: PDF.js was originally developed by Mozilla to render PDF documents via JavaScript. Original files can be found here.
Import plugin in browser:
<script src='js/pdf/pdf.min.js'></script>
Step 3. Create User Interface for PDF Upload
- HTML File Input and PDF page no. display
<input id='uploadPDF' type='file' />
<hr>
Pg <span id='currentPageNo'></span> of <span id='totalPages'></span>
- JavaScript Code Snippet
const pdfWorkerPath='js/pdf/pdf.worker.min.js';
const pixelRatio=window.devicePixelRatio*2;
var uploadPDF=document.getElementById('uploadPDF');
var currentPageNo=document.getElementById('currentPageNo');
var totalPages=document.getElementById('totalPages');
var _PDF_DOC, _PAGE, noOfPages, currentPage=1;
var _CANVAS=document.createElement('canvas');

// Reads the uploaded file and resolves with its base64 data URL
function readFileAsDataURL(file) {
  return new Promise((resolve,reject) => {
    let fileredr = new FileReader();
    fileredr.onload = () => resolve(fileredr.result);
    fileredr.onerror = () => reject(fileredr);
    fileredr.readAsDataURL(file);
  });
}

// Loads a URL into an Image() element and resolves once it is ready
const loadImage = (url) => new Promise((resolve, reject) => {
  const img = new Image();
  img.addEventListener('load', () => resolve(img));
  img.addEventListener('error', (err) => reject(err));
  img.src = url;
});
uploadPDF.addEventListener('change', function(evt) {
  let file = evt.currentTarget.files[0];
  if(!file) return;
  currentPage = 1; // restart from the first page on every upload
  readFileAsDataURL(file).then((pdf_url) => {
    pdfjsLib.GlobalWorkerOptions.workerSrc=pdfWorkerPath;
    (async () => {
      // getDocument() returns a loading task; its promise resolves to the document
      _PDF_DOC = await pdfjsLib.getDocument({ url: pdf_url }).promise;
      noOfPages = _PDF_DOC.numPages;
      totalPages.innerHTML = noOfPages;
      while(currentPage<=noOfPages) {
        await initTesseractWorker();
        currentPageNo.innerHTML=currentPage;
        _PAGE = await _PDF_DOC.getPage(currentPage);
        // Newer PDF.js releases expect getViewport({ scale: 1 }) instead
        let viewport = _PAGE.getViewport(1);
        let pdfOriginalWidth = viewport.width;
        let viewpointHeight = viewport.height;
        // Enlarge the canvas backing store so the page is rendered at a higher resolution
        _CANVAS.width=pdfOriginalWidth*pixelRatio;
        _CANVAS.height=viewpointHeight*pixelRatio;
        _CANVAS['style']['width'] = `${pdfOriginalWidth}px`;
        _CANVAS['style']['height'] = `${viewpointHeight}px`;
        _CANVAS.getContext('2d').scale(pixelRatio, pixelRatio);
        var renderContext = {
          canvasContext: _CANVAS.getContext('2d'),
          viewport: viewport
        };
        // render() returns a render task; wait on its promise
        await _PAGE.render(renderContext).promise;
        let b64str=_CANVAS.toDataURL();
        let loadedImg = await loadImage(b64str);
        let result=await worker.recognize(loadedImg);
        let extractedData=result.data;
        let wordsArr=extractedData.words;
        let combinedText='';
        for(let w of wordsArr) {
          combinedText+=(w.text)+' ';
        }
        // inputTxt is the text field carried over from Part I;
        // 'beforeend' appends the text after its existing content
        inputTxt.insertAdjacentText('beforeend', combinedText);
        await worker.terminate();
        currentPage++;
      }
    })();
  });
}, false);
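The canvas sizing in the snippet above gives the canvas a backing store that is larger than its CSS size, so that each page is rasterised at a higher resolution before OCR. The arithmetic can be isolated as a small sketch; `computeCanvasSize` is a hypothetical helper introduced here for illustration, and the 612 x 792 page size is only an example value:

```javascript
// Hypothetical helper: computes the canvas backing-store size (in device
// pixels) and the CSS size (in CSS pixels) for a page viewport.
function computeCanvasSize(viewportWidth, viewportHeight, pixelRatio) {
  return {
    // Canvas dimensions are integers, so round down explicitly
    width: Math.floor(viewportWidth * pixelRatio),
    height: Math.floor(viewportHeight * pixelRatio),
    cssWidth: `${viewportWidth}px`,
    cssHeight: `${viewportHeight}px`
  };
}

// Example values only: a 612 x 792 viewport at a pixel ratio of 2
const size = computeCanvasSize(612, 792, 2);
console.log(size.width, size.height); // 1224 1584
```

A larger pixel ratio generally gives the OCR engine more pixels per character to work with, at the cost of more memory and a longer processing time per page.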
Explanation:
- pdfjsLib.GlobalWorkerOptions.workerSrc=pdfWorkerPath; assigns the PDF plugin’s worker path to its global namespace.
- The variable _CANVAS is created programmatically because the PDF.js plugin renders each page onto an HTML Canvas element.
- Upon upload of a PDF document, the file is read as a base64 string into the variable pdf_url, which is used to retrieve the _PDF_DOC object.
- A while-loop processes the individual pages of the uploaded PDF document. For each page rendered onto the Canvas element, the image data is extracted as the variable b64str, which is then passed into the utility function loadImage(). This returns an Image() element from which Tesseract’s worker extracts the embedded text.
- For each page image processed, inputTxt.insertAdjacentText() appends the extracted text to the input field inputTxt until all pages of the PDF are processed.
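The word-joining loop can also be written as a single expression. The following is a minimal sketch, assuming each entry of the words array carries a `text` property as in the snippet above; `combineWords` is a hypothetical helper name. Unlike the loop, it skips empty entries and leaves no trailing space, so a separator would have to be added between pages when appending:

```javascript
// Hypothetical helper: joins the recognised words of one page into a
// single space-separated string, skipping blank entries.
function combineWords(words) {
  return words
    .map((w) => w.text.trim())
    .filter((t) => t.length > 0)
    .join(' ');
}

// Stand-in for result.data.words, using made-up sample words
const sampleWords = [{ text: 'Hello' }, { text: ' ' }, { text: 'world' }];
console.log(combineWords(sampleWords)); // Hello world
```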
Important Point: In each iteration of the while-loop, a single page image is processed by a single worker instance, which is terminated once the page is done. Hence, a new worker needs to be instantiated for every subsequent page to extract its embedded text content.
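The one-worker-per-page pattern described above can be sketched in isolation as follows. This is an illustration under stated assumptions rather than the app's actual code: `extractTextPerPage` and `createReadyWorker` are hypothetical names, and the factory is assumed to resolve to an object exposing recognize() and terminate(), much like the worker prepared by initTesseractWorker(). A stand-in worker is used so that the sketch runs without Tesseract.js loaded:

```javascript
// Sketch: one fresh worker per page, terminated even if recognition fails.
// `createReadyWorker` is a hypothetical async factory supplied by the caller.
async function extractTextPerPage(pageImages, createReadyWorker) {
  const pageTexts = [];
  for (const image of pageImages) {
    const pageWorker = await createReadyWorker();
    try {
      const result = await pageWorker.recognize(image);
      pageTexts.push(result.data.words.map((w) => w.text).join(' '));
    } finally {
      // Always release the worker before moving on to the next page
      await pageWorker.terminate();
    }
  }
  return pageTexts;
}

// Stand-in worker factory (made-up data) that counts how many workers it creates
let workersCreated = 0;
const fakeFactory = async () => {
  workersCreated++;
  return {
    recognize: async (image) => ({ data: { words: [{ text: `text-of-${image}` }] } }),
    terminate: async () => {}
  };
};

const demoRun = extractTextPerPage(['page1', 'page2'], fakeFactory);
demoRun.then((texts) => console.log(texts, workersCreated));
```

Wrapping the recognition step in try/finally means a failed page does not leave a worker running in the background.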
Preview of Implementation

Full source code is available at my GitHub repo: Text-To-Speech-App, or try it out at the demo!
- Note that additional functionalities have been added since Part I. They are:

![Image by Author | Selection of [┏🠋┓ Download Text] enables users to download all extracted text content present within the text field](https://towardsdatascience.com/wp-content/uploads/2022/04/1ukcZ7cKa0w3Qm3jFnpiHSA.png)
Many thanks for persisting to the end of this article! ❤ Hope you have found this implementation helpful.