OCR-Free Document Understanding with Donut

Use the recently released Transformers model to generate JSON representations of your document data

Neha Desaraju
Towards Data Science


Photo by Romain Dancre on Unsplash

Visual document understanding (VDU) is a heavily researched new field in deep learning and data science, particularly because there is a wealth of unstructured data locked away in PDFs and document scans. Recent models, such as LayoutLM, use a transformer-based deep learning architecture to label words or answer questions based on an image of a document (for example, you might highlight and label the account number by annotating the image itself, or ask the model, “What is the account number?”). Libraries such as HuggingFace’s transformers make it easier to work with open-source transformer models.

Most traditional approaches to VDU rely on parsing the OCR output of the document image, along with visual encodings. But OCR is computationally expensive (it typically requires installing an OCR engine like Tesseract), and adding it to the pipeline means yet another model that must be trained and fine-tuned; an inaccurate OCR model will propagate its errors into the VDU model.

Thus, researchers from Naver CLOVA proposed an end-to-end VDU solution [1] that uses an encoder-decoder transformer architecture, and they recently made it available through the HuggingFace transformers library. In other words, the model encodes the image (split into patches by a Swin Transformer) into token vectors, which it then decodes, or translates, into an output sequence in the form of a data structure (which can then be further parsed into JSON) using a BART decoder pretrained on publicly available multilingual datasets. Any prompt fed into the model at inference time is decoded within the same architecture.

Image by authors of Donut (MIT license)

You can see a demo of Donut fine-tuned on the CORD receipts dataset here. They’ve provided a sample receipt image to try it out, but you can also test it on a number of other document images. When I tested it on this image:

Image provided by authors of Donut

I got the result:

{
  "nm": "Presentation"
}

which indicates that it detected the “Presentation” title to be the name of an item on the menu or receipt.
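
Under the hood, the decoder doesn’t emit JSON directly; it generates a sequence of field tokens (such as <s_nm>…</s_nm>) that the library then converts into JSON for you. As a rough illustration of the idea only (this is my own simplification, not Donut’s actual parser, which also handles nested fields and lists), a flat tag sequence can be mapped to a dictionary like so:

import re

def sequence_to_json(seq):
    # Toy parser: map flat Donut-style field tags to a dict.
    # The real library additionally handles nesting, lists, and special tokens.
    return {key: value.strip() for key, value in re.findall(r"<s_(.+?)>(.*?)</s_\1>", seq)}

print(sequence_to_json("<s_nm>Presentation</s_nm>"))  # {'nm': 'Presentation'}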

The authors have also provided training and testing scripts, so we can demonstrate how to actually use the models in practice (I’ll be using the SROIE dataset [2], a dataset of labelled receipts and invoices, to demonstrate fine-tuning on a custom dataset). I’d recommend running the code on a GPU, as both inference and training will take quite a while on a CPU. Google Colab offers free GPU access and should be enough to fine-tune (go to Runtime > Change runtime type to switch from CPU to GPU).

First, let’s make sure we have GPU access.

import torch

# Confirm PyTorch can see a GPU, and check the installed CUDA toolkit version
print("CUDA available:", torch.cuda.is_available())
!nvcc --version

And now we can download the relevant files and libraries. The following lines of code should install all dependencies, including the donut library (you could install the library on its own with pip install donut-python, but cloning the codebase from GitHub also gives us the important training and testing scripts).

!git clone https://github.com/clovaai/donut.git
!cd donut && pip install .
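
To quickly confirm the installation worked, you can try importing the library (an optional sanity check):

!python -c "from donut import DonutModel; print('donut import OK')"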

Making an inference using the CORD fine-tuned model

First, we’ll demonstrate basic usage of the model.

from donut import DonutModel
from PIL import Image
import torch

# Load the CORD-finetuned model from the HuggingFace Hub
model = DonutModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

if torch.cuda.is_available():
    # Use half precision on GPU for faster inference
    model.half()
    device = torch.device("cuda")
    model.to(device)
else:
    model.encoder.to(torch.bfloat16)

model.eval()

# Open the sample receipt image included in the Donut repository
image = Image.open("./donut/misc/sample_image_cord_test_receipt_00004.png").convert("RGB")
output = model.inference(image=image, prompt="<s_cord-v2>")
output

In the DonutModel.from_pretrained() call, I’ve simply specified the name of the pretrained model from the HuggingFace Hub (the necessary files are downloaded at this time), though I can also specify the local path to a model folder, as we’ll demonstrate later. The Donut codebase also includes a sample image (shown below), which is what I’ve passed into the model, but you can test the model out with any image you’d like.

Receipt sample image provided by authors of Donut

You should get an output like

{'predictions': [{'menu': [{'cnt': '2', 'nm': 'ICE BLAOKCOFFE', 'price': '82,000'},
                           {'cnt': '1', 'nm': 'AVOCADO COFFEE', 'price': '61,000'},
                           {'cnt': '1', 'nm': 'Oud CHINEN KATSU FF', 'price': '51,000'}],
                  'sub_total': {'discount_price': '19,400', 'subtotal_price': '194,000'},
                  'total': {'cashprice': '200,000',
                            'changeprice': '25,400',
                            'total_price': '174,600'}}]}

(Note: If you’re curious, like me, about what output the pretrained donut-base backbone would give you, I went ahead and tested that. It ran for a long while without producing an output, then crashed because it used up too much RAM.)

Finetuning Donut on a custom dataset

To demonstrate fine-tuning, I’ll be using the SROIE dataset, a collection of receipt and invoice scans that come with their basic information as JSON, as well as word-level bounding boxes and text. It contains 626 images, but I’ll only be training on 100 to demonstrate Donut’s effectiveness. It’s a smaller dataset than CORD (which contains roughly 1,000 images), and it also has far fewer labels (only company, date, address, and total).

Downloading and parsing SROIE

To download the dataset, you only need the data folder from the main repository. You can get it either by cloning the whole repository or by using something like Download Directory to download just that single folder.

Now we need to parse the dataset into the format required by the HuggingFace datasets library, which is what Donut uses under the hood to load a custom dataset as an image-string table. (If you’re looking for documentation, Donut uses the imagefolder loading script.)

Here’s the desired dataset format:

dataset_name
├── test
│   ├── metadata.jsonl
│   ├── {image_path0}
│   ├── {image_path1}
│   .
│   .
├── train
│   ├── metadata.jsonl
│   ├── {image_path0}
│   ├── {image_path1}
│   .
│   .
└── validation
    ├── metadata.jsonl
    ├── {image_path0}
    ├── {image_path1}
    .
    .

Where metadata.jsonl is a JSON lines document that looks like

{"file_name": {image_path0}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
{"file_name": {image_path1}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}

In other words, we want to convert each document’s annotations (found in the key folder) into a ground truth JSON-dumped string that looks like "{\"gt_parse\": {actual JSON content}}". Here’s an example annotation:

{
  "company": "BOOK TA .K (TAMAN DAYA) SDN BHD",
  "date": "25/12/2018",
  "address": "NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.",
  "total": "9.00"
}
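
After the conversion, the corresponding line in metadata.jsonl for this annotation would look something like this (the file name 000.jpg is just a placeholder for the actual image name):

{"file_name": "000.jpg", "ground_truth": "{\"gt_parse\": {\"company\": \"BOOK TA .K (TAMAN DAYA) SDN BHD\", \"date\": \"25/12/2018\", \"address\": \"NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.\", \"total\": \"9.00\"}}"}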

Here’s the script I used to transform the data into JSON lines files, as well as copy the images into their respective folders:

import os
import json
import shutil
from tqdm.notebook import tqdm

# Make sure the target folder exists before writing to it
os.makedirs("./sroie-donut/train", exist_ok=True)

lines = []
images = []

# Read the first 100 annotation files and collect their ground truth parses
for ann in tqdm(os.listdir("./sroie/key")[:100]):
    if ann != ".ipynb_checkpoints":
        with open("./sroie/key/" + ann) as f:
            data = json.load(f)
        images.append(ann[:-4] + "jpg")
        line = {"gt_parse": data}
        lines.append(line)

# Write metadata.jsonl and copy each image into the new train folder
with open("./sroie-donut/train/metadata.jsonl", "w") as f:
    for i, gt_parse in enumerate(lines):
        line = {"file_name": images[i], "ground_truth": json.dumps(gt_parse)}
        f.write(json.dumps(line) + "\n")
        shutil.copyfile("./sroie/img/" + images[i], "./sroie-donut/train/" + images[i])

I simply ran this script three times, changing the names of the folders and the list slice ([:100]) each time, so that I had 100 examples in train and 20 examples each in validation and test.
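
If you want to sanity-check the prepared dataset before training, you can load it with the HuggingFace datasets imagefolder loader that Donut relies on under the hood (this step is optional and not part of the Donut scripts):

from datasets import load_dataset

# Each split's metadata.jsonl columns are exposed as dataset features
dataset = load_dataset("imagefolder", data_dir="./sroie-donut")
print(dataset)                               # should list the train/validation/test splits
print(dataset["train"][0]["ground_truth"])   # the JSON-dumped gt_parse string for one image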

Training the model

The authors of Donut offer a very easy method to train the model. First, we’ll need to create a new config file in the donut/config folder. You can copy the example that’s already in there (train_cord.yaml) to a new file called train_sroie.yaml. These are the values I changed:

dataset_name_or_paths: ["../sroie-donut"]
train_batch_sizes: [1]
check_val_every_n_epochs: 10
max_steps: -1 # infinite, since max_epochs is specified

If you’ve locally downloaded the donut-base model, you can also specify the path to it in pretrained_model_name_or_path. Otherwise, HuggingFace will download it directly from the Hub.

I decreased the batch size from 8 as I got a CUDA out of memory error on Google Colab, and increased the check_val_every_n_epochs to 10 to save time.

And here is the line you should use to train your model:

cd donut && python train.py --config config/train_sroie.yaml

It took me around an hour to finish training on the GPU provided by Google Colab.

Running inference with the fine-tuned model

Using a script similar to the CORD demonstration above, we can run:

from donut import DonutModel
from PIL import Image
import torch

# Load the fine-tuned model from the local results folder produced by train.py
model = DonutModel.from_pretrained("./donut/result/train_sroie/20220804_214401")

if torch.cuda.is_available():
    model.half()
    device = torch.device("cuda")
    model.to(device)
else:
    model.encoder.to(torch.bfloat16)

model.eval()

# Run inference on one of the held-out test images
image = Image.open("./sroie-donut/test/099.jpg").convert("RGB")
output = model.inference(image=image, prompt="<s_sroie-donut>")
output

Notice that we’ve changed the model path in the DonutModel.from_pretrained() call, and we’ve also changed the inference prompt to be in the format <s_{dataset_name}>. Here’s the image I used:

Image from SROIE dataset

And these were my results:

{'predictions': [{'address': 'NO 290, JALAN AIR PANAS. SETAPAK. 53200, KUALA LUMPUR.',
                  'company': 'SYARIKAT PERNIAGAAN GIN KEE',
                  'date': '04/12/2017',
                  'total': '47.70'}]}

Final thoughts

I’ve noticed that the final output using Donut’s pseudo-OCR is much more accurate than traditional off-the-shelf OCR methods. As an extreme example, here’s the same CORD document from the demonstration OCRed using Tesseract’s OCR engine:

*' il " i
- ' s ' -
W =
o o
ok S
?flfi (€
rgm"f"; o ;
L i 4

The image was blurry, had low contrast, and was hard to read even for a human, so it’s unlikely anyone would expect a model to recognize the characters. It’s impressive that Donut is able to do so with its own techniques. Even on high-quality documents, commercial OCR models do provide better results than open-source OCR engines such as Tesseract, but they’re often costly, and they’re only better because of intensive training on commercial datasets and more compute power.

Alternatives to parsing the OCR output of a document include using pure computer vision techniques to highlight blocks of text, parse tables, or identify images, figures, and mathematical equations, but these once again require the user to run OCR on the resulting bounding boxes if meaningful text is to be extracted. Libraries in this space include LayoutParser and deepdoctection, both of which connect to a model zoo of Detectron2 computer vision models to deliver results.

Additionally, the authors of Donut provide a testing script, located in the test.py file in the Donut codebase, that you can use to evaluate your fine-tuned model. It reports an F1 score, measured on exact matches against the ground truth parse, as well as an accuracy score given by a Tree Edit Distance algorithm, which determines how close the predicted JSON tree is to the ground truth JSON.

cd ./donut && python test.py \
    --dataset_name_or_path ../sroie-donut \
    --pretrained_model_name_or_path ./result/train_sroie/20220804_214401 \
    --save_path ./result/train_sroie/output.json

With my SROIE-finetuned model, the mean accuracy across all 20 of my test images was 94.4%.

Donut also comes packaged with SynthDoG, a synthetic document generator that can produce additional fake documents for data augmentation in four different languages. It draws its text from the English, Chinese, Japanese, and Korean Wikipedias, which helps address a limitation of traditional OCR/VDU methods: they are often constrained by the lack of large amounts of data in languages other than English.

[1] Geewook Kim, et al. “OCR-free Document Understanding Transformer.” (2021). MIT license.

[2] Zheng Huang, et al. “ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction.” 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019. MIT license.

Neha Desaraju is a student at the University of Texas at Austin studying computer science. You can find her online at estaudere.github.io.
