
Document understanding is the first and most important step in document processing and extraction. It is the process of extracting information from an unstructured or semi-structured document and transforming it into a structured form. This structured representation can then be used to support downstream tasks such as information retrieval, summarization, and classification. There are many approaches to document understanding, but all of them share the same goal: to create a structured representation of the document content that can be used for further processing.
For semi-structured documents such as invoices, receipts, or contracts, Microsoft's LayoutLM model has shown great promise with the development of LayoutLM v1 and v2. For an in-depth tutorial, refer to my previous two articles, "Fine-Tuning Transformer Model for Invoice Recognition" and "Fine-Tuning LayoutLM v2 For Invoice Recognition".
In this tutorial, we will fine-tune Microsoft's latest LayoutLM v3 on invoices, similar to my previous tutorials, and compare its performance to the LayoutLM v2 model.
LayoutLM v3
The main advantage of LayoutLM v3 over its predecessors is its multi-modal transformer architecture, which combines text and image embeddings in a unified way. Instead of relying on a CNN for image embedding, the document image is represented as linear projections of image patches that are then embedded and aligned with the text tokens, as shown below. The main advantages of this approach are a reduction in the number of parameters and lower overall computation.
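To make this concrete, here is a minimal sketch (not the authors' implementation) of what such a ViT-style patch embedding looks like. The 224x224 resolution, 16x16 patch size, and 768-dimensional hidden size are illustrative values chosen to be in line with the base configuration; the exact numbers are not important here.

import torch
import torch.nn as nn

# Illustrative configuration: 224x224 image, 16x16 patches, 768-dim embeddings
patch_size, hidden_size = 16, 768

# A single linear projection of flattened image patches,
# implemented as a strided convolution
patch_embed = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)                 # a dummy document image
patches = patch_embed(image)                        # (1, 768, 14, 14)
patch_tokens = patches.flatten(2).transpose(1, 2)   # (1, 196, 768): one embedding per patch

# These patch embeddings are concatenated with the text token embeddings
# and fed to a single multi-modal Transformer (simplified view).
print(patch_tokens.shape)  # torch.Size([1, 196, 768])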

The authors show that "LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis".
Fine-tuning LayoutLM v3
Similar to my previous article, we will use the same dataset of 220 annotated invoices to fine-tune the LayoutLM v3 model. To perform the annotations, we used UBIAI since it supports OCR parsing, native PDF/image annotation, and export in a format compatible with the LayoutLM model without any post-processing. In addition, you can fine-tune the LayoutLM model right in the UBIAI platform and use it to auto-label your data, which can save a lot of manual annotation time.
Here is an excellent overview of how to use the tool to annotate PDFs and images:
After exporting the annotation file from UBIAI, we upload it to a Google Drive folder. We will use Google Colab for model training and inference.
The training and inference scripts can be accessed in the Google Colab notebooks below:
Training:
Inference:
- The first step is to open a Google Colab notebook, connect your Google Drive, and install the Transformers package from Hugging Face. Note that, unlike with LayoutLM v2, we do not need the Detectron2 package to fine-tune the model for entity extraction. However, for layout detection (outside the scope of this article), the Detectron2 package is still needed:
from google.colab import drive
drive.mount('/content/drive')
!pip install -q git+https://github.com/huggingface/transformers.git
! pip install -q git+https://github.com/huggingface/datasets.git "dill<0.3.5" seqeval
- Next, pull the preprocess.py script to process the ZIP file exported from UBIAI:
! rm -r layoutlmv3FineTuning
! git clone -b main https://github.com/UBIAI/layoutlmv3FineTuning.git
# Path to the ZIP file exported from UBIAI (stored on Google Drive)
IOB_DATA_PATH = "/content/drive/MyDrive/LayoutLM_data/Invoice_Project_mkWSi4Z.zip"
# Copy the export into a local data folder and unzip it
! rm -r data
! mkdir data
! cp "$IOB_DATA_PATH" data/dataset.zip
! cd data && unzip -q dataset && rm dataset.zip
- Run the preprocess script:
# Preprocessing arguments: validation split size and output path
TEST_SIZE = 0.33
DATA_OUTPUT_PATH = "/content/"
! python3 layoutlmv3FineTuning/preprocess.py --valid_size $TEST_SIZE --output_path $DATA_OUTPUT_PATH
- Load the datasets produced by the preprocessing step:
from datasets import load_metric
from transformers import TrainingArguments, Trainer
from transformers import LayoutLMv3ForTokenClassification,AutoProcessor
from transformers.data.data_collator import default_data_collator
import torch
# load datasets
from datasets import load_from_disk
train_dataset = load_from_disk(f'/content/train_split')
eval_dataset = load_from_disk(f'/content/eval_split')
label_list = train_dataset.features["labels"].feature.names
num_labels = len(label_list)
# build the label/id mappings used by the model
label2id, id2label = dict(), dict()
for i, label in enumerate(label_list):
    label2id[label] = i
    id2label[i] = label
- Define a few metrics for evaluation:
import numpy as np

metric = load_metric("seqeval")
return_entity_level_metrics = False

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_predictions, references=true_labels, zero_division=0)
    if return_entity_level_metrics:
        # Unpack nested dictionaries
        final_results = {}
        for key, value in results.items():
            if isinstance(value, dict):
                for n, v in value.items():
                    final_results[f"{key}_{n}"] = v
            else:
                final_results[key] = value
        return final_results
    else:
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }
- Load, train and evaluate the model:
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    id2label=id2label,
    label2id=label2id,
)
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
NUM_TRAIN_EPOCHS = 50
PER_DEVICE_TRAIN_BATCH_SIZE = 1
PER_DEVICE_EVAL_BATCH_SIZE = 1
LEARNING_RATE = 4e-5
training_args = TrainingArguments(
    output_dir="test",
    # max_steps=1500,
    num_train_epochs=NUM_TRAIN_EPOCHS,
    logging_strategy="epoch",
    save_total_limit=1,
    per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    # eval_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()
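To reuse the fine-tuned model for inference later, you will want to persist it to Google Drive. Here is a minimal sketch; the path below is an assumption, chosen to match the .pth file referenced in the inference section.

# Save the full fine-tuned model object to Google Drive so it can be
# reloaded for inference later; the path is illustrative.
model_save_path = "/content/drive/MyDrive/LayoutLM_data/layoutlmv3.pth"
torch.save(model, model_save_path)

# Alternatively, save in the standard Hugging Face format:
# model.save_pretrained("/content/drive/MyDrive/LayoutLM_data/layoutlmv3_model")
# processor.save_pretrained("/content/drive/MyDrive/LayoutLM_data/layoutlmv3_model")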
After training is done, the model is evaluated on the test dataset. Below is the model score after evaluation:
{'epoch': 50.0,
'eval_accuracy': 0.9521988527724665,
'eval_f1': 0.6913439635535308,
'eval_loss': 0.41490793228149414,
'eval_precision': 0.6362683438155137,
'eval_recall': 0.756857855361596,
'eval_runtime': 9.7501,
'eval_samples_per_second': 9.846,
'eval_steps_per_second': 9.846}
The model was able to achieve an F1-score of 0.69, with a recall of 0.76 and a precision of 0.64.
Let’s run the model on a new invoice that is not part of the training dataset.
Inference using LayoutLM v3
To run the inference, we will OCR the invoice using Tesseract and feed the information to our trained model to run predictions. To simplify the process, we have created a custom script with a few lines of code that lets you ingest the OCR output and run predictions with the model.
- First, let's import a few important libraries and load the model:
#drive mount
from google.colab import drive
drive.mount('/content/drive')
## install the Hugging Face Transformers library to load the LayoutLMv3 preprocessor
!pip install -q git+https://github.com/huggingface/transformers.git
## install tesseract OCR Engine
! sudo apt install tesseract-ocr
! sudo apt install libtesseract-dev
## install pytesseract; then click the "Restart runtime" button in the cell output and continue with the notebook
! pip install pytesseract
# ! rm -r layoutlmv3FineTuning
! git clone https://github.com/salmenhsairi/layoutlmv3FineTuning.git
import os
import torch
import warnings
from PIL import Image
warnings.filterwarnings('ignore')
# move all inference images from /content to 'images' folder
os.makedirs('/content/images',exist_ok=True)
for image in os.listdir():
    try:
        img = Image.open(f'{os.curdir}/{image}')
        os.system(f'mv "{image}" "images/{image}"')
    except:
        pass
# defining inference parameters
model_path = "/content/drive/MyDrive/LayoutLM_data/layoutlmv3.pth" # path to Layoutlmv3 model
imag_path = "/content/images" # images folder
# if the inference model is a .pth file, convert it to the Hugging Face pre-trained format
if model_path.endswith('.pth'):
    layoutlmv3_model = torch.load(model_path)
    model_path = '/content/pre_trained_layoutlmv3'
    layoutlmv3_model.save_pretrained(model_path)
- We are now ready to run predictions using the model:
# Call inference module
! python3 /content/layoutlmv3FineTuning/run_inference.py --model_path "$model_path" --images_path $imag_path
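For reference, here is a rough sketch of what such an inference step can look like under the hood: pytesseract extracts the words and their bounding boxes, the LayoutLMv3 processor builds the model inputs, and the model predicts a label per token. This is a simplified illustration rather than the exact contents of run_inference.py; it reuses the model_path variable from the cell above, and the image file name is assumed.

import torch
import pytesseract
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

# Load the fine-tuned model and its processor (apply_ocr=False since we run the OCR ourselves)
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(model_path)

image = Image.open("/content/images/invoice.png").convert("RGB")  # assumed file name
width, height = image.size

# OCR the invoice and collect words with their bounding boxes
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
words, boxes = [], []
for i, word in enumerate(ocr["text"]):
    if word.strip():
        x, y, w, h = ocr["left"][i], ocr["top"][i], ocr["width"][i], ocr["height"][i]
        # Normalize boxes to the 0-1000 range expected by LayoutLM models
        boxes.append([int(1000 * x / width), int(1000 * y / height),
                      int(1000 * (x + w) / width), int(1000 * (y + h) / height)])
        words.append(word)

# Build model inputs and predict a label for each token
encoding = processor(image, words, boxes=boxes, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**encoding)
predictions = outputs.logits.argmax(-1).squeeze().tolist()
predicted_labels = [model.config.id2label[p] for p in predictions]
print(predicted_labels[:20])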

With only 220 annotated invoices, the model was able to correctly predict the seller name, dates, invoice number, and total price (TTC)!
If we look closely, we notice that it mistakenly labeled the laptop's total price as the total invoice price. Given the model's score, this is not surprising and hints that more training data is required.
Comparing LayoutLM v2 vs LayoutLM v3
Apart from being less computationally intensive, does LayoutLM v3 provide a performance boost compared to its v2 counterpart? To answer this question, we compare the outputs of both models on the same invoice. Here is the LayoutLM v2 output, as shown in my previous article:

We observe a few differences:
- The v3 model was able to detect most of the keys correctly, whereas v2 failed to predict invoice_ID, Invoice number_ID, and Total_ID.
- The v2 model incorrectly labeled the total price of $1,445.00 as MONTANT_HT (French for total price before tax), whereas v3 predicted the total price correctly.
- Both models made a mistake in labeling the laptop price as Total.
Based on this single example, LayoutLM v3 shows better performance overall, but we would need to test on a larger dataset to confirm this observation.
Conclusion
By open-sourcing the LayoutLM models, Microsoft is leading the way in the digital transformation of many businesses, ranging from supply chain and healthcare to finance and banking.
In this step-by-step tutorial, we have shown how to fine-tune LayoutLM v3 on a specific use case: invoice data extraction. We then compared its performance to LayoutLM v2 and found a slight performance boost that still needs to be verified on a larger dataset.
Based on both the performance and computational gains, I would highly recommend leveraging the new LayoutLM v3.
If you are interested in creating your own training dataset in the most efficient and streamlined way, don't hesitate to try out UBIAI's OCR annotation feature here for free.
Follow us on Twitter @UBIAI5 or subscribe here!