OCR-Free Document Data Extraction with Transformers (1/2)

Donut versus Pix2Struct on custom data

Toon Beerten
Towards Data Science


Image by author

Donut and Pix2Struct are image-to-text models that combine the simplicity of pure pixel inputs with visual language understanding tasks. Simply put: an image goes in and extracted indexes come out as JSON.

Recently I released a Donut model finetuned on invoices. Every so often I get the question of how to train it on a custom dataset. Meanwhile, a similar model was released: Pix2Struct, which claims to be significantly better. But is that so?

Time to roll up my sleeves. I will show you:

  • how to prepare your data for finetuning Donut and Pix2Struct
  • the training procedure for both models
  • comparative results on an actual dataset

Of course I’ll provide the Colab notebooks as well, for easy experimentation and/or replication on your end.

Dataset

To do this comparison, I need a public dataset. I wanted to avoid the usual ones for document understanding tasks, such as CORD, had a look around, and found the Ghega dataset. It’s quite small (~250 documents) and consists of two types of documents: patent applications and datasheets. With the two types we can simulate a classification problem. Per type we have multiple indexes to extract, and these indexes are unique to each type. Exactly what I need. Prof. Medvet from the machine learning lab at the University of Trieste graciously approved the usage for these articles.

The dataset is quite old, so we need to investigate whether it still suits our goal.

First exploration

When you get a new set of data, you first need to get acquainted with how it is structured. Luckily the website’s detailed description aids us. This is the dataset file structure:

ghega-dataset
  datasheets
    central-zener-1
    central-zener-2
    diodes-zener
      document-000-123542.blocks.csv
      document-000-123542.groundtruth.csv
      document-000-123542.in.000.png
      document-000-123542.out.000.png
      document-001-123663.blocks.csv
      document-001-123663.groundtruth.csv
      document-001-123663.in.000.png
      document-001-123663.out.000.png
      ...
    mcc-zener
    ...
  patents
    ...

We can see two main subfolders for the two doctypes: datasheets and patents. One level lower we have subfolders that are not important by themselves, but the files they contain start with a common prefix: a unique document identifier, e.g. document-000-123542. For each of these identifiers we have four kinds of data:

  • The blocks.csv file contains info about bounding boxes. As neither Donut nor Pix2Struct uses this info, we can ignore these files.
  • The out.000.png file is the postprocessed (deskewed) image file. As I would rather test on unprocessed files, I will ignore these as well.
  • The raw, unprocessed document image has the in.000.png suffix. That’s what we are interested in.
  • And finally there is the corresponding groundtruth.csv file. It contains the indexes for this image that we consider the ground truth (see the pairing sketch after this list).
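Since the image and its groundtruth share the identifier prefix, pairing them up is straightforward. A minimal sketch, assuming the dataset has been extracted to ghega-dataset/:

from pathlib import Path

#collect (image, groundtruth) pairs per document identifier
pairs = []
for img in Path('ghega-dataset').rglob('*.in.000.png'):
    gt = img.with_name(img.name.replace('in.000.png', 'groundtruth.csv'))
    if gt.exists():
        pairs.append((img, gt))
print(f'{len(pairs)} document/groundtruth pairs found')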

Here is a sample groundtruth csv along with the column description:

Case,-1,0.0,0.0,0.0,0.0,,0,1.28,2.78,0.79,0.10,MELF CASE
StorageTemperature,0,0.35,3.40,2.03,0.11,Operating and Storage Temperature,0,4.13,3.41,0.63,0.09,-65 to +200
 1. element type
 2. page of the label block (-1 if absent)
 3. x of the label block
 4. y of the label block
 5. w of the label block
 6. h of the label block
 7. text of the label block
 8. page of the value block (never absent!)
 9. x of the value block
10. y of the value block
11. w of the value block
12. h of the value block
13. text of the value block

That means we are only interested in the first and last columns: the first is the key and the last is the value. In this case:

KEY                  VALUE
Case                 MELF CASE
StorageTemperature   -65 to +200

So for this document we will finetune the models to look for a ‘Case’ with value ‘MELF CASE’ and also to extract a ‘StorageTemperature’ of ‘-65 to +200’.
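To make this concrete, here is a minimal sketch that reads a groundtruth file with pandas and prints the key/value pairs (the file path is just an example):

import pandas as pd

#the groundtruth csv has no header row
doc_df = pd.read_csv('ghega-dataset/datasheets/diodes-zener/document-000-123542.groundtruth.csv', header=None)

#column 0 holds the element type (key), column 12 the text of the value block
for _, row in doc_df.iterrows():
    print(f'{row[0]}: {row[12]}')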

Indexes

The following indexes exist in the groundtruth metadata:

  • datasheets: Model, Type, Case, Power Dissipation, Storage Temperature, Voltage, Weight, Thermal Resistance
  • patents: Title, Applicant, Inventor, Representative, Filing Date, Publication Date, Application Number, Publication Number, Priority, Classification, Abstract 1st line

Looking at the quality of the ground truth and the feasibility, I chose to retain the following indexes:

elements_to_extract = ['FilingDate', 'RepresentiveFL', 'Classification', 'PublicationDate', 'ApplicationNumber', 'Model', 'Voltage', 'StorageTemperature']

Quality

For the image-to-text conversion, ocropus version 0.2 was used, which means the dataset dates to about the end of 2014. This is ancient in terms of data science, so does the groundtruth quality live up to our task?

For this I had a look at random images and compared the groundtruth with what was actually written on the document. Here are two examples where the OCR was incorrect:

document-001-109381.in.000.png from the Ghega dataset

The key Classification is set to BGSD 81/00 as ground truth, while it should be B65D 81/100.

document-003-112107.in.000.png from the Ghega dataset

The key StorageTemperature says I -65 {O + 150 as ground truth, while we can see it should be -65 to +150.

There are many such errors in the dataset. One approach is to correct them; another is to ignore them. Since I will use the same data just for comparing both models, I chose the latter. Should the data be used in production, you may want to choose the former option to get the best results.

(Also note that these special characters could mess up the JSON format; I will come back to that topic later.)

Donut dataset structure

What format does the data need to be in?
For finetuning the Donut model we need the data organized in one folder with all the documents as separate image files, plus one metadata file structured as a JSON Lines file.

donut-dataset
  document-000-123542.in.000.png
  document-001-123663.in.000.png
  ...
  metadata.jsonl

The JSONL file contains one line per image file, like this:

{"file_name": "document-010-100333.in.000.png", "ground_truth": "{\"gt_parse\": { \"DocType\": \"patent\", \"FilingDate\": \"06.12.1999\", \"RepresentiveFL\": \"Manresa Val, Manuel\", \"Classification\": \"A47l. 5/28\", \"PublicationDate\": \"1139845\", \"ApplicationNumber\": \"99959528 .3\" } }"}

Let’s break down this JSON line. At the top level we have a dict with two elements: file_name and ground_truth. Under the ground_truth key, we have (as an escaped string) a dict with the key gt_parse. Its value is itself a dict with the key/value pairs that we know are on the document, or better: that we assign to it. Remember that the doctype is not necessarily present as text in the document: the term datasheet does not appear as text on those documents.

Luckily Pix2Struct uses the same format for finetuning, so we can kill two birds with one stone: once we have converted the data into this structure, we can use it for finetuning Pix2Struct as well.
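Incidentally, a robust way to build such a line is to let json.dumps handle the escaping instead of concatenating escaped strings by hand (the conversion code below builds the strings manually). A minimal sketch, using the values from the example above:

import json

gt = {"gt_parse": {"DocType": "patent",
                   "FilingDate": "06.12.1999",
                   "RepresentiveFL": "Manresa Val, Manuel"}}

#ground_truth must itself be a JSON *string*, hence the double encoding
line = json.dumps({"file_name": "document-010-100333.in.000.png",
                   "ground_truth": json.dumps(gt)})
print(line)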

Conversion

For the conversion itself, I created a Jupyter notebook on Colab. I decided to split the data into a train and a validation set at this stage, as opposed to just before finetuning. This way, the same validation images will be used for both models and the results will be directly comparable. Roughly one out of five documents will be used for validation.

With the above knowledge of the structure of the Ghega dataset, we can construct the conversion procedure as follows:

For every filename ending in in.000.png, we take the corresponding groundtruth file and create a temporary dataframe object.
Beware that the groundtruth file could be empty or might not exist at all (e.g. for datasheets/taiwan-switching).
Next, we deduce the class from the subfolder: patent or datasheet.
Now we have to build the JSON line. For each element/index we want to extract, we check whether it is in that dataframe and collect it. Then we copy the image itself.
Do this for all images, and at the end we have a JSONL file to write out.

In Python it looks like this:

import os
import re
import random
import shutil
import pandas as pd

#output folders for the train/validation split
os.makedirs('/content/dataset/train', exist_ok=True)
os.makedirs('/content/dataset/validation', exist_ok=True)

ghega_df = pd.DataFrame()  #collects the extracted values for sanity checks later
json_lines_train = ''
json_lines_val = ''

for dirpath, dirnames, filenames in os.walk('/content/ghega-dataset/'):
    for filename in filenames:
        if filename.endswith('in.000.png'):
            gt_filename = filename.replace('in.000.png', 'groundtruth.csv')
            gt_filename_path = os.path.join(dirpath, gt_filename)
            if not os.path.exists(gt_filename_path): #ignore files in /ghega-dataset/datasheets/taiwan-switching/ because no groundtruth exists
                continue
            if os.path.getsize(gt_filename_path) == 0: #ignore empty groundtruth files
                print(f'skipped {gt_filename_path} because no info in metadata')
                continue
            doc_df = pd.read_csv(gt_filename_path, header=None)
            #find the doctype, based on path
            if 'patent' in dirpath:
                doc_type = 'patent'
            else:
                doc_type = 'datasheet'
            #create json line, e.g.:
            #{"file_name": "document-034-127420.in.000.png", "ground_truth": "{\"gt_parse\": { \"DocType\": \"datasheet\", \"Model\": \"ZMM5221 B - ZMM5267B\", \"Voltage\": \"1.5\", \"StorageTemperature\": \"-65 to 175\" } }"}
            p1 = '{"file_name": "' + filename + '", "ground_truth": "{\\"gt_parse\\": { '
            p2 = ''
            #always add the first element: DocType
            p2 += '\\"' + 'DocType' + '\\": '
            p2 += '\\"' + doc_type + '\\"'
            new_row = {'ImagePath': os.path.join(dirpath, filename), 'DocType': doc_type}
            ghega_df = pd.concat([ghega_df, pd.DataFrame([new_row])], ignore_index=True)
            #fill other elements if available
            for element in elements_to_extract:
                value = doc_df[doc_df[0] == element][12].tolist()
                if len(value) > 0:
                    p2 += ', '
                    p2 += '\\"' + element + '\\": '
                    value = re.sub(r'[^A-Za-z0-9 ,.()/-]+', '', str(value[0])) #get rid of \ or ” or " that would break the json
                    p2 += '\\"' + value + '\\"'
                    new_row = {'ImagePath': os.path.join(dirpath, filename), element: value}
                    ghega_df = pd.concat([ghega_df, pd.DataFrame([new_row])], ignore_index=True)

            p3 = ' } }"}'
            json_line = p1 + p2 + p3
            print(json_line)

            #send ~20% to validation: copy the image file and append the json line
            if random.randint(1, 100) < 20:
                json_lines_val += json_line + '\r\n'
                shutil.copy(os.path.join(dirpath, filename), '/content/dataset/validation/')
            else:
                json_lines_train += json_line + '\r\n'
                shutil.copy(os.path.join(dirpath, filename), '/content/dataset/train/')

#write the jsonl files
with open('/content/dataset/train/metadata.jsonl', 'w') as text_file:
    text_file.write(json_lines_train)
with open('/content/dataset/validation/metadata.jsonl', 'w') as text_file:
    text_file.write(json_lines_val)

The ghega_df dataframe is there for sanity checks or statistical analysis if wanted. I used it to check random samples to verify that my converted data was actually correct.
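For example, a few quick checks one could run on ghega_df (a sketch; each row holds the image path plus one extracted field, as per the loop above):

#how many documents of each type were picked up?
print(ghega_df['DocType'].value_counts())

#how many times was each element actually found?
print(ghega_df.count())

#eyeball a random sample of extracted values
print(ghega_df.sample(10))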

Hiccups

Once converted, everything looks copacetic. But I want to dispel the idea that things usually run right on the first try; there are always small, unexpected hiccups. Walking through the errors I encountered and showing the remedies should be useful for anybody mimicking this whole process with their own dataset.

For example, after converting the dataset, I wanted to train the Donut model. Before I can do that I need to create a train dataset, like so:

train_dataset = DonutDataset("/content/dataset", max_length=max_length,
                             split="train", task_start_token="<s_cord-v2>", prompt_end_token="<s_cord-v2>",
                             sort_json_key=False, # dataset is preprocessed, so no need for this
                             )

And got this error:

---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-13-7726ec2b0341> in <cell line: 7>()
5 processor.feature_extractor.do_align_long_axis = False
6
----> 7 train_dataset = DonutDataset("/content/dataset", max_length=max_length,
8 split="train", task_start_token="<s_cord-v2>", prompt_end_token="<s_cord-v2>",
9 sort_json_key=False, # cord dataset is preprocessed, so no need for this

ArrowInvalid: JSON parse error: Missing a comma or '}' after an object member. in row 7

So it seems there is a problem with the JSON format in row 7. I copied that line and pasted it into an online JSON validator:

Image by author

The validator, however, says it’s a valid JSON line. That makes sense: the outer JSON is fine, and the problem hides inside the escaped ground_truth string, which is only parsed as JSON in a second step. So let’s have a deeper look:

{
"file_name":"document-012-108498.in.000.png",
"ground_truth":"{\"gt_parse\": {\"DocType\": \"patent\"\"FilingDate\": \"15. Januar 2004 (15.01.2004)\",\"Classification\": \"BOZC 18/08,\",\"PublicationDate\": \"5. August 2004 (05.08.2004)\",\"ApplicationNumber\": \"PCT/AT2004/000006\"} }"
}

Did you spot the error? After some time I noticed there’s a missing comma between the DocType and FilingDate elements. It was missing on all lines, however, so it’s unclear to me why row 7 in particular was flagged. When I fixed this issue and tried again, it claimed there was a problem on row 17:

ArrowInvalid: JSON parse error: Missing a comma or '}' after an object member. in row 17

Here is row 17; do you spot the problem?

{"file_name": "document-007-103668.in.000.png", "ground_truth": "{\"gt_parse\": {\"DocType\": \"patent\",\"FilingDate\": \"18.12.2008\",\"RepresentiveFL\": \"Schubert, Siegmar\",\"Classification\": \"A47J 31/42 (2""6·"')\",\"PublicationDate\": \"12.08.2009\",\"ApplicationNumber\": \"08021980.1\"} }"}

It’s the unescaped quotation marks in the Classification value. To remedy this, I decided that all values will only be allowed to contain alphanumeric characters and a few special characters, using this regex:

[^A-Za-z0-9 ,.()/-]+

This may hurt the true performance somewhat, but from what I could see, any other characters were caused by OCR errors anyway. I suppose that for the relative comparison between the models, leaving them out doesn’t matter much.
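To catch such problems before the training step, it’s worth validating every line upfront, including the nested ground_truth string. A minimal sketch, assuming the metadata file from above:

import json

with open('/content/dataset/train/metadata.jsonl') as f:
    for i, line in enumerate(f, start=1):
        if not line.strip():
            continue
        try:
            outer = json.loads(line)             #outer dict: file_name + ground_truth
            json.loads(outer['ground_truth'])    #nested JSON string: gt_parse
        except (json.JSONDecodeError, KeyError) as e:
            print(f'row {i}: {e}')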

Data preparation: done

The importance of preparing data before training is often overlooked and certainly underestimated. With the steps above I have shown how you can adapt your own data to be used by both Donut and Pix2Struct for key index extraction on documents. Common pitfalls were also remedied. The Jupyter notebook with all the steps can be found here. We’re halfway there: the next step is to train both models on this dataset. I’m very curious how well they fare, but the training and comparison will be covered in the next article.
