
Using a Multimodal Document ML Model to Query Your Documents

Leverage the power of the mPLUG-DocOwl document understanding model to ask questions about your documents

This article will discuss Alibaba's mPLUG-DocOwl document understanding model, which was recently released together with its model weights and datasets. It is a powerful model capable of performing various tasks, such as document question answering, information extraction, and document embedding, making it a helpful tool when working with documents. I will implement the model locally and test it on different tasks to give an opinion on its performance and usefulness.

This article will discuss the latest model within document understanding. Image by ChatGPT. OpenAI. (2024). ChatGPT (4) [Large language model]. https://chat.openai.com

Table of contents

· Motivation
· Tasks
· Running the model locally
· Testing of the model
  ∘ Data
  ∘ Testing the first, leftmost receipt
  ∘ Testing the second, rightmost receipt
  ∘ Testing the first, leftmost lecture note
  ∘ Testing the second, rightmost lecture note
· My thoughts on the model
· Conclusion

Motivation

My motivation for this article is to test the latest publicly available machine-learning models. This model caught my attention since I have worked, and still work, on machine learning applied to documents. I have also previously written an article on my work with a similar model called Donut, which does OCR-free document understanding. I think the concept of having a document and asking visual and textual questions about it is awesome, so I spend time working with document understanding models and testing their performance. This is the second article in my series on testing the latest machine-learning models; you can read my first article, on time series forecasting with Chronos, below:

How to Effectively Forecast Time Series with Amazon’s New Time Series Forecasting Model

Tasks

The model can be used for the various tasks mentioned in the GitHub repository. The most exciting task, in my opinion, is document VQA (visual question answering), where you ask questions about a document and DocOwl answers them. Furthermore, you can do information extraction to extract meaningful information from the document. This is useful if, for example, you feed the extracted information into other tasks that require a specific data format, such as storing the date and time of documents.

Furthermore, you can use the model for image captioning, where you ask the model to describe the visual contents of the document. This can be useful for tagging documents with categories or for adding a textual description to images.

Running the model locally

To thoroughly test the model’s capabilities, you should download and run it locally. This allows you to use your own GPU, which will likely be faster than running the model on HuggingFace Spaces.

To use the model, I followed the instructions on this GitHub page. First, you need to clone the Git repo with:

git clone https://github.com/X-PLUG/mPLUG-DocOwl.git

You now have to download the model, which you can do in several ways. One way is to install Git LFS (Git Large File Storage) and clone the model with Git, using the following commands given on HuggingFace:

git lfs install
git clone https://huggingface.co/mPLUG/DocOwl1.5-Omni

Another way to download the model is to download the files directly from the HuggingFace model page.
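A third option, if you prefer a programmatic route, is the huggingface_hub Python library. The following is a minimal sketch, assuming you have installed huggingface_hub with pip; the local folder name is your choice:

# Sketch: download the DocOwl1.5-Omni snapshot (about 20 GB) via huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id='mPLUG/DocOwl1.5-Omni',
    local_dir='./DocOwl1.5-Omni',  # any local folder works
)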

Whichever approach you choose, you should end up with a folder of around 20 GB called DocOwl1.5-Omni. Note that this is the most capable model, but you can also download the models below, listed from least to most capable:

  • DocOwl1.5-stage1
  • DocOwl1.5
  • DocOwl1.5-Chat

After downloading the model, go into the directory of the cloned Git repository and create a new file containing the code below. Ensure the new file is in the directory /mPLUG-DocOwl/DocOwl1.5/ (where mPLUG-DocOwl is the folder created by running the git clone command).

# Load the DocOwl model from the downloaded checkpoint folder
from docowl_infer import DocOwlInfer

model_path = "<path to folder containing the model>"
docowl = DocOwlInfer(ckpt_path=model_path, anchors='grid_9', add_global_img=True)
print('load model from ', model_path)

Then, you can query the model with:

# Ask a question about a sample document from the DocDownstream dataset
image = './DocDownstream-1.0/imgs/DUE_Benchmark/DocVQA/pngs/rnbx0223_193.png'
query = 'What is the Compound Annual Growth Rate (CAGR) for total assets?'
answer = docowl.inference(image, query)
print(answer)
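The same inference call works on your own files. For instance, this is how I query one of the receipts used below (the file path here is a hypothetical placeholder):

# Query your own document image; the path is a placeholder
image = './my_receipts/rema1000_receipt.png'
query = 'What is the total sum of the items on the receipt?'
answer = docowl.inference(image, query)
print(answer)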

Testing of the model

Data

I will use some of my own document data to test the model. First, I have two receipts:

I will use the receipts to query the model. Image by the author. The author also obtained the receipts.

The first, leftmost receipt is a bit blurry in places, though the whole receipt is still readable. The second receipt is photographed more zoomed in and is easier for the model to interpret. You should note that both receipts are from a Norwegian supermarket, which can cause issues for the model, considering it is trained on English documents.

I will also test on some lecture notes that I took while studying at TU Delft:

Snippets of the documents I will use to query the model. Image by the author. The documents were also created by the author.

The lecture notes have a small font size and some handwritten data, which could pose problems for the model.

I use the receipts and lecture notes because the receipts let me test the model on photographed documents (not digitally born data), while the lecture notes let me test digitally born data, both printed and handwritten.

I will then test the model on each of these four documents. Ideally, I would implement quantitative tests to evaluate the model properly over a large sample size. In this case, however, since I am doing surface-level testing, I will perform qualitative testing: I judge the model's performance by manually querying it and reflecting on its responses. This allows me to understand how the model works, though I cannot draw any statistically significant conclusions about its performance.
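To keep the qualitative testing organized, I run the questions through a small loop. This is a minimal sketch of my own setup; the file paths are hypothetical placeholders and are not part of the repository:

# Sketch: run (document, question) pairs through DocOwl and print each
# answer for manual review. Paths are hypothetical placeholders.
test_cases = [
    ('./my_docs/receipt_1.png', 'Which company is this receipt from?'),
    ('./my_docs/receipt_2.png', 'What date and time is the receipt from?'),
    ('./my_docs/lecture_note_1.png', 'What are the headlines in the document, marked in bold?'),
    ('./my_docs/lecture_note_2.png', 'How many graphs are present in the document?'),
]
for image, query in test_cases:
    answer = docowl.inference(image, query)
    print(f'Q: {query}\nA: {answer}\n')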

Testing the first, leftmost receipt:

Question: Which company is this receipt from?

DocOwl response: rema 1000

My thoughts: Correct. The model understands that the name written in the largest font size is the company's name and correctly reads the text "rema 1000," which looks vastly different from the other text on the receipt.

Question: What is the most common item bought in the receipt? Give a detailed explanation.

DocOwl response: Most common item: Taffel Naturel The receipt shows that the most common item purchased was "Taffel Naturel," which appears in the list of items and their respective prices. It is listed under the category "TAFFEL" and is the only item that appears twice on the list, indicating that it was purchased by the customer more than once.

My thoughts: Here, I test the model's ability to give a detailed response, and it comes close to the correct answer. The correct answer would be that two items tie for most common, one of which is Taffel Naturell (the model also misspells Naturell as "Naturel").

Testing the second, rightmost receipt:

Question: "What is the total sum of the items on the receipt?"

DocOwl response: 235.10

My thoughts: This is the correct answer. It is impressive, though the model can find the answer simply by locating the total printed on the receipt.

Question: "How many items were bought on the receipt?"

DocOwl response: 8

My thoughts: This is a difficult question because the model must understand what "items" means and then count them. Furthermore, it must exclude all other sections, including the last line with the total. This is seriously impressive! The model is also trained only on English documents and still manages to answer correctly about Norwegian receipts.

Question: What date and time is the receipt from?

DocOwl response: 02.10.23 18:29

My thoughts: Again, precisely correct. Furthermore, I like DocOwl's concise answers that are straight to the point. This contrasts with my experience with, for example, Llama2, which always likes to add a lot of unnecessary text to its responses.

Testing the first, leftmost lecture note:

Question: What are the headlines in the document, marked in bold?

DocOwl response: Semi gradient TD-learning: uses bootstrapping + learns faster -convergence for linear models -with Replay buffer -with target network Naivere online Q-learning: -not stable -temporal difference algorithm (TD error is bootstrapped) newQ = oldQ + a (r gamma) maxValue – ] det first QL larete on pb NTNU Online Q-learning: for ex: semi gradient Q learning Deep Q-networks: -off policy -category of online Q algorithms -stabilizes with RB and target networks -update frequency is crucial for efficiency/stability only use in discrete space, because max over of Q(s) is hard to solve in continuous space [/doc]

My thoughts: In this case, the model has misunderstood the task and added more information than it should: it answers with the headlines plus some of the text under each headline. Furthermore, the several writing errors in the response highlight the issues with OCR-free document understanding, which might have been avoided with an OCR engine.

Question: What is the first successful deep RL algorithm?

DocOwl response: Online Q-learning.

My thoughts: This is an understandable response, since the model reads "successful deep RL algorithm" and expects the answer to be in the text right after those words. Unfortunately, in this case, the correct response would be NFQ, according to the text in the note.

Testing the second, rightmost lecture note

Question: How many graphs are present in the document?

DocOwl response: 4

My thoughts: Unfortunately, this is not correct, since the correct answer is two. However, the document is difficult to interpret, so I do not think this error is critical.

Question: What do the green and blue colors mean?

DocOwl response: respectively, flow, service

My thoughts: This is impressive, considering the model has to understand the colors, read (my ugly) handwriting, and connect the dots between the colors and the handwriting. This shows the model has generalized well to different kinds of tasks.

My thoughts on the model

In some respects, this is the most impressive model I have seen in quite a while, primarily because of how it performs on the receipts. The model struggles on the lecture notes, though the data I query it with there is understandably challenging. Still, the model's ability to understand both simple and complex questions, give concise answers, and adapt to give more detailed answers when asked makes it seriously impressive. The model could be revolutionary in several application areas, like reading receipts or understanding documents. It is flexible and, in most cases, gives precisely the answer you are looking for.

This makes the model suitable for chatting with documents, as it reliably outputs only the necessary information. Because you can rely on the structure of the model's output, you can use that output in other tasks. An example is how the model outputs the date and time when prompted: the format is concise, with only the necessary information, so it is easy to convert the output to a DateTime object (assuming the date format is consistent across receipts).
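As a small illustration, the date-and-time answer from the receipt above can be parsed straight into a datetime object, assuming DocOwl keeps this day.month.year format for other receipts as well:

from datetime import datetime

# DocOwl's response to the date-and-time question on the second receipt
answer = '02.10.23 18:29'
timestamp = datetime.strptime(answer, '%d.%m.%y %H:%M')
print(timestamp)  # 2023-10-02 18:29:00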

If I were to criticize the model, it would only be for its limitation to English. Though it can work in other languages, as you saw with the Norwegian receipts, it struggles with misspellings and gives less accurate responses when the document is in a different language. This is naturally expected, but it is a downside of the model. It could be addressed in the future by making the model multilingual. On GitHub, the authors state that they plan to create a multilingual model covering Chinese and English, but I have not found further information about other languages.

Conclusion

In this article, I have showcased one of the newest publicly available document understanding models. I have discussed the different tasks the model can be used for and how you can run it locally. Furthermore, I performed a qualitative test by querying the model with different prompts and documents. This test highlighted the model's incredible performance on some tasks; it performed exceptionally well on the receipts used to test it. On the more complicated lecture notes, the model struggled more, though this was expected due to the challenging nature of the data. Overall, DocOwl is a great multimodal document understanding model.

