
This article describes a novel Multi-Page Document Classification approach, which leverages advanced machine learning and textual analytics to solve one of the major challenges in the mortgage industry.
Abstract
Even in today's technological era, most business is still done using documents, and the amount of paperwork involved varies from industry to industry. Many of these industries need to go through scanned document images (which usually contain non-selectable text) to retrieve the key index fields needed for their daily operations.
To achieve this, the first major task is to index the different types of documents, which later helps in the extraction of information and meta-data from a variety of complex documents. This blog post demonstrates how advanced machine learning and NLP techniques can be leveraged to solve this major part of the puzzle, formally known as Document Classification.
Update: I have continued on this topic and introduced a multimodal deep learning method which uses not only the text, but also the images. Do check it out after reading this article.
Multimodal Deep Multipage Document Classification using both Image and Text
Introduction
In the mortgage industry, companies perform mortgage loan audits for thousands of people.
Each individual audit is performed on an assortment of documents submitted as a bundle, which is called a Loan Package. A package is a collection of scanned pages, which can vary from roughly 100 to 400 pages. There are multiple sub-components within the package, each of which may consist of roughly 1 to 30 pages. Such sub-components are called Documents or document classes. The following table represents this visually.

Background and Problem Statement
Traditionally, while evaluating loan audits, Document Classification is one of the major parts of the manual effort. Mortgage companies mostly outsource this work to third-party BPO companies, which execute the task using manual or partially automated classification techniques, e.g., rule engines and template matching. The underlying problem with the current implementations is that the Business Process Outsourcing (BPO) staff has to manually find and sort the documents present in the packages.
Although some degree of automation has been achieved by a few third-party companies using keyword searches, regular expressions, etc., the accuracy and robustness of such solutions are questionable, and the reduction in manual workload is still not satisfactory. Keyword searches and regular expressions mean that these solutions need to account for every new document or document variation that is presented, and rules must be added for each one. This in itself becomes a manual effort, and only partial automation is achieved. There always remains a chance that the system identifies a document as "Doc A" when it is in fact "Doc B", because of common rules present in both. Additionally, there is no measure of certainty attached to an identification. More often than not, manual verification is still required.
Since there are several hundred document types, the BPO staff needs a knowledge base of "how a certain document looks, and what are the different variations of the same document?" in order to classify documents. On top of that, when the manual workload is too high, human error tends to increase.
Objective
The document classification solution should significantly reduce the manual human effort. It should achieve a higher level of accuracy and automation with minimal human intervention.
The solution approach which we will be discussing in this series of blogs is not limited to the mortgage industry; it can be applied wherever there are scanned document images and sorting of such documents is required. A few of the possible industries are financial organizations, academia, research institutes, and retail stores.
Characteristics of the documents
In order to build a solution pipeline, the first step is to understand the data and its different characteristics. Since we have been working in the mortgage domain, we will define the characteristics of the data we process in the mortgage industry.
Within a package, there are many types of pages, but generally, these can be categorized into three types:
| Type | Description |
| --- | --- |
| Structured | Consistent forms and templates |
| Unstructured | Textual, no formatting or tables |
| Semi-Structured | Hybrid of the above two; may have partial structure |

In terms of documents, the following characteristics are observed in the data.
- The documents present in the packages are not in a consistent order. For example, in one package document "A" might come after document "B", and in another package it is the other way around.
- There are many variations of the same document class. One document class can have different-looking variations; for example, a document class "A" page template/format might change across different US states. Within the mortgage domain, these represent the same information, but differ in formatting and contents. In other words, if "cat" is a document class, different breeds of cats would be its "variations".
- The document pages carry different kinds of scan deformities, e.g., noise, 2D and 3D rotations, bad scan quality, and page orientation, which degrade the OCR output for those documents.
Solution Methodology
In this section, we will abstractly explain how our solution pipeline works, and how each component or module comes together to produce an end-to-end pipeline. The following flow diagram illustrates the solution.

Since the goal is to identify the documents within the package, we had to identify which characteristics within a document make it different from another one. In our case, we decided that the text present in the document is the key, because intuitively we humans also do it this way. The next challenge was to figure out the location of the document within the package. In the case of multi-page documents, the boundary pages (start, end) have the most significance, because using these pages a range of pages forming a document can be identified.
Machine Learning Classes
In terms of machine learning, we treated this problem as a classification problem, where we decided to identify the first and last pages of each document. We categorized our Machine Learning classes (ML classes) into three types:
- First Page Classes: These classes are the first pages of each document class, and are responsible for identifying the start of a document.
- Last Page Classes: These classes are the last pages of each document class, and are responsible for identifying the end of a document. These classes are created only for the document classes which have samples with more than one page.
- Other Class: This is a single class which contains the middle pages of all the document classes combined. Having this class helps the pipeline in the later stages: it reduces the instances where a middle page of a document is classified as the first or last page of the same document, which intuitively is possible because there can be similarities between all the pages, such as headers, footers and templates. This allows the model to learn more robust features.
The following diagram represents how these different types of ML classes look in terms of packages and documents.

Machine Learning Engine
Once the ML classes are defined, the next step is to prepare the dataset for training the Machine Learning Engine (the data preparation part is discussed in detail in the next sections). The following diagram explains the inner workings of the Machine Learning Engine, and gives a more technical view of the solution pipeline.

Let's describe the different phases of the solution step by step.
Step 1
- The package (which is in PDF format) is split into individual pages (images).
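As a minimal sketch of this step (assuming the pdf2image library with Poppler installed; the article does not name the actual splitting tool, and "package.pdf" is a hypothetical file name):

```python
from pdf2image import convert_from_path  # requires the Poppler utilities

# Split a loan package PDF into one image per page.
pages = convert_from_path("package.pdf", dpi=300)
for i, page in enumerate(pages, start=1):
    page.save(f"page_{i:03d}.png", "PNG")
```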
Step 2
- The individual pages are processed through OCR (Optical Character Recognition), which extracts the text from each image and generates text files. We used a state-of-the-art OCR engine to produce the text in our case. There are many free OCR offerings which can be used in this step.
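For illustration only, here is a hedged sketch using the open-source Tesseract engine via pytesseract; the article does not name the OCR engine we actually used, and any OCR that returns plain text per page would fit:

```python
from PIL import Image
import pytesseract  # requires the Tesseract binary to be installed

# Extract the text of one scanned page image and store it as a text file.
text = pytesseract.image_to_string(Image.open("page_001.png"))
with open("page_001.txt", "w", encoding="utf-8") as f:
    f.write(text)
```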
Step 3
- The text corresponding to each page is then passed to the Machine Learning Engine, where the Text Vectorizer (Doc2Vec) generates its feature vector representation, which essentially is a list of floats.
Step 4
- The feature vectors are then passed to the classifier (Logistic Regression). The classifier predicts the class for each feature vector, which is essentially one of the ML classes we previously discussed (first, last or other).
Additionally, the classifier returns the confidence scores for all the ML classes (the section on the far right of the diagram). For example, let (D1, D2, ...) be the ML classes; then for a single page the results may look like the following.
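Since the original example table is an image, here is a self-contained toy sketch of what step 4 produces with scikit-learn; the 2-D vectors, class names and scores are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in: tiny 2-D "feature vectors" and three ML classes.
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
              [0.2, 0.8], [0.5, 0.5], [0.4, 0.6]])
y = np.array(["D1", "D1", "D1-last", "D1-last", "other", "other"])
clf = LogisticRegression(max_iter=1000).fit(X, y)

page_vector = np.array([[0.85, 0.15]])          # one page's feature vector
probs = clf.predict_proba(page_vector)[0]       # one confidence score per ML class
print(dict(zip(clf.classes_, probs.round(3))))  # e.g. {'D1': 0.5, 'D1-last': 0.2, 'other': 0.3}
print(clf.classes_[probs.argmax()])             # predicted ML class: 'D1'
```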

Post Processing
Once the whole package is processed, we use the results/predictions to identify the boundaries of the documents. The results contain the predicted class and the confidence scores of the predictions for all the pages of the package. See the following table.

The following is the simple algorithm used to identify the document boundaries from the output of the Machine Learning Engine.
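The original algorithm figure is an image; below is a minimal reconstruction of the idea in Python, not our production implementation (the function name, the default threshold and the tuple format are assumptions):

```python
def find_document_boundaries(page_predictions, threshold=0.8):
    """Group per-page predictions into (document_class, start_page, end_page) ranges.

    page_predictions: one (ml_class, confidence) pair per page, in page order,
    where ml_class is e.g. "6853" (first page), "6853-last" (last page),
    or "other" (middle page).
    """
    documents, current_doc, start = [], None, None
    for page_no, (ml_class, confidence) in enumerate(page_predictions, start=1):
        if confidence < threshold:
            continue  # low-confidence pages are left for manual verification
        if ml_class.endswith("-last"):
            # A last-page prediction closes the currently open document.
            if current_doc == ml_class[:-len("-last")]:
                documents.append((current_doc, start, page_no))
                current_doc, start = None, None
        elif ml_class != "other":
            # A first-page prediction opens a new document.
            current_doc, start = ml_class, page_no
    return documents
```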

Solution Details (Deep Dive)
Data Preparation
In pursuit of developing an end-to-end Document Classification pipeline, the very first, and arguably the most important, step is data preparation, because the solution is only as good as the data it uses. The data we used for our experiments consisted of documents from the mortgage domain. The strategies we adopted can be applied to any document dataset in a similar fashion. The following are the steps which were performed.
Definition: A Document Sample is an instance of a particular document. Usually it is a (PDF) file containing only the pages of that document.
Step 1
- The first step is to decide which documents within a package are to be recognized and classified. Ideally, all the documents which are present in packages should be selected. Once the document classes are decided, we move on to the extraction part. In our case, we decided to classify 44 document classes.
Step 2
- To obtain the dataset, we collected PDFs of several hundred packages, and manually extracted the selected documents from those packages. Once a document was identified in a package, the pages of that document were separated and concatenated together in the form of a PDF file. For example, if we found "Doc A" spanning pages 4 to 10 of a package, we would extract those seven pages and merge them into a seven-page PDF. This PDF constitutes a document sample. All the samples extracted for a particular document class were put into a separate folder. The following shows the folder structure. We collected 300+ document samples for each document class. Each document class was given a unique identifier, which we called "DocumentIdentifierID".
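Since the folder-structure figure is an image, the layout might look roughly like the following sketch; the folder and file names are hypothetical, with each top-level folder named after one DocumentIdentifierID:

```
dataset/
├── 6853/                # one folder per DocumentIdentifierID
│   ├── sample_001.pdf   # one multi-page PDF per document sample
│   ├── sample_002.pdf
│   └── ...
├── 1330/
│   ├── sample_001.pdf
│   └── ...
└── ...
```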

Step 3
- The next step is to apply OCR and extract text from all the pages present in the document samples. The OCR iterated over all the folders and generated Excel files containing the extracted text and some meta-data. The following shows the format of the Excel files; each row represents one page.

Dataset Table with sample rows.
Loan Number, File Name: These are unique sample (PDF) identifiers. There are two samples (green, yellow) present in the table.
Document Identifier ID, Document Name: Represent the document class which these samples belong to.
Page Count: The total number of pages present in one particular sample (both samples have 2 pages).
Page Number: The ordered page number of each page within a sample.
IsLastPage: If 1, the page is the last page of that particular sample.
Page Text: The text returned by the OCR for that particular page.
Data Transformations
Once the data is generated in the above format, the next step is to transform it. In the transformation phase, the data is converted/manipulated into the format which is essential for training a machine learning model. The following transformations are applied to the dataset.
Step 1 | Generating ML classes
- The first step of the transformation is to generate the first page, last page, and other page classes. To do this, the Page Number and IsLastPage column values are used. The following shows a conditional representation of the logic used.
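As the original logic figure is an image, here is a hedged pandas sketch of the labeling rule as described above; the column names follow the dataset table from the data preparation step, and the exact implementation may differ:

```python
import pandas as pd

def ml_class(row):
    # Map a page to its ML class using Page Number and IsLastPage.
    doc_id = str(row["Document Identifier ID"])
    if row["Page Number"] == 1:
        return doc_id              # first page class, e.g. "6853"
    if row["IsLastPage"] == 1:
        return doc_id + "-last"    # last page class, e.g. "6853-last"
    return "other"                 # middle pages of all document classes

df = pd.DataFrame({
    "Document Identifier ID": [6853, 6853, 6853],
    "Page Number": [1, 2, 3],
    "IsLastPage": [0, 0, 1],
})
df["ML Class"] = df.apply(ml_class, axis=1)  # ["6853", "other", "6853-last"]
```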

- Moreover, the table below represents the resulting columns. Notice the yellow column, where 6853 represents the first page class and 6853-last represents the last page class, while the middle pages are assigned to the Other class.

Step 2 | Data Split for Training and Testing the Pipeline
- Once step 1 is complete, from that point on we only need two columns, "Page Text" and "ML Class", to build the training pipeline. The other columns are used for testing evaluations.
- The next step is to split the data for training and testing the pipeline. The data is split such that 80% is used for training and 20% is used for testing. The data is also randomly shuffled, but in a stratified fashion for each class. For more information, click the link.
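A minimal sketch of this split with scikit-learn; it assumes df is the full labeled dataframe from the previous step (stratification requires at least two pages per ML class), and the 80/20 ratio follows the text:

```python
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df,                       # dataframe with "Page Text" and "ML Class" columns
    test_size=0.20,           # 20% of the pages held out for testing
    shuffle=True,
    stratify=df["ML Class"],  # keep per-class proportions equal in both splits
    random_state=42,          # illustrative seed for reproducibility
)
```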
Step 3 | Data cleaning and transformation
- The "Page Text" column which contains the OCR text for each page is cleaned, this process is applied on train and test both. Following are the processes which are performed.
- Case correction: All the text is converted to UPPER or lower case.
- Regex for non-alphanumeric characters: All the characters which are not alphanumeric are removed.
- Word Tokenization: All the words are tokenized, which means the one Page Text string becomes list of words
- Stopwords Removal: Stopwords are the words which are too common in the English language and might not be helpful in classifying the individual documents. For example words like "the", "is", "a". These words can also be domain specific. it can be used to remove redundant words, which are common in many different documents. i.e. in terms of finance or mortgage, the word "price" can occur in many documents.
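Here is a hedged sketch of these four steps using NLTK; extending the stopword list with "price" only illustrates the domain-specific idea and is not our actual list:

```python
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# one-time setup: nltk.download("punkt"); nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english")) | {"price"}  # domain-specific addition

def clean_page_text(text):
    text = text.lower()                               # case correction
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # drop non-alphanumeric characters
    tokens = word_tokenize(text)                      # word tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(clean_page_text("The price of the Loan is $350,000."))
# -> ['loan', '350', '000']
```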
The following tables show the text before and after these transformations.

Training Pipeline
In the previous Machine Learning Engine section, we abstractly discussed the inner workings of the Machine Learning Engine. The two main components were:
- Text Vectorizer: In our case, we have used Doc2Vec.
- Classifier Model: A Logistic Regressor is used for classification.
Text Vectorizer (Doc2Vec)
Since the beginning of Natural Language Processing (NLP), there has been a need to transform text into something a machine can understand: that is, transforming textual information into a meaningful representation, usually known as a vector (or array) of numbers. The research community has developed different methods to perform this task. In our research and development we tried different techniques and found Doc2Vec to be the best among them.
Doc2Vec is based on the Word2Vec model, which is a predictive vector space model. To understand Word2Vec, let us begin with Vector Space Models.
Vector Space Models (VSMs): Embed words into a continuous vector space where semantically similar words are mapped to nearby points.
Two Approaches for VSM:
- Count-Based Methods: Compute statistics of how often a word co-occurs with its neighboring words in a large text corpus, and then map these count statistics down to a small, dense vector for each word (e.g., TF-IDF).
- Predictive Methods: Predict a word from its neighbors in terms of learned small, dense embedding vectors (e.g., Skip-Gram, CBOW). Word2Vec and Doc2Vec belong to this category of models.
Word2Vec Model
It is a computationally efficient predictive model for learning word embeddings from raw text. A Word2Vec model can be created using either of the following two architectures:
- Skip-Gram: Creates a sliding window around the current word (the target word), then uses the current word to predict all surrounding words (the context words) (e.g., predicts 'the cat sits on the' from 'mat').
- Continuous Bag-of-Words (CBOW): Creates a sliding window around the current word (the target word), then predicts the current word from the surrounding words (the context words) (e.g., predicts 'mat' from 'the cat sits on the').
For more details, read this article; it explains the different aspects in detail.
Doc2Vec Model
This text vectorization technique was introduced in the scientific research paper Distributed Representations of Sentences and Documents. Further technical details can be found here.
Definition | It is an unsupervised algorithm that learns fixed-length feature vector representations from variable-length pieces of text. These vectors can then be used in any machine learning classifier to predict class labels.
It is similar to the Word2Vec model, except that it uses all the words in each text file to create a unique column in a matrix (called the Paragraph Matrix). Then a single-layer neural network, like the one seen in the Skip-Gram model, is trained, where the inputs are all the surrounding words of the current word along with the current paragraph column, and the task is to predict the current word. The rest is the same as in the Skip-Gram or CBOW models.
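A minimal sketch of training and using Doc2Vec with gensim; the two toy pages and hyperparameters such as vector_size and epochs are illustrative assumptions, not the settings used in our pipeline:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each page's cleaned token list becomes one TaggedDocument ("paragraph").
corpus = [
    TaggedDocument(words=["closing", "disclosure", "loan", "amount"], tags=["0"]),
    TaggedDocument(words=["deed", "trust", "borrower", "lender"], tags=["1"]),
]

model = Doc2Vec(vector_size=100, window=5, min_count=1, epochs=40, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a fixed-length feature vector for a new, unseen page.
vector = model.infer_vector(["note", "borrower", "interest", "rate"])
print(vector.shape)  # (100,)
```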

The advantages of the Doc2Vec model:
- On a sentiment analysis task, Doc2Vec achieves new state-of-the-art results, better than complex methods, yielding a relative improvement of more than 16% in terms of error rate.
- On a text classification task, Doc2Vec convincingly beats bag-of-words models, giving a relative improvement of about 30%.
Classifier Model (Logistic Regressor)
Once the text is converted to a vector format, it is ready for a machine learning classifier to learn the patterns present in the vectors of different document types and identify the correct distinctions. Since there are many classification techniques which can be used here, we tried the best of the bunch and evaluated their results: Random Forest, SVM, Multi-Layer Perceptron and Logistic Regressor. Many different parameters were tried for each classifier to obtain the optimal results. The Logistic Regressor was found to be the best among all of these models.
Training Procedure
- Once the data is transformed, we first train the Doc2Vec model on the training split (as discussed in the data transformation section).
- After the Doc2Vec model is trained, the training data is passed through it again, but this time the model is not trained; rather, we infer the vectors for the training samples. The last step is to pass these vectors and the actual ML class labels to the classification model (Logistic Regressor).
- Once the models are trained on the training data, both models are saved to disk, so that they can be loaded into memory for testing and, ultimately, production deployment. The following diagram shows the basic flow of this collaborative scheme.
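Putting the pieces together, a hedged end-to-end training sketch; it assumes doc2vec_model is the trained gensim model from the earlier sketch, train_df is the training split with tokenized "Page Text" and "ML Class" columns, and the file names are hypothetical:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# 1) Infer a Doc2Vec vector for every training page (the Doc2Vec weights stay frozen).
X_train = np.array([doc2vec_model.infer_vector(tokens)
                    for tokens in train_df["Page Text"]])
y_train = train_df["ML Class"].to_numpy()

# 2) Fit the classifier on the inferred vectors and their ML class labels.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)

# 3) Persist both models to disk for testing and production deployment.
doc2vec_model.save("doc2vec.model")
joblib.dump(classifier, "classifier.joblib")
```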

Testing & Evaluation Pipeline
Once the pipeline is trained (which includes both the Doc2Vec model and the classifier), the following flow diagram shows how it is used to predict the document classes for the testing data split.

The transformed testing data is passed through the trained Doc2Vec model, where the vector representations of all the pages present in the testing data are inferred. These vectors are then classified by the classifier, which returns the predicted class and the confidence scores for all the ML classes.
For the detailed evaluation of the Machine Learning Engine, we generate an Excel file from the results. The following table shows the columns and the information generated in the testing phase.

Page Text, File Name, Page Number: These are the same columns we had in the data preparation stage; they are taken as-is from the source dataset.
ground, pred: ground shows the actual ML class of the page, while pred shows the ML class predicted by the ML engine.
Trained classes columns: Columns in this section represent the ML classes on which the model was trained, with the confidence scores for those classes.
MaxProb, Range: MaxProb shows the maximum confidence score achieved by any of the columns in the Trained classes section (see the red colored text). Range shows the range in which the MaxProb falls.
Currently there are three levels of results evaluation.
- Cumulative Error Evaluation Metric
- Confusion Matrix
- Class level confidence scores analysis
Cumulative Error Evaluation Metric
This evaluation calculates two metrics, Accuracy and F1-Score; for more details, check this blog. These provide an abstract insight into the goodness of the pipeline. The scores lie between 0 and 100, where a higher number means the pipeline is better at classifying the documents. In our experiments, we obtained the following accuracy and F1-score.

Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
Essentially, it makes it easier to understand:
- Which classes are not performing well?
- What is the accuracy score of an individual class?
- Which classes are confused with each other?
The plot below represents the confusion matrix we generated after our testing. It is an embedded link, so click it to view the confusion matrix.
Values on both the X-axis (true labels) and the Y-axis (predicted labels) represent the document classes which we trained on. The numbers within the cells show the percentage of the testing dataset belonging to the class on the left and bottom.
The values on the diagonal represent the percentage of data where the predicted classes were correct; a higher percentage is better, e.g., 0.99 means 99% of the testing data for that particular class was predicted correctly. All the other cells show wrong predictions, and each percentage shows how often a certain class was confused with another class.
As can be seen, the model is able to correctly classify most of the ML classes with more than 90% accuracy.
Class level confidence scores analysis
Although the confusion matrix gives details about class confusions, it does not represent the confidence scores of the predictions; in other words:
- "How confident is the model when making a prediction about a document class?"
What is the need?
In the ideal situation, the model should have high confidence when predicting the correct ML class, and low confidence when predicting a wrong ML class. But this is not strict behavior; it depends on many factors, e.g., the performance of a particular class, actual domain similarities between document classes, etc. To evaluate whether this behavior exists and whether confidence scores can be a useful indicator of true predictions, we devised an additional evaluation approach.
Approach
Since the task is to reduce the manual work, it was decided that only the predictions with high confidence would be accepted. This way, wrong predictions mostly do not get through (because they rarely have high confidence). The rest of the documents and pages are verified manually by the BPO staff.
Threshold
In this step, the confidence scores of the classes are calculated and a threshold is defined. The threshold is a percentage, e.g. 80% or 75%, which is decided based on the following condition:
- What is the confidence score value at which wrong predictions are insignificant in number while true predictions remain high? In other words, it is about finding the sweet spot.
The following line plot shows the true positives (blue line) and false positives (red line).

The X-axis shows the ML classes, and the Y-axis shows the percentage of the testing data for a particular class which is covered by true positives or false positives.
For example, in the case of the ML class 1330, true predictions cover almost 70% of the whole testing dataset for that class, which means the ML engine was able to predict 70% of the data correctly with a confidence score greater than 90%. Moreover, the false positives covered only 1% of the testing dataset, which means only 1% of the test data was predicted wrongly with a confidence score higher than 90%.
Because of the threshold, we sometimes lose true positives (when the confidence score is below the threshold), but that is not as bad as false positives with high confidence. Such pages/documents are simply verified manually.
The previous plot is made with a threshold of 90% and above. In the following plot, the threshold is 80% and above. Notice that even when the threshold is dropped to 80%, the false positives do not increase, while the true positives increase significantly. This means that, between the 90% and 80% thresholds, 80% is optimal.

While doing this analysis, all the levels are checked, e.g. 50%, 60%, 70%. The most optimal threshold is chosen using this evaluation metric.
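A hedged sketch of this analysis; the function and variable names are our assumptions, and it measures, for each class and threshold, what fraction of that class's test pages are covered by confident correct (TP) vs. confident wrong (FP) predictions:

```python
import numpy as np

def coverage_at_threshold(y_true, y_pred, max_prob, threshold=0.80):
    # For each class: fraction of its test pages predicted with
    # confidence >= threshold, split into correct (TP) and wrong (FP).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    max_prob = np.asarray(max_prob)
    results = {}
    for cls in np.unique(y_true):
        pages = y_true == cls                        # all test pages of this class
        confident = pages & (max_prob >= threshold)  # pages passing the threshold
        n = pages.sum()
        tp = (confident & (y_pred == cls)).sum() / n
        fp = (confident & (y_pred != cls)).sum() / n
        results[cls] = (tp, fp)
    return results

# Sweep candidate thresholds, as in the 80% vs. 90% comparison above.
y_true   = ["1330", "1330", "1330", "6853"]  # toy ground-truth classes
y_pred   = ["1330", "1330", "6853", "6853"]  # toy predictions
max_prob = [0.95, 0.85, 0.92, 0.99]          # toy MaxProb values
for t in (0.80, 0.90):
    print(t, coverage_at_threshold(y_true, y_pred, max_prob, t))
```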
Solution Features
- Fast Predictions | The classification time for one page is under ~300 ms; if we include the OCR time, one page can be classified well under 1 second. Moreover, this time can be reduced further if multi-processing is adopted.
- High Accuracy | The current solution pipeline is able to identify and classify documents with high accuracy and high confidence. For most of the classes we get more than 95% accuracy.
- Labeled Data Requirements | Within our experiments, we have observed that the pipeline can work well with roughly 300 samples per document class (as in the experiment we discussed in these blogs), though this depends on the variations and type of the document class. Moreover, we see accuracy and confidence scores increase with larger sample counts.
- Confidence Score Threshold | The pipeline provides prediction confidence scores, which enables a tuning approach and allows trading off between true positives and false positives.
- Multi-Processing | The Doc2Vec implementation allows for multi-processing; moreover, our data transformation scripts are highly parallelized.
Conclusion
Machine learning and Natural Language Processing have been doing wonders in many fields, and we saw first-hand how they helped reduce the manual effort and automate the task of Document Classification. The solution is not only fast, but also very accurate.
Because of the sensitive nature of the data used in this process, the code base is not available. I will rework the codebase on some dummy data, which will allow me to upload it to my GitHub. Please follow me on GitHub for further updates. Also check out some of my other projects 😉