
Do You See What I See?

Applying Unsupervised Learning to Document Layout

Image by Gordon Johnson from Pixabay

Perception was the first unsupervised algorithm.

Take vision for example. Have you ever noticed how our brains automatically group things that are close in visual space? Notice how in the image below, we tend to see two separate groups of 6 dots rather than just an image of 12 total dots:

Image by author

And this natural tendency to group isn’t unique to vision. Indeed, Morse code works because auditory perception is organized around similar principles.

In data science, unsupervised learning refers to mathematical models that attempt to discover naturally occurring patterns in data. One class of unsupervised learning is cluster analysis. Clustering algorithms work like our perceptual system in that distance helps determine natural groupings in the data.

But different clustering algorithms define and use distance in different ways to make decisions regarding which data belongs to which group. As a result, each algorithm may perform differently on the same clustering tasks.

In short, just because our eyes see a clear cluster solution doesn’t mean our algorithms will come to the same conclusion.

One area where our eyes may see patterns that clustering algorithms struggle with is documents. In the remainder of this short demonstration, I experiment with two different clustering algorithms (KMeans & DBSCAN) and test their ability to identify columnar structure in the words on an image. Herein I ask these clustering algorithms: do you "see" what I see? Let’s find out.

Document Understanding

When we look at documents it is easy to experience the power of our perceptual systems at work. Resumes, forms, legal documents, and term papers all have organization to how the words are laid out on a page. In some instances, those forms may include physical demarcations of different sections using lines and grids. In other instances, merely the organization of the words is enough for our eyes to know which words go with which sections.

For machines, however, understanding the layout of these documents is no easy task. Indeed, researchers and companies have spent millions on sophisticated deep learning approaches to automatically identify different sections in documents.

Unfortunately, these deep learning approaches have some limitations, including their compute costs, the need for manually labeled data, and the large sample sizes they require. To date, the emphasis has been on using supervised learning with deep neural net architectures to better understand document layout.

All this despite the fact that our eyes often perceive very clear sections with almost no effort. Upon this realization, I decided to test the ability of unsupervised clustering algorithms to detect basic document organization.

The Experiment

To do this, I selected two example documents, one with very clear columnar organization and a second with slightly more complex columnar organization. The first is an example resume and the second an example credit application form:

Image by author

In both examples, you can see the different columnar organization of each document. Two clustering algorithms, KMeans and DBSCAN, are used to test their ability to accurately detect this column structure.

The KMeans algorithm works by randomly initializing k center points (centroids) in n-dimensional space. The number of centroids is set by the person performing the cluster analysis before running the model. Each point is then assigned to a group based on its distance to its closest centroid. Once all points are grouped, the mean of each group is computed to update its centroid. The process iterates until the centroids no longer change significantly.
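For example, here is a minimal KMeans sketch in scikit-learn using made-up 2D points rather than the document data; the two groups loosely mirror the 12-dot figure above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two groups of 6 points each, loosely mirroring the 12-dot figure
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, (6, 2)),
                    rng.normal(1.0, 0.1, (6, 2))])

# n_clusters (k) must be chosen before fitting
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # group assignment for each point
print(km.cluster_centers_)  # final centroids once the iteration converges
```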

DBSCAN (density-based spatial clustering of applications with noise), on the other hand, works by starting with a point in n-dimensional space and grouping all points that lie within a certain distance of that starting point to form clusters. That distance and the minimum number of points per cluster are the only two hyperparameters DBSCAN requires. Thus, DBSCAN does not share KMeans’ limitation of needing to know a priori how many clusters might exist. That said, the distance, or epsilon as it is called, often requires tuning.
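Here is a comparable DBSCAN sketch on the same kind of toy points (the eps and min_samples values are purely illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, (6, 2)),
                    rng.normal(1.0, 0.1, (6, 2))])

# eps (the distance threshold) and min_samples are the only two
# hyperparameters; no cluster count is specified up front
db = DBSCAN(eps=0.3, min_samples=3).fit(points)
print(db.labels_)  # -1 marks points DBSCAN treats as noise
```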

To implement the test, I am using the following environment:

- OS: Windows 10
- Python 3.7.3
- Scikit-Learn 1.0.1
- OpenCV 4.5.1
- Pandas 0.25.0
- Matplotlib 3.3.2
- Tesseract 5.0.0 (via Pytesseract)

Code

To start, I provide the path to the images and use the built-in Python library glob to list all .jpg images. I then use OpenCV to read in each image and pass it to Pytesseract’s "image_to_data" method, which returns a tab-delimited string. That string can be parsed into a dataframe by the Pandas "read_csv" method with the help of the Python StringIO library.

Once the dataframe is created, I then drop the null values and add "right" and "bottom" columns to identify the right and bottom pixel locations of each word found. The final step before clustering is to normalize the "left/right" columns that will be used in clustering and put them in an array. All of this is captured in a list of tuples I have titled data.

The data list includes the name of the original image, the image array, the data frame from tesseract, and the normalized left/right column array.
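A minimal sketch of this preprocessing step might look like the following; the images/*.jpg path and the choice to normalize by image width are my assumptions for illustration, and the original code may differ in its details:

```python
import glob
from io import StringIO

import cv2
import pandas as pd
import pytesseract

data = []
for path in glob.glob("images/*.jpg"):  # hypothetical image folder
    image = cv2.imread(path)

    # image_to_data returns a tab-delimited string by default
    tsv = pytesseract.image_to_data(image)
    # quoting=3 (QUOTE_NONE) avoids choking on stray quotes in the OCR text
    df = pd.read_csv(StringIO(tsv), sep="\t", quoting=3)

    # Drop rows where tesseract found no text
    df = df.dropna(subset=["text"])

    # Right and bottom pixel locations of each word
    df["right"] = df["left"] + df["width"]
    df["bottom"] = df["top"] + df["height"]

    # Normalize the left/right columns and stack them into an array
    X = df[["left", "right"]].to_numpy(dtype=float) / image.shape[1]

    data.append((path, image, df, X))
```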

Once the data required have been captured, I then move to test both DBSCAN and KMeans. In the code below, we manually choose the image we want to generate output for by setting a value for the variable "an", and then run the code that follows. Here is an example:
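(What follows is a condensed sketch of that step; the epsilon, min_samples, and cluster-count values are illustrative rather than the exact settings from the original run.)

```python
from sklearn.cluster import DBSCAN, KMeans

an = 0  # index of the image in the data list to analyze
name, image, df, X = data[an]

# DBSCAN: only epsilon and the minimum points per cluster are specified
db_labels = DBSCAN(eps=0.05, min_samples=10).fit_predict(X)

# KMeans: the number of clusters must be chosen up front
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
```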

To further explain the code above, we extract the following information from each tuple in the data list:

- The left/right array used for clustering (X)
- The dataframe with the tesseract output (df)

We then pass the clustering array (X) to either the DBSCAN or KMeans model, generate and save scatter plots of the results, join the dataframe to the cluster labels provided by the model, find the bounding boxes for the clusters, and add those to the image before saving. Note that the code to run the models and generate the images was largely forked from here.
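Continuing from the sketch above, the label-joining and bounding-box steps might look roughly like this, here using the DBSCAN labels (the box color and output filename are arbitrary choices):

```python
from pathlib import Path

import cv2

# Join the cluster labels from the model back onto the tesseract dataframe
df = df.assign(cluster=db_labels)

# Bounding box for each cluster: min left/top and max right/bottom
boxes = df.groupby("cluster").agg(
    left=("left", "min"), top=("top", "min"),
    right=("right", "max"), bottom=("bottom", "max"),
)

annotated = image.copy()
for label, b in boxes.iterrows():
    if label == -1:
        continue  # skip points DBSCAN flagged as noise
    cv2.rectangle(annotated, (int(b.left), int(b.top)),
                  (int(b.right), int(b.bottom)), (0, 255, 0), 2)

cv2.imwrite(f"clusters_{Path(name).name}", annotated)
```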

The Results

In both cases, we see that the two algorithms performed similarly, with some small but important variations in the results. For the 3-column image (the credit application), both solutions found similar clusters but with slightly different distributions.

Image by author

The biggest difference, however, was with the 2-column image (the resume), where KMeans lumped some of the data in the second column in with the first.

Image by author

Upon evaluation of the bounding boxes for each solution, we see that KMeans performed slightly better with the 3-column image, whereas DBSCAN performed much better with the 2-column image.

Image by author

Conclusion

The biggest limitation of KMeans is that the model has several hyperparameters we can experiment with when adjusting a solution, whereas DBSCAN has only two (epsilon and the minimum number of points in a cluster). DBSCAN is therefore a bit easier to tweak when searching for a solution that fits the 3-column image more effectively.

In both cases, we see how unsupervised clustering algorithms can be leveraged to aid in identifying simple columnar document structure. Such analyses can be very useful when trying to align the output of OCR engines so that data are properly matched with their column labels.

In short, our unsupervised approach to document layout shows how clustering algorithms can sort of "see" what our human eyes see, but they may also require a bit of tuning to get right.

Like engaging to learn about Data Science, career growth, or poor business decisions? Join me.

