
As a Data Scientist, you know how hard it is to make your results shine.
In this article, you will follow the journey I took in my first job, the ideas I came up with, and the trials and errors I went through before finally managing to build an intuitive way to visualize and interact with thousands of documents efficiently. At the end, you will see how easily you can now reproduce these visualizations and make them your own.
Table of Contents
– The journey of a first job
– Read Heteroclite Documents
– Preprocessing
– The Power of Embeddings
– Add Granularity
– A New Vision
– Choosing the Right Clustering Algorithm
– Precision VS Information Dilemma
– Continuous VS Discrete Zoom Level
– Ward Linkage
– A Map of the Documents' World
– Conclusion
– Perspective
– References
This is the story of how Doc2Map was created and how it works – a quick tutorial is available here if you just want to use Doc2Map.
The journey of a first job
As a data science student, finding your first internship without any experience can be very complicated. But sometimes luck helps, and that is precisely what happened when my CV caught the attention of the Conseil d’État – the French supreme court for administrative justice.
The Conseil d’État challenged me to research and develop a tool to visualize and explore all the legal complaints it receives (about 250,000 per year), in order to find similar complaints that could be processed together and improve the efficiency of the institution.

Because the documents are written in French and confidential – some cases even involve Google – they will be replaced here by a toy Wikipedia dataset from Kaggle.
Read Heteroclite Documents
The first problem was that the documents were not a uniform type of PDF file (some were plain text embedded in PDF files, others only image scans of paper documents), so they had to be processed differently, according to this diagram:

As you can see, the key criterion for deciding whether a PDF document contains plain text or scanned images is a threshold on the average number of lines per page of the document.
According to my tests, the best choice for extracting text and images from PDFs is the Xpdf command-line tools (pdftotext and pdfimages) – sadly, there is no Python wrapper for them…
For OCR on images, Tesseract remains the best choice available.
Note: Apache Tika may be a good alternative for a wider range of document formats, but as it is a Java application, it is rather slow, so it was not the best fit for this specific use case. However, the public version of Doc2Map will use Tika in order to work with any type of document.
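Here is a minimal sketch of this routing logic, assuming Xpdf’s pdftotext, pdfimages, and pdfinfo plus Tesseract are installed and on the PATH; the threshold value and the French language code are illustrative, not the exact settings used in the original project.

```python
import subprocess
from pathlib import Path

MIN_LINES_PER_PAGE = 10  # illustrative threshold: below this, assume the PDF is a scan

def pdf_page_count(pdf_path):
    """Read the page count from Xpdf's pdfinfo output."""
    info = subprocess.run(["pdfinfo", str(pdf_path)],
                          capture_output=True, text=True, check=True).stdout
    for line in info.splitlines():
        if line.startswith("Pages:"):
            return int(line.split()[-1])
    raise ValueError("Could not read page count")

def extract_text(pdf_path, workdir):
    """Route a PDF to plain-text extraction or OCR, based on lines per page."""
    txt_path = Path(workdir) / (Path(pdf_path).stem + ".txt")
    subprocess.run(["pdftotext", str(pdf_path), str(txt_path)], check=True)
    text = txt_path.read_text(errors="ignore")
    lines_per_page = len(text.splitlines()) / max(pdf_page_count(pdf_path), 1)

    if lines_per_page >= MIN_LINES_PER_PAGE:
        return text  # enough embedded text: treat it as a plain-text PDF

    # Otherwise treat it as a scan: dump the embedded images and OCR them.
    img_root = Path(workdir) / Path(pdf_path).stem
    subprocess.run(["pdfimages", "-png", str(pdf_path), str(img_root)], check=True)
    ocr_text = []
    for img in sorted(Path(workdir).glob(Path(pdf_path).stem + "-*.png")):
        out = subprocess.run(["tesseract", str(img), "stdout", "-l", "fra"],
                             capture_output=True, text=True, check=True)
        ocr_text.append(out.stdout)
    return "\n".join(ocr_text)
```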
Preprocessing
Strangely enough, classical NLP libraries are not always the best choice for simple tasks. To remove stopwords, you may prefer the stopwords-iso lists, and for lemmatization, just use the lemmatization-lists.
Why not use spaCy? Because by default it loads a lot of components that are useless here and slow down the process. The other reason is that its lemmatizer does not seem to work well at reducing nouns to their singular root form.
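As an illustration, here is a minimal preprocessing sketch assuming you have downloaded a stopword list from stopwords-iso (one word per line) and a lemmatization list (tab-separated "lemma form" pairs); the file names are placeholders.

```python
import re

# Assumed input files (placeholders): a stopword list and a lemmatization list.
with open("fr_stopwords.txt", encoding="utf-8") as f:
    STOPWORDS = {w.strip().lower() for w in f if w.strip()}

LEMMAS = {}
with open("lemmatization-fr.txt", encoding="utf-8") as f:
    for line in f:
        lemma, form = line.rstrip("\n").split("\t")
        LEMMAS[form.lower()] = lemma.lower()

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and map each form to its lemma."""
    tokens = re.findall(r"[a-zà-ÿ]+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]

print(preprocess("Les requêtes sont examinées par les juges"))
```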
The Power of Embeddings
When wondering how to visualize a big corpus of documents, one of the first ideas you may have is to use a document embedding method (Doc2Vec⁴) to transform documents into 300-dimensional vectors, then apply a dimensionality reduction technique (e.g. UMAP²) to project those vectors into a 2D space.
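A minimal sketch of that pipeline with gensim and umap-learn might look like this; the randomly generated toy corpus below only stands in for the real preprocessed documents.

```python
import random
import numpy as np
import umap
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus: in practice, use the token lists from the preprocessing step.
vocab = ["law", "court", "appeal", "judge", "history", "france", "king", "tax", "visa", "school"]
docs = [[random.choice(vocab) for _ in range(30)] for _ in range(200)]

tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(docs)]
model = Doc2Vec(tagged, vector_size=300, min_count=1, epochs=40)

# Stack the learned 300-dimensional document vectors and project them to 2D.
doc_vectors = np.vstack([model.dv[i] for i in range(len(docs))])
xy = umap.UMAP(n_components=2, metric="cosine").fit_transform(doc_vectors)
print(xy.shape)  # (n_documents, 2)
```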

With a good library and some tweaks, you can display a beautiful and interactive visualization and even bind events to open the file or the URL when a point is clicked:
Unfortunately, it was not that helpful. Yes, you get a global view, but you lack granularity: there is no intermediate level of information to help you understand the overall organisation of the documents.
Imagine yourself reading a map of an unknown world where the names of countries, states, and cities have been erased, and you can only see the individual houses (in our case, the documents).
So, we now have two problems to address:
- How do we add granularity to our map?
- How do we find meaningful names to add as intermediate-level information?
Add Granularity
After much searching, the best tool turned out to be the MarkerCluster plugin for the Leaflet JavaScript library. It has two main advantages: it behaves like Google Maps, which makes it intuitive, and it is probably the most beautiful and efficient clustering library available:
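Doc2Map ships its own Leaflet page, but if you just want to get a feel for MarkerCluster from Python, the folium wrapper exposes the same plugin; the coordinates and labels below are placeholders.

```python
import folium
from folium.plugins import MarkerCluster

# Build a Leaflet map and a MarkerCluster layer, then add a few placeholder markers.
m = folium.Map(location=[48.85, 2.35], zoom_start=5)
cluster = MarkerCluster().add_to(m)

fake_points = [(48.85, 2.35, "doc_001.pdf"), (45.76, 4.84, "doc_002.pdf"),
               (43.30, 5.37, "doc_003.pdf")]
for lat, lon, label in fake_points:
    folium.Marker(location=[lat, lon], popup=label).add_to(cluster)

m.save("clusters.html")  # open in a browser: markers group and split as you zoom
```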
A New Vision
Inventing names for countries, states, and cities is not an easy task. How can we create them? And how can we group our documents into clusters?
Short answer: Top2Vec³, an all-new topic modeling method published in 2020. Top2Vec cleverly combines solid methods according to this scheme: Word2Vec¹ + Doc2Vec⁴ ⇒ UMAP² ⇒ HDBSCAN⁵.
Let’s see how it works:
- Word2Vec and Doc2Vec project documents and words into the same 300-dimensional semantic space.
- UMAP projects these vectors into a 2-dimensional space.

- HDBSCAN creates clusters by finding densely packed areas and computing their centroids.
- Each centroid defines the position of a topic, and its nearest word vectors define the theme of that topic.

To summarize, Top2Vec finds the positions of the document clusters on the map and labels them with the nearest words on the map.

Note: There is a small detail I glossed over for an easier illustration. In truth, once the document clusters are created, we go back to the 300-dimensional space and only there compute the centroids and find the nearest words; the more dimensions, the more information is preserved.
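In practice, running Top2Vec takes only a few lines; the sketch below uses the 20 newsgroups corpus as a stand-in for the Wikipedia toy dataset, and the parameter values are just examples.

```python
from sklearn.datasets import fetch_20newsgroups
from top2vec import Top2Vec

# Any reasonably large corpus of raw strings will do; Top2Vec handles
# tokenisation, embedding, UMAP, and HDBSCAN internally.
raw_docs = fetch_20newsgroups(subset="train",
                              remove=("headers", "footers", "quotes")).data

model = Top2Vec(documents=raw_docs, speed="learn", workers=4)

print("topics found:", model.get_num_topics())
topic_words, word_scores, topic_nums = model.get_topics(5)
for words, num in zip(topic_words, topic_nums):
    print(num, words[:5])  # the words nearest each topic centroid
```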
Choosing the Right Clustering Algorithm
HDBSCAN is a great clustering algorithm and does its job well. However, for our purpose it has two closely related problems:
- HDBSCAN labels areas that are not densely packed enough as noise and ignores them entirely (illustrated by the short snippet below):

- HDBSCAN uses single-linkage trees, which tends to produce unbalanced trees. The core idea of HDBSCAN is at odds with the idea of dividing the map into clusters and subclusters, as its documentation explains:
For instance, HDBSCAN will tend to produce trees like this one:

But for our map, we would prefer a more informative tree like this one, with more subclusters:
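The noise behaviour is easy to reproduce on synthetic data: points in sparse regions get the label -1 and would simply disappear from the map.

```python
import numpy as np
import hdbscan

rng = np.random.default_rng(0)
dense = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(200, 2))   # one tight blob
sparse = rng.uniform(low=-2.0, high=2.0, size=(30, 2))          # scattered points
points = np.vstack([dense, sparse])

# HDBSCAN assigns -1 to points it considers noise.
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(points)
print("noise points:", int((labels == -1).sum()), "out of", len(points))
```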

Precision VS Information Dilemma
There is a difficult trade-off to find between precision and information. Precision means fitting the most densely packed zones as tightly as possible; information means adding more clusters.
We have to build a special tree (called a dendrogram) whose y-axis shows the different zoom levels available to the user. With the following formula, we can assign a zoom level to each cluster we found, according to its area:
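As a rough sketch of the idea behind such a formula: each Leaflet zoom step doubles the scale along both axes, so the visible area shrinks by a factor of four per level, and the zoom level assigned to a cluster grows with the logarithm of the area ratio. The exact expression and constants used by Doc2Map may differ from this illustration.

```python
import math

def zoom_level(cluster_area, total_area):
    """Illustrative only: zoom at which a cluster of `cluster_area` roughly
    fills a view that shows `total_area` at zoom 0 (area / 4 per zoom step)."""
    return 0.5 * math.log2(total_area / cluster_area)

print(zoom_level(cluster_area=1.0, total_area=64.0))  # -> 3.0 (a real, non-integer value in general)
```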

Continuous VS Discrete Zoom Level
You may have already guessed the problem: we need discrete integer zoom levels, whereas the formula gives us a real number. What is more, between two zoom levels there may be no cluster at all, or more than one:

In addition, for a good visualization, a cluster should not be displayed if its area is bigger than the area visible on the screen – it would be meaningless to show the user a single cluster filling the whole view. If we replace the area metric with the zoom-level metric, this means a cluster should not be displayed at a zoom level higher than the one calculated with the formula. (Intuitively, on the graph, clusters can only move up.)
Properly addressing this problem, with all its constraints, might require designing a brand-new clustering algorithm to find the best trade-off, but that would be a long and difficult task.
Ward Linkage
To keep it simple, the algorithm that best fits our criteria is Ward-linkage hierarchical clustering, which produces a nicely balanced tree, as you will see in the next illustration.
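With SciPy, Ward-linkage clustering on the 2D UMAP coordinates takes only a few lines; the random points below stand in for the real projected documents.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

xy = np.random.rand(100, 2)  # placeholder for the real UMAP coordinates

# Build the full Ward dendrogram; each row of Z merges two clusters.
Z = linkage(xy, method="ward")

# Cutting the tree at different heights gives the nested cluster levels
# that will later be snapped to integer zoom levels.
coarse = fcluster(Z, t=4, criterion="maxclust")   # e.g. 4 top-level clusters
fine = fcluster(Z, t=16, criterion="maxclust")    # e.g. 16 subclusters
print(len(set(coarse)), len(set(fine)))
```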
Let’s see what it looks like to apply that method to our Wikipedia toy dataset, with red dots for the nodes and blue dots for the leaves (clickable):
As you can see, the tree is fairly well balanced, but it takes no account of zoom levels. So we have to force the clusters to sit only at integer zoom levels. Here again, there are many ways to choose which clusters to keep or delete. The tree below was obtained by empirically testing different options, so I will skip the details and go straight to the result:
We finally get a well-balanced tree ordered according to the zoom levels, so we can build a plot where the clusters change according to the zoom level chosen by the user:
In the next step, we will just have to link the zoom level of the cluster to the actual zoom level of the map, to create our final visualization.
A Map of the Documents' World
With this pruned hierarchical tree, you can obtain a beautiful visualization that easily summarizes your corpus of documents, et voilà:
To create this version, I had to write my own JavaScript plugin to display pre-clustered data: https://github.com/louisgeisler/Leaflet.MarkerPreCluster
Conclusion
Doc2Map will make your work shine, thanks to a new way of visualizing topic models that adds the granularity traditional techniques lack and lets you display beautiful interactive views that people can explore and understand.
It builds on solid, proven technologies such as Doc2Vec, UMAP, and HDBSCAN.
The library Doc2Map is freely available on GitHub: https://github.com/louisgeisler/Doc2Map
And a tutorial on how to use Doc2Map is available here: https://medium.com/@louisgeisler3/doc2map-the-best-topic-visualization-library-tutorial-82b603d1d357
Perspective
Machine learning seems to be everywhere on the web today, but it has yet to arrive locally on our machines. That is why I strongly believe that in the near future, operating systems will directly include NLP and machine-vision tools as a way of displaying and exploring files and folders.
Doc2Map is also an all-new way to explore a website – you can imagine a day when websites such as Wikipedia or Medium offer a Doc2Map to give people a global view of their content.
Doc2Map could also easily be converted into an ‘Image2Map’ to organize and display images. You can replace Word2Vec and Doc2Vec with an image autoencoder and, instead of using the words closest to the centroid to describe a cluster, feed the centroid position directly into the decoder part of the autoencoder to generate an image summarizing the content of that cluster.
References
[1] Tomas Mikolov, Efficient Estimation of Word Representations in Vector Space (2013), https://arxiv.org/abs/1301.3781.pdf
[2] Leland McInnes, John Healy, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (2018), https://arxiv.org/pdf/1802.03426.pdf
[3] Dimo Angelov, Top2Vec: Distributed Representations of Topics (2020), https://arxiv.org/abs/2008.09470.pdf
[4] Quoc Le, Tomas Mikolov, Distributed Representations of Sentences and Documents (2014), https://arxiv.org/pdf/1405.4053.pdf
[5] Ricardo J.G.B. Campello, Davoud Moulavi, and Joerg Sander, Density-Based Clustering Based on Hierarchical Density Estimates (2013), http://pdf.xuebalib.com:1262/2ac1mJln8ATx.pdf