
Open source is the backbone of machine learning; the two go hand in hand. The rapid advancements in this field wouldn’t have been possible without the contributions of the open-source community. Many of the most widely used tools in machine learning are open source, and every year more libraries are added to this ecosystem. In this article, I present a quick tour of some libraries I recently encountered that could be a great supplement to your machine learning stack.
1️⃣. Hummingbird
Hummingbird is a library for compiling trained traditional machine learning models into tensor computations. This means you can take advantage of hardware acceleration like GPUs and TPUs, even for traditional machine learning models. This is beneficial on several levels:
- Users can benefit from current and future optimizations implemented in neural network frameworks;
- Users can benefit from native hardware acceleration;
- Users can benefit from having a unique platform that supports both traditional and neural network models;
- Users do not have to re-engineer their models.
Additionally, Hummingbird provides a convenient uniform "inference" API that follows the Sklearn API. This allows you to swap Sklearn models with Hummingbird-generated ones without having to change the inference code.

🛠 Github
https://github.com/microsoft/hummingbird
🔬 Papers:
- A Tensor Compiler for Unified Machine Learning Prediction Serving.
- Compiling Classical ML Pipelines into Tensor Computations for One-size-fits-all Prediction Serving.
📋 Blog
Standardizing Traditional Machine Learning pipelines to Tensor Computation using Hummingbird
💻 Demo
Hummingbird’s syntax is very intuitive and minimal. To run your traditional ML model on DNN frameworks, you only need to import `hummingbird.ml` and add `convert(model, 'dnn_framework')` to your code. Below is an example using a scikit-learn random forest model and PyTorch as the target framework.
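Here is a minimal sketch of that workflow; the random data and model hyperparameters are made up purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from hummingbird.ml import convert

# Create some random data for binary classification
X = np.random.rand(100000, 28).astype(np.float32)
y = np.random.randint(2, size=100000)

# Train a scikit-learn random forest
skl_model = RandomForestClassifier(n_estimators=10, max_depth=10)
skl_model.fit(X, y)

# Compile the scikit-learn model into a PyTorch model
model = convert(skl_model, 'pytorch')

# Run inference with the same Sklearn-style API
preds = model.predict(X)

# Optionally move the model to a GPU and predict there
# model.to('cuda')
# preds = model.predict(X)
```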

2️⃣. Top2Vec
Text documents contain a lot of information, and sifting through them manually is hard. Topic modeling is a technique widely used in industry to automatically discover the topics in a large collection of documents. Traditional and commonly used methods include Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA). However, these methods suffer from drawbacks such as not considering the semantics or ordering of words. Top2Vec is an algorithm that leverages joint document and word semantic embedding to find topic vectors. Here is what the authors have to say:
This model does not require stop-word lists, stemming or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors with distance between them representing semantic similarity. Our experiments demonstrate that top2vec finds topics which are significantly more informative and representative of the corpus trained on than probabilistic generative models. Pre-trained Universal Sentence Encoder and BERT Sentence Transformer models can also be used for the embedding step.
Once a Top2Vec model is trained, we can do the following:
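For example (a sketch based on the methods documented in the Top2Vec README; it assumes `model` is an already trained Top2Vec instance, and the keywords and topic numbers are illustrative):

```python
# Assumes `model` is a trained Top2Vec instance (see the demo below)
model.get_num_topics()                                        # number of topics found
topic_sizes, topic_nums = model.get_topic_sizes()             # documents per topic
topic_words, word_scores, topic_nums = model.get_topics(10)   # top words per topic

# Semantic search over topics, documents, and words
model.search_topics(keywords=["medicine"], num_topics=5)
model.search_documents_by_topic(topic_num=0, num_docs=5)
model.similar_words(keywords=["science"], keywords_neg=[], num_words=20)
```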

🛠 Github
https://github.com/ddangelov/Top2Vec
🔬 Paper
Top2Vec: Distributed Representations of Topics
📜 Documentation:
https://top2vec.readthedocs.io/en/latest/index.html
💻 Demo
Here is a demo of training a Top2Vec model on the 20newsgroups dataset. The example has been taken from their official Github repo.
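The snippet below reconstructs that example (a minimal sketch; `speed` and `workers` are tunable training parameters):

```python
from sklearn.datasets import fetch_20newsgroups
from top2vec import Top2Vec

# Load the 20newsgroups corpus
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Train a Top2Vec model on the raw documents
model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)
```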

3️⃣. BERTopic
BERTopic is another topic modeling technique that leverages BERT embeddings and a class-based TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. It also supports visualizations similar to LDAvis. Here is a quick summary of the capabilities of BERTopic.

🛠 Github
https://github.com/MaartenGr/BERTopic
🔬 Documentation
https://maartengr.github.io/BERTopic/
📋 Blog
💻 Demo
A visualization of the topics generated after training a BERTopic model on the 20newsgroups dataset.
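A minimal sketch of how such a model and visualization can be produced (default settings; parameter choices are illustrative):

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Load the 20newsgroups corpus
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Fit BERTopic and get a topic assignment per document
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Inspect the most frequent topics
print(topic_model.get_topic_info().head())

# Interactive, LDAvis-style intertopic distance map
topic_model.visualize_topics()
```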

4️⃣. Captum
Captum is a model interpretability and understanding library for PyTorch. Captum means comprehension in Latin, and it contains general-purpose implementations of integrated gradients, saliency maps, smoothgrad, vargrad, and others for PyTorch models. In addition, it has quick integration for models built with domain-specific libraries such as torchvision, torchtext, and others. Captum also provides a web interface called Insights for easy visualization and access to its interpretability algorithms.

Captum is currently in beta and under active development!
🛠 Github
https://github.com/pytorch/captum
🔬 Documentation
https://captum.ai
🎤 Slides
- Their slides from NeurIPS 2019 can be found here
- Their slides from the KDD 2020 tutorial can be found here.
💻 Demo
Here is how we can analyze a sample model on CIFAR10 via Captum Insights:
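The sketch below is adapted from Captum’s Insights example; the `Net` class and the `cifar_net.pt` checkpoint are hypothetical placeholders for a small CNN trained on CIFAR10.

```python
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

from captum.insights import AttributionVisualizer, Batch
from captum.insights.attr_vis.features import ImageFeature

# Hypothetical: `Net` is a small CNN trained on CIFAR10 and saved to cifar_net.pt
model = Net()
model.load_state_dict(torch.load("cifar_net.pt"))
model.eval()

classes = ("plane", "car", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck")

def baseline_func(input):
    # All-zero (black) image as the attribution baseline
    return input * 0

def formatted_data_iter():
    # Yield test batches in the format Captum Insights expects
    dataset = torchvision.datasets.CIFAR10(
        root="data", train=False, download=True, transform=transforms.ToTensor()
    )
    dataloader = iter(torch.utils.data.DataLoader(dataset, batch_size=4))
    while True:
        images, labels = next(dataloader)
        yield Batch(inputs=images, labels=labels)

visualizer = AttributionVisualizer(
    models=[model],
    score_func=lambda out: F.softmax(out, dim=1),
    classes=classes,
    features=[
        ImageFeature(
            "Photo",
            baseline_transforms=[baseline_func],
            input_transforms=[transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))],
        )
    ],
    dataset=formatted_data_iter(),
)

visualizer.render()  # launches the Captum Insights UI
```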

5️⃣. Annoy
Annoy stands for Approximate Nearest Neighbors Oh Yeah. It is built in C++ but comes with bindings in Python, Java, Scala, R, and Ruby. Annoy is used to perform (approximate) nearest-neighbor queries in high-dimensional spaces. Even though many other libraries perform the same operation, Annoy comes with some great advantages: it creates large read-only file-based data structures that are mmapped into memory, so many processes can share the same data. Annoy, built by Erik Bernhardsson, is used at Spotify for music recommendations, where it searches for similar users and items.
We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern – Spotify
🛠 Github
https://github.com/spotify/annoy
🎤 Slides
💻 Demo
Here’s how we can use Annoy to find the 100 nearest neighbors.
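A sketch close to the README example (the vector length, random vectors, and tree count are arbitrary):

```python
import random
from annoy import AnnoyIndex

f = 40  # length of the vectors to be indexed
t = AnnoyIndex(f, 'angular')  # 'angular' = cosine-like distance
for i in range(1000):
    v = [random.gauss(0, 1) for _ in range(f)]
    t.add_item(i, v)

t.build(10)        # build a forest of 10 trees
t.save('test.ann')

# Indexes are just files on disk, mmapped on load
u = AnnoyIndex(f, 'angular')
u.load('test.ann')
print(u.get_nns_by_item(0, 100))  # the 100 nearest neighbors of item 0
```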

Wrap Up
So these were the libraries that I found interesting, useful, and worth sharing. I’m sure you would like to explore them and see how you could use them in your area of work. Even though we already have innumerable libraries to tinker with, exploring new ones is always fun and a great way to learn.
👉 Interested in reading my other articles? This repo contains all of my articles, organized by category.