
Five Open-Source Machine Learning Libraries Worth Checking Out

A whirlwind tour of five libraries that could be a great addition to your Data Science stack

Photo by Daniel on Unsplash

Open source is the backbone of machine learning; the two go hand in hand. The rapid advancements in this field would not have been possible without the contributions of the open-source community. Many of the most widely used tools in the machine learning community are open source, and every year more libraries get added to this ecosystem. In this article, I present a quick tour of some libraries that I recently encountered and which could be a great supplement to your machine learning stack.


1️⃣. Hummingbird

Hummingbird is a library for compiling trained traditional machine learning models into tensor computations. This means you can take advantage of hardware acceleration such as GPUs and TPUs, even for traditional machine learning models. This is beneficial on several levels:

  • Users can benefit from current and future optimizations implemented in neural network frameworks;
  • Users can benefit from native hardware acceleration;
  • Users can benefit from having a single platform that supports both traditional and neural network models;
  • Users do not have to re-engineer their models.
High-Level Architecture of Hummingbird Library | Source: official paper, "Compiling Classical ML Pipelines into Tensor Computations for One-size-fits-all Prediction Serving"

Additionally, Hummingbird provides a convenient, uniform "inference" API that follows the scikit-learn API. This allows swapping scikit-learn models with Hummingbird-generated ones without having to change the inference code.

Hummingbird to convert your trained traditional ML model | Image by Author

🛠 Github

https://github.com/microsoft/hummingbird

🔬 Papers

📋 Blog

Standardizing Traditional Machine Learning pipelines to Tensor Computation using Hummingbird

💻 Demo

Hummingbird’s syntax is very intuitive and minimal. To run your traditional ML model on DNN frameworks, you only need to import hummingbird.ml and add convert(model, 'dnn_framework') to your code. Below is an example using a scikit-learn random forest model and PyTorch as the target framework.

Using a scikit-learn random forest model and PyTorch as the target framework using Hummingbird | Image by Author
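
Here is a minimal sketch of that workflow, following Hummingbird's README; the toy data and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from hummingbird.ml import convert

# Create some toy data and train a scikit-learn random forest
X = np.random.rand(1000, 28).astype(np.float32)
y = np.random.randint(2, size=1000)
skl_model = RandomForestClassifier(n_estimators=10, max_depth=10)
skl_model.fit(X, y)

# Compile the trained model into tensor computations on PyTorch
model = convert(skl_model, 'pytorch')

# The converted model keeps the familiar scikit-learn inference API
preds = model.predict(X)

# If a GPU is available, the same model can run on it
# model.to('cuda')
```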

2️⃣. Top2Vec

Text documents contain a lot of information, and sifting through them manually is hard. Topic modeling is a technique widely used in industry to automatically discover the topics in a large collection of documents. Traditional and commonly used methods include Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA). However, these methods suffer from drawbacks such as ignoring the semantics and ordering of words. Top2Vec is an algorithm that leverages joint document and word semantic embedding to find topic vectors. Here is what the authors have to say:

This model does not require stop-word lists, stemming or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors with distance between them representing semantic similarity. Our experiments demonstrate that top2vec finds topics which are significantly more informative and representative of the corpus trained on than probabilistic generative models.

Pre-trained Universal Sentence Encoder and BERT sentence transformer models are also available as embedding options.

Once a Top2Vec model is trained, we can do the following:

Capabilities of Top2Vec | Image by Author
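
For reference, here is a rough sketch of those capabilities using Top2Vec's documented API; the keyword "science" and the counts below are illustrative, and `model` is a trained Top2Vec instance (see the demo further down):

```python
# How many topics were discovered automatically
model.get_num_topics()

# Topic sizes (number of documents per topic)
topic_sizes, topic_nums = model.get_topic_sizes()

# Top words for the first 10 topics
topic_words, word_scores, topic_nums = model.get_topics(10)

# Semantic search over documents and words
documents, document_scores, document_ids = model.search_documents_by_topic(
    topic_num=0, num_docs=5)
words, word_scores = model.similar_words(
    keywords=["science"], keywords_neg=[], num_words=10)
```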

🛠 Github

https://github.com/ddangelov/Top2Vec

🔬 Paper

Top2Vec: Distributed Representations of Topics

📜 Documentation

https://top2vec.readthedocs.io/en/latest/index.html

💻 Demo

Here is a demo of training a Top2Vec model on the 20newsgroups dataset. The example is taken from the official GitHub repo.

Demo of training a Top2Vec model on the 20newsgroups dataset | Image by Author
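
A minimal sketch of that demo, following the README example:

```python
from sklearn.datasets import fetch_20newsgroups
from top2vec import Top2Vec

# Load the 20 newsgroups corpus
newsgroups = fetch_20newsgroups(subset='all',
                                remove=('headers', 'footers', 'quotes'))

# Train Top2Vec; `speed` and `workers` are optional tuning knobs
model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)

# The number of topics is discovered automatically
print(model.get_num_topics())

# Find the topics most semantically similar to a keyword
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(
    keywords=["medicine"], num_topics=5)
```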

3️⃣. BERTopic

BERTopic is another topic modeling technique. It leverages BERT embeddings and a class-based TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. It also supports visualizations similar to LDAvis. Here is a quick summary of BERTopic's capabilities:

Capabilities of BERTopic | Image by Author

🛠 Github

https://github.com/MaartenGr/BERTopic

🔬 Documentation

https://maartengr.github.io/BERTopic/

📋 Blog

💻 Demo

A visualization of the topics generated after training a BERTopic model on the 20newsgroups dataset:

Visualize topics with BERTopic | Image by Author
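
A minimal sketch of fitting BERTopic on the same corpus and producing that visualization, assuming bertopic is installed:

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Load the 20 newsgroups corpus
docs = fetch_20newsgroups(subset='all',
                          remove=('headers', 'footers', 'quotes'))['data']

# Fit BERTopic: BERT embeddings -> clustering -> class-based TF-IDF
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Inspect the most frequent topics
print(topic_model.get_topic_info().head())

# Interactive, LDAvis-style visualization of the topic space
topic_model.visualize_topics()
```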

4️⃣. Captum

Captum is a model interpretability and understanding library for PyTorch. Captum means "comprehension" in Latin, and the library contains general-purpose implementations of integrated gradients, saliency maps, SmoothGrad, VarGrad, and others for PyTorch models. In addition, it offers quick integration for models built with domain-specific libraries such as torchvision, torchtext, and others. Captum also provides a web interface called Insights for easy visualization and access to a number of its interpretability algorithms.
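
To give a flavor of the core attribution API, here is a hedged sketch of attributing a model's output with Integrated Gradients; the toy model and input shape are illustrative stand-ins for your own trained network:

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# A toy classifier standing in for your own trained model
model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

input = torch.rand(1, 3, requires_grad=True)

# Attribute the score of class 1 back to the input features
ig = IntegratedGradients(model)
attributions, delta = ig.attribute(input, target=1,
                                   return_convergence_delta=True)
print(attributions, delta)
```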

Interpreting text models: IMDB sentiment analysis using Captum | Image from documentation

Captum is currently in beta and under active development!

🛠 Github

https://github.com/pytorch/captum

🔬 Documentation

https://captum.ai/

🎤 Slides

  • Their slides from NeurIPS 2019 can be found here
  • Their slides from the KDD 2020 tutorial can be found here.

💻 Demo

Here is how we can analyze a sample model on CIFAR10 via Captum Insights:

Analyzing a sample model on CIFAR10 via Captum Insights | Image by Author
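
A condensed sketch of that setup, adapted from Captum's CIFAR10 Insights tutorial; the trained classifier is assumed to exist, and the zero baseline and empty transform list are illustrative choices:

```python
import torch
import torchvision
import torchvision.transforms as transforms
from captum.insights import AttributionVisualizer, Batch
from captum.insights.attr_vis.features import ImageFeature

classes = ['plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
model = ...  # your trained CIFAR10 classifier (placeholder)

def baseline_func(input):
    # All-zero baseline for the attribution methods
    return input * 0

def formatted_data_iter():
    # Yield CIFAR10 test batches wrapped in Captum's Batch objects
    dataset = torchvision.datasets.CIFAR10(
        root='data/', train=False, download=True,
        transform=transforms.ToTensor())
    loader = iter(torch.utils.data.DataLoader(dataset, batch_size=4))
    while True:
        images, labels = next(loader)
        yield Batch(inputs=images, labels=labels)

visualizer = AttributionVisualizer(
    models=[model],
    score_func=lambda out: torch.nn.functional.softmax(out, 1),
    classes=classes,
    features=[ImageFeature("Photo",
                           baseline_transforms=[baseline_func],
                           input_transforms=[])],
    dataset=formatted_data_iter(),
)
visualizer.render()  # launches the Insights UI in a notebook
```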

5️⃣. Annoy

Annoy stands for Approximate Nearest Neighbors Oh Yeah. It is built in C++ but comes with bindings in Python, Java, Scala, R, and Ruby. Annoy is used to perform (approximate) nearest-neighbor queries in high-dimensional spaces. Even though many other libraries perform the same operation, Annoy comes with some great extras: it creates large read-only file-based data structures that are mmapped into memory, so that many processes can share the same data. Annoy, built by Erik Bernhardsson, is used at Spotify for music recommendations, where it searches for similar users and items.

We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern – Spotify

🛠 Github

https://github.com/spotify/annoy

🎤 Slides

💻 Demo

Here’s how we can use Annoy to find the 100 nearest neighbors.

Finding the 100 nearest neighbors using Annoy | Image by Author
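
A minimal sketch, following the README example; the vector dimension and random data are illustrative:

```python
import random
from annoy import AnnoyIndex

f = 40  # dimensionality of the vectors
t = AnnoyIndex(f, 'angular')  # angular (cosine-like) distance

# Index 1,000 random vectors
for i in range(1000):
    v = [random.gauss(0, 1) for _ in range(f)]
    t.add_item(i, v)

t.build(10)        # build 10 trees; more trees -> better accuracy
t.save('test.ann')

# Load (mmap) the index from disk; many processes can share it
u = AnnoyIndex(f, 'angular')
u.load('test.ann')
print(u.get_nns_by_item(0, 100))  # the 100 nearest neighbors of item 0
```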

Wrap Up

So these were the libraries that I found interesting, useful, and worth sharing. I’m sure you’ll want to explore them and see how you could use them in your area of work. Even though we already have innumerable libraries to tinker with, exploring new ones is always fun and enlightening.


👉 Interested in reading other articles authored by me? This repo contains all my articles, organized by category.

