How to Implement Any Machine Learning Project with 3 Lines of Code

Concrete examples of everything you can do with Libra

Ugo Loobuyck
18 min read · Sep 9, 2020


Wouldn’t it be great to be able to solve complex machine learning problems quickly and without extensive knowledge of the amazing TensorFlow or PyTorch frameworks?

Well, Libra gives you that power, and today I will give you concrete examples of machine learning projects that you can easily implement with this exciting and elegant library, along with results.

In a previous article, I presented Libra, its features and downsides, so feel free to check it out before diving into examples!

Motivation

I got interested in Machine Learning a couple of years ago, but at the time things weren’t as easy as creating a Python file, writing a few lines of code, running it, and enjoying results you could visualize.

Instead, you had to go through all of PyTorch or Keras documentation to figure out how to put the components together and how they interact with each other, which was kind of overwhelming at first.

This article’s goal is to give you an overview of how easy it is to get acquainted with Machine Learning, thanks to a framework that can ease your way into eventually learning more complex processes.

Libra’s client class is central and implements a number of methods (called queries) that you can easily call to train models, infer results, etc.
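In practice, a minimal session looks something like this (the file name is hypothetical; the individual queries are covered one by one below):

from libra import client

# Point the client at any CSV file (hypothetical file name)
new_client = client('dataset.csv')

# Run a query; Libra infers the task and target column from the instruction
new_client.neural_network_query('predict median house value')

# The trained model and its statistics now live in the client object
new_client.analyze()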

Here is a list of the currently available models and tasks in Libra 1.1.2, which I will cover in more detail in this article:

  1. Neural networks (classification)
  2. Support Vector Machines
  3. Decision trees
  4. Nearest neighbors
  5. Neural networks (regression)
  6. XGBoost
  7. K-means clustering
  8. Convolutional neural networks
  9. Text classification
  10. Text summarization
  11. Caption Generation
  12. Text generation
  13. Named entity recognition

Along with each query, I will link some documentation so that you can have a more detailed view of the task/model. Let me specify that I am running every model on Google Colab.

I will try as much as possible to use a variety of datasets and problems to solve for the different tasks. Personally, Libra got me extremely excited with the NLP-related queries, so don’t hesitate to check out the queries that interest you the most!

Photo by Matt Duncan on Unsplash

Classification

1. Neural networks (classification)

Before anything else, make sure to understand the core components of neural networks as well as their mechanisms. There is a ton of online documentation about feedforward neural networks in their most basic form (here, here and here), so make sure to check it out.

To demonstrate classification, I will model constructiveness in Amazon reviews (4,000 reviews) with Libra’s neural network. I have worked on a similar task in a previous article and in my master’s thesis (in which I describe the dataset in more detail), in case you are interested.

This is what the dataset looks like:

You can perform classification in Libra with the classification_query_ann() method from the client object. It automatically calls a data reader, fits your data, evaluates performance on a subset of your data, and displays plots of the training process.

The following code calls the neural network. Notice that you can drop .csv columns with the drop argument and embed text columns with the text argument:
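(A sketch of that call; the file and column names are hypothetical.)

from libra import client

# Hypothetical file and column names for the Amazon reviews dataset
client_classif = client('amazon_reviews_constructiveness.csv')
client_classif.classification_query_ann(
    'predict constructiveness',
    drop=['review_id'],    # drop non-predictive columns
    text=['review_text'],  # embed the raw text column
    epochs=20,
)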

The Keras neural network is composed of an input Dense layer, a main Dense layer with 64 neurons and ReLU activation, and an output softmax layer. Pretty straightforward, unless…

After the first training run, retraining is performed by adding more Dense layers to the architecture until the validation accuracy is maximized.

Image by Author

The model is then available in your client object for predictions and analysis. You can of course run a new query to tweak the epochs hyperparameter; the resulting model will also be stored in the client object!

The analyze() method gives us statistics about the best trained model:
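Assuming the client_classif object from the sketch above:

# Display test metrics and plots for the best model found
client_classif.analyze()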

Image by Author

Results seem to be rather good on the test set (20% of the original .csv file by default). Test statistics are really comprehensible and quite visual, which is another one of Libra’s strengths.

2. Support Vector Machines

SVMs have consistently been used to solve machine learning problems over the past decades, especially in regression and classification tasks.

They basically rely on a simple geometric concept: draw a line (or a hyperplane in higher dimensions) so that the margin between it and the nearest data points of each class is maximized.

Support Vector Machines are of course well documented (here, here). I will repeat it throughout this article: make sure to understand the math behind the models; it will help you in the long run.

Libra relies on Scikit-learn’s implementation in its query, which is widely used by data scientists and engineers. This great documentation will teach you what you need to know.

Once again, the strength of this framework is the wrapping technique that allows you to process your raw data all at once. Check the code to learn more about the different query arguments.

I will again use the Amazon Reviews dataset to compare performance with the neural networks. The code below is very similar to the previous one, dropping and embedding the same columns; I simply switched the kernel from “linear” to “rbf”, sklearn’s default.
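A sketch of that call, assuming Libra’s svm_query() naming and the same hypothetical file and column names as before:

from libra import client

client_classif = client('amazon_reviews_constructiveness.csv')
client_classif.svm_query(
    'predict constructiveness',
    drop=['review_id'],
    text=['review_text'],
    kernel='rbf',  # switched from 'linear' to sklearn's default
)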

The following results show that the Support Vector Machine is slightly less effective with this hyperparameter combination than the feedforward neural network on this particular dataset. I leave you the pleasure of tuning your model as best you can on your own dataset.

Image by Author

3. Decision trees

Continuing with classification: decision trees are widely used because their results are easy to interpret.

Indeed, unlike neural networks or other algorithms like SVMs, decision trees rely on a series of simple YES/NO questions to progressively split the data points and make predictions, hence the tree shape.

A lot of literature is available online for decision trees (here, here), which are the basis for more complex models like Gradient Boosting, which I will talk about later on.

Decision trees usually expose a series of interesting hyperparameters to tweak, like the maximum depth of the tree, the maximum number of nodes, etc. Since Libra uses Scikit-learn to train the model, you have access to these.
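A sketch, assuming Libra’s decision_tree_query() naming and reusing the hypothetical client from the earlier snippets:

client_classif.decision_tree_query(
    'predict constructiveness',
    drop=['review_id'],
    text=['review_text'],
    max_depth=10,  # illustrative sklearn hyperparameter
)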

Image by Author

Okay, it seems like decision trees are not well suited to classifying something as complex and subjective as the constructiveness of a review. Neural networks usually manage to better capture textual relationships, so perhaps more feature engineering would boost the decision tree results.

4. Nearest neighbors

As its name suggests, the nearest neighbors algorithm relies on distances between data points. It can be used for clustering (unsupervised) as well as for classification or regression (supervised), making it an extremely simple yet versatile model.

For more explanations about nearest neighbors, check out the great documentation by scikit-learn. Libra uses this framework to train and run a classification nearest neighbors algorithm, so I’ll only focus on this part here.

To illustrate the nearest neighbor algorithm in Libra, I will again use the Amazon review constructiveness dataset to compare results with the previously evaluated neural network, SVM and DT queries.

Don’t hesitate to tweak the available hyperparameters to your convenience and depending on your input data!
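A sketch, assuming Libra’s nearest_neighbor_query() naming and the same hypothetical names as above:

client_classif.nearest_neighbor_query(
    'predict constructiveness',
    drop=['review_id'],
    text=['review_text'],
)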

Image by Author

It looks like nearest neighbors perform better than decision trees but worse than support vector machines and neural networks on this particular dataset.

You can access the resulting model’s configuration with the following line of code:

client_classif.models["nearest_neighbor"]

Note that the test set varies from one model to another, since the Libra client does not keep track of the test instance indices across models.

Regression

5. Neural networks (regression)

When performing a regression task, we want to predict a continuous variable on unseen data.

Let’s use this dataset I found on Kaggle, where the variable to predict is the housing price in King County, USA, depending on several variables. Here is a Pandas description of the set:

Image by Author

With Libra, you can either use the neural_network_query() method, which deduces from your target variable that you are requesting a regression task, or call regression_query_ann() directly.

Here is the code to load the data, train a feedforward neural network (running a Keras model and techniques similar to the classification NN under the hood), and analyze/visualize the results:
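(A sketch of those three lines; the file name matches the Kaggle set, and the dropped columns are my choice.)

from libra import client

# King County housing data from Kaggle
client_regress = client('kc_house_data.csv')

# neural_network_query() deduces a regression task from the 'price' target
client_regress.neural_network_query(
    'predict price',
    drop=['id', 'date', 'zipcode'],  # columns I judged non-essential
)

# Test statistics and loss plots
client_regress.analyze()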

For the sake of convenience, I did not perform an in-depth EDA, feature engineering, imputing, etc. I simply removed the columns that I judged non-essential.

The query automatically imputes missing values and scales the different variables, meaning that you are relatively free to feed the network with imperfect datasets.

This is the output of the three lines of code:

Image by Author

Not bad! Libra outputs a series of testing statistics (MSE, MAE) as well as a training/evaluation loss plot. Your model and several variables are automatically saved in your client instance, which makes them easy to reuse.

You can set aside part of the data for further inference; I’ll let you try that yourself. Remember that Libra is still in development, so please be patient if you encounter bugs or not-yet-implemented features.

6. XGBoost

If you know Kaggle and its competitions, then you must be somewhat familiar with the XGBoost and Gradient Boosting Machine algorithms (here is the difference between these two).

In the past years, boosting algorithms have largely contributed to improving results on regression and classification tasks, making them practically unavoidable.

This amazing documentation explains everything you need to know to understand and use XGBoost.

Libra uses the XGBoost library adapted for the Scikit-learn API, so if you have already implemented and run a Gradient Boosting Machine with sklearn you shouldn’t be confused by the available hyperparameters.

Since XGBoost is extremely powerful on regression tasks, I will load and use the same housing price data as with the neural networks.
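A sketch of the call, assuming Libra’s xgboost_query() naming:

from libra import client

client_xgb = client('kc_house_data.csv')
client_xgb.xgboost_query(
    'predict price',
    drop=['id', 'date', 'zipcode'],
)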

Unfortunately, at the time of writing, the XGBoost query seems to run indefinitely on Google Colab, so I am currently unable to provide an output for it. Stay tuned for results, or simply try it out in your own environment!

Clustering

7. K-means clustering

Clustering is an unsupervised machine learning technique used to group data points into an arbitrary number of clusters.

‘Unsupervised’ techniques assume that your data is unlabeled; these algorithms therefore won’t be precise when classifying data or modeling a continuous variable, but they are very useful for detecting patterns in the data and drawing boundaries between several groups/clusters.

Although clustering is less popular than classification, QA or NER, it is still interesting to see how you can get insights into your data with 3 simple lines of code.

You can find more detailed information about the K-means clustering algorithm in this article. Basically, the “K” in K-means is the number of centroids to which the data points can be assigned.

I will use the Credit Card dataset found on Kaggle to hopefully find patterns in the data. Make sure to check it out and to try it out yourself.

Ideally you would perform feature engineering to properly scale each feature, remove potential outliers, impute missing values, etc. Libra of course handles these steps in a generic way, but understanding features and how to get more out of them is a great exercise.

As usual, we can simply feed our client object with our unprocessed dataset. Since the task is unsupervised, you do not need to tell Libra which target variable you want to model, there is none!
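A sketch, assuming Libra’s kmeans_clustering_query() naming and that no instruction argument is needed (the file name is hypothetical; CUST_ID is the identifier column in the Kaggle set):

from libra import client

client_cluster = client('credit_card_data.csv')

# No instruction or target variable needed for an unsupervised task
client_cluster.kmeans_clustering_query(drop=['CUST_ID'])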

Image by Author

The algorithm has found that 18 clusters minimized inertia, which is the metric Libra uses for clustering.

Once again, as of today I am unable to perform inference with Libra models since the framework is still in development and unfinished, so let’s be patient.

Image Processing

8. Convolutional neural networks

(Animation of a neural network classifying MNIST digits: https://gfycat.com/fr/smoggylittleflickertailsquirrel-machine-learning-neural-networks-mnist)

Convolutional neural networks have been around for a while now and still offer good results, and that is no accident.

The convolution technique in neural networks, very well explained in this article, makes it possible to pick up patterns in different types of data, most notably images (and hence videos) and text.

Libra implements CNNs for image classification and conveniently preprocesses your data automatically. You can basically pass the paths to all your images to the query, and it will do the rest.

The documentation says that there are 3 possible modes to read your data, depending on how your image folders are arranged, whether you have separate training and test sets, etc.

Another cool thing is that you can choose to use an already existing model architecture such as ResNet50, VGG19… along with pre-trained weights from ImageNet.

To demonstrate, I will use the data found in the Libra GitHub repository, which consists of pictures of 42 written letters (a, A, b, B, c, d in Libra 1.1.2). The number of images is far too small to train a good image classification model, but this will do for now.
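A sketch, assuming Libra’s convolutional_query() naming, that the client can be created without a dataset for image tasks, and a hypothetical path to the cloned sample images:

from libra import client

client_cnn = client()  # no CSV needed for image data (assumption)
client_cnn.convolutional_query(
    'classify written letters',
    data_path='libra/data/image_data',  # hypothetical path to the samples
    epochs=10,
    show_feature_map=False,  # set True to visualize each layer (see below)
)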

The show_feature_map argument allows you to actually see what’s going on inside each layer of the model, which is really nice and can inform you about the inner workings of a CNN (I do not include it here since the display is quite large).

As usual, you can access the trained model in your client object, along with test statistics.

Image by Author

Natural Language Processing

9. Text classification

Text classification is one of the most popular machine learning tasks for several reasons:

  • Text is everywhere, especially online, which gives us a good excuse to do things with it.
  • The task itself is rather rudimentary and asks for a simple output.
  • The data is unstructured by nature, and therefore challenging to process and learn from with simple tools.
  • The real world applications are numerous (sentiment analysis, spam detection, polarity detection, language detection, etc.)

What Libra calls text classification corresponds to sentiment analysis. It implements a Keras LSTM-based model, composed of an embedding layer, a recurrent layer with 128 units, and an output layer for classification.

I will use the super famous IMDb sentiment analysis dataset to illustrate Libra’s text classification query.

I had to adapt the dataset a tiny bit, shortening it and replacing the target column name, but the essentials are here.
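A sketch, assuming Libra’s text_classification_query() naming and hypothetical file and column names for the adapted IMDb set:

from libra import client

text_class_client = client('imdb_sentiment.csv')
text_class_client.text_classification_query('predict sentiment of reviews')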

Training typically takes a very long time even though the model is rather small: Libra currently exposes a max_features argument to limit the vocabulary size, but does not actually apply this limit during tokenization, causing errors when it is set too low and memory inefficiency when it is set high.

Image by Author

Well, the results are not great and this would deserve some hyperparameter tuning among other things, but you get the idea. Simply try it out yourself with your own dataset!

Once your model is trained and stored in the client object, you can actually use it with the predict() method or the classify_text() method (the former calls the latter), as such:

>>> text_class_client.classify_text("I loved this movie!!!")
# 'positive'

10. Text summarization

There are two main types of summarization: extractive summarization, which selects the most relevant passages directly from the source text, and abstractive summarization, which generates new sentences to convey the source’s meaning.

Needless to say, abstractive summarization is way harder to perform, since it relies on making actual sense of what is happening in the text and encapsulating the whole thing in an understandable way.

With abstractive summarization we enter the world of AI that is not just about probabilities but tries to reproduce the human ability to understand, create, and reason.

With the rapid rise of Transformers and transfer learning, text summarization is making big steps forward (keep track of the field’s research here).

Libra uses the T5 pre-trained model (small) [1] implemented in HuggingFace’s transformers library to perform summarization.

Datasets for abstractive text summarization are currently hard to find, but you can already try it with the sample provided by Libra. Let’s download it from their GitHub repository and fine-tune the T5 model on it. On Google Colab, this is easily doable with:

!git clone https://github.com/Palashio/libra.git

Now let’s fine-tune that beast on the available couple of samples:
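(A sketch, assuming Libra’s summarization_query() naming and a hypothetical path to the sample CSV inside the cloned repository.)

from libra import client

sum_client = client('libra/data/summarization_sample.csv')  # hypothetical path
sum_client.summarization_query('summarize the text column', epochs=3)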

Keep in mind that this is merely an example aimed at demonstrating Libra’s mechanics; I am definitely not trying to reach state-of-the-art performance.

Nevertheless, the huge advantage of pre-trained, transfer learning-based models is that even with very little fine-tuning, the model retains a global understanding of language and will probably perform decently on most tasks.

Fine-tuning metrics and plots for text summarization

Inference time!

You can access the model in your client object with the get_summary() method. To illustrate, I will summarize the first 3 paragraphs of J.K. Rowling’s Harry Potter and the Sorcerer’s Stone.
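Assuming the sum_client object from the fine-tuning sketch above, and that get_summary() takes the raw text directly:

# The three opening paragraphs, stored as a single string (truncated here)
opening = "Mr. and Mrs. Dursley, of number four, Privet Drive, ..."
print(sum_client.get_summary(opening))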

J.K. Rowling — Harry Potter and the Sorcerer’s Stone

Here is the result:

I find it pretty impressive how the generation model handled coreferences. “…were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense” appears in both original and inferred samples, but the model used “the Dursleys” instead of “they” or “Mr. and Mrs. Dursley”.

Similarly, the pronoun “they” is used in the second sentence and refers to the noun phrase “the Dursleys” from the first sentence, hence forming perfectly coherent English sentences.

The rest is simply extracted from the original text, and unfortunately we currently have no control over the maximum output length. I strongly encourage you to read the paper [1] to understand how T5 works, and try it yourself with the Transformers library.

11. Caption Generation

Caption generation is a pretty interesting task that mixes image analysis and natural language generation, two of the currently hottest research topics.

Libra uses InceptionV3 to perform this task and implements a CNN-RNN architecture to infer textual descriptions from visual patterns (the CNN is the encoder while the RNN is the decoder).

This article explains well how the model works internally, as described in the original paper [2]. Make sure to read it if you’re interested in the task and want to implement something of that kind!

Don’t hesitate to check the code directly in Libra’s repository to better understand how the caption generator is trained. This query contains a lot of custom code although it heavily relies on TensorFlow and Keras, which is interesting if you want to take it as an example to implement your own project.

Several datasets are publicly available for that task, like the VizWiz dataset or the COCO dataset, along with challenges.

Unfortunately, at the time of writing, the caption generation query fails for me, but hopefully the issue will soon be solved. The implementation is supposed to work as follows:

You can pass a .csv file to the client object that holds a column with paths to the images and a column with captions.

Don’t hesitate to play with hyperparameters such as epochs, top_k (maximum number of words in the vocabulary), embedding_dim and units (number of recurrent units in the decoder).
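A sketch of that expected usage, assuming Libra’s image_caption_query() naming (the file name is hypothetical, and the hyperparameter values are arbitrary):

from libra import client

caption_client = client('captions.csv')  # image-path column + caption column
caption_client.image_caption_query(
    'generate image captions',
    epochs=10,
    top_k=5000,         # maximum vocabulary size
    embedding_dim=256,  # word embedding dimension
    units=512,          # recurrent units in the decoder
)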

12. Text generation

In my previous article about Libra, I used text generation as an example to show the framework’s user-friendliness.

Instead of spending hours learning how to use a transformer-based language model to generate text with TF or PyTorch, the generate_text() query lets you call OpenAI’s GPT-2 pre-trained model (12 layers, 117M parameters) and generate as much text as you want, all with one line of code.

You can load a file with the .txt extension as a base for text generation, or simply use the prefix query argument as the text seed.

Here’s a concrete example: I loaded the very beginning of Tolkien’s Silmarillion and called the text generation query.
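A sketch of that call (the .txt file name is hypothetical):

from libra import client

# Seed the generator with the opening of The Silmarillion
gen_client = client('silmarillion_opening.txt')
gen_client.generate_text()

# Alternatively, skip the file and pass the seed directly:
# gen_client.generate_text(prefix='There was Eru, the One, ...')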

And the results (the first paragraph of the following text is the input, the rest is generated by this pre-trained GPT-2):

J.R.R. Tolkien — The Silmarillion

Pretty amazing, right? Even though you can tell that something is wrong with the long term meaning of the story, the grammar is quite neat and it would be hard to tell if a human or a machine wrote each sentence individually.

You can also change the maximum output length, the number of top-K words considered for next-word prediction, etc. Libra’s GPT-2 uses Top-K sampling coupled with Top-P sampling, on which you can find more information [xx].

13. Named entity recognition

Named entity recognition (NER) is a sequence labelling task aiming at detecting, chunking or extracting named entities from a text, such as places, names, organizations, genes, etc.

A lot of resources are available online (here, here) to understand the ins and outs of the task. A currently popular real-life use case is in the medical world: extracting medical entities from patient records, for example.

NER implemented in Libra is pretty straightforward, and simply loads an NER pipeline from HuggingFace’s Transformers library. Make sure to understand how NER models work though.

You can simply load any .csv that contains a textual column, and call the named_entity_query() method with an instruction that determines which column will be studied.

For demonstration I will just create a tiny dataset with Pandas.
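A minimal sketch (the sentences are made up for illustration):

import pandas as pd
from libra import client

# Build and save a tiny demonstration dataset
pd.DataFrame({'text': [
    'Barack Obama was born in Hawaii and studied at Harvard.',
    'Google opened a new office in Paris last year.',
]}).to_csv('ner_demo.csv', index=False)

# The instruction tells Libra which column to study
ner_client = client('ner_demo.csv')
ner_client.named_entity_query('get the named entities in text')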

Image by Author

As you can see from the results, the transformers pipeline uses BERT for TensorFlow along with a BERT tokenizer. You can also notice that for this (rather simple) text sample, entities were correctly attributed to tags (I-PER, I-LOC, etc.).

What’s next?

Obviously there are many more ML tasks out there, some of which, like Question Answering, are nowadays at the very center of research.

I believe that the Libra development team has great plans for their framework and that, with some time on their hands, all current issues will be fixed.

More generally, more modularity and robustness in the implementation would be great additions to Libra, along with an inference system that should be as easy as the training system.

For instance, being able to split your data and test the different models in a couple more lines of code wouldn’t hurt, and would allow beginners and intermediate users to actually visualize concrete results.

More thorough tests would also be welcome to avoid a good number of bugs, unimplemented methods and dependency conflicts; I guess that will simply happen as the project grows, so no worries on this side.

On the positive side, Libra’s creator indicated in a LinkedIn post that more state-of-the-art, research-oriented features will be added to the library, which is really promising, and I’m excited to see all of this happen!

Conclusion

Libra offers a quite impressive collection of queries that make it relatively easy to get acquainted with different tasks, models and concepts (assuming that you don’t encounter any bugs and that you have a well-formatted CSV file for training and another one for inference).

Of course, reaching a good balance between generalization and simplicity is ambitious.

Do we actually want to be able to implement ANY machine learning task with ANY data in a few lines of code?

You might think that this is the “easy way” of doing machine learning, and I definitely agree with you. Getting your hands dirty with frameworks like TF or PT will allow you to develop much better skills in the long run.

On the other hand, Libra grants basically anyone access to state-of-the-art models and implementations extremely fast, which I see as an excellent way to get started and actually visualize what you’re doing.

Thank you so much for reading through this article, I hope you took as much pleasure reading as I did writing it. Going through all these models and tasks taught me so much and I’m grateful to have wonderful people getting interested in reading about ML, AI, NLP and so on. Take care and see you soon!

References

[1]: Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., & Liu, P. J. (2019). Exploring the limits of transfer learning with a unified text-to-text Transformer. arXiv preprint arXiv:1910.10683.

[2]: Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818–2826).
