
AutoNLP: Automatic Text Classification with SOTA Models

A step-by-step guide to understanding and using AutoNLP from scratch

Figure 1. AutoNLP | Image by author | Icon taken from Freepik

Developing an end-to-end Natural Language Processing model is not an easy task. Several factors must be considered, such as model selection, data preprocessing, training, optimization, and the infrastructure where the model will be served. For this reason, interesting alternatives are emerging today to streamline and automate this set of tasks. Such is the case with AutoNLP, a tool that allows us to automate the end-to-end life cycle of an NLP model.

In this blog, we will see what AutoNLP is and how to use it, covering the installation process, project creation, training, metrics, cost estimation, and model serving. This blog will cover the following sections:

  • What is AutoNLP?
  • AutoNLP in practice

What is AutoNLP?

AutoNLP [1] is a tool developed by the Hugging Face [2] team to automate the process of creating end-to-end NLP models; it was launched in its beta phase in March 2021. AutoNLP aims to automate each phase that makes up the life cycle of an NLP model, from training and optimizing the model to deploying it.

"AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem." – AutoNLP team

One of the great virtues of AutoNLP is that it implements state-of-the-art models for binary classification, multi-class classification, and entity recognition, with support for 8 languages: English, German, French, Spanish, Finnish, Swedish, Hindi, and Dutch. Likewise, AutoNLP takes care of the optimization and fine-tuning of the models. On the security and privacy side, AutoNLP protects data transfers with SSL, and the data remains private to each user account.

As we can see, AutoNLP emerges as a tool that facilitates and speeds up the process of creating NLP models. In the next section, we will see what the experience is like from start to finish when creating a text classification model using AutoNLP.

AutoNLP in practice

For this example, we are going to tackle a binary classification problem. The dataset was taken from the Semantic Analysis at SEPLN (TASS) [3] workshop and consists of tweets in Spanish labeled with three classes: negative, positive, and neutral. For the purposes of this example, we will remove the samples from the neutral class so that the dataset fits a binary classification problem. You can download the dataset here. In Figure 2 we can observe the characteristics of the training and validation datasets.

Figure 2. Training and Validation datasets | Image by author

Now, we don’t have to do anything else to the dataset. What follows is to start using AutoNLP, so let’s go for it!

Installing AutoNLP

To make use of the Hugging Face infrastructure through the AutoNLP tool, we need to register and create an account in which our models will be contained as well as our datasets. This account will provide a token that will be used to establish communication between AutoNLP CLI and the Hugging Face infrastructure.

Installing AutoNLP can be done directly from the command line via pip, as follows:

Figure 3. Installing autonlp | Image by author
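For readers who prefer text over screenshots, the installation boils down to:

pip install autonlp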

We are also going to need to install Git Large File Storage (Git LFS). In my case, since I am working on macOS, I install it as follows:

Figure 4. Installing git-lfs | Image by author
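On macOS with Homebrew, that is:

brew install git-lfs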

Then, to set up git-lfs, you need to type:

Figure 5. Setting up git-lfs | Image by author
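This is the standard git-lfs initialization:

git lfs install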

Once autonlp and its requirements are installed, we can proceed to create a project, upload the data, and train our model. Let’s go for it!

Creating an AutoNLP project

The first step in creating a project is to authenticate. For this, we will only need the token found in the settings of our account; using the autonlp CLI, we type:

Figure 6. Logging in | Image by author
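In text form, the login command looks roughly like this (flag name per the AutoNLP beta CLI; YOUR_API_TOKEN stands in for the token from your account settings):

autonlp login --api-key YOUR_API_TOKEN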

Once authenticated, we proceed to create our project. For this example, our project will be called polarity_detection; it will process data in the Spanish language, the task will be binary_classification, and the maximum number of models we want to train is 5. The command looks like the following figure:

Figure 7. Creating a project | Image by author
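Putting those values together, the command looks roughly as follows (flag names per the AutoNLP beta CLI and may have changed in later versions):

autonlp create_project --name polarity_detection --language es --task binary_classification --max_models 5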

When creating our project, the terminal will show us information regarding our created project. The information of our example project is shown in Figure 8.

Figure 8. Project description | Image by author

As we can see, the information shows details such as the identifier of our project (in this example, 128) and attributes such as the name, owner, etc. However, a very important piece of information that is also displayed is the cost. At this stage, the cost still shows USD 0.00 because we have not yet trained any model; this value will change only when the training of our models has finished, which we will see in detail later.

Uploading your data

Once our project is created, we will proceed to upload our datasets. It is recommended to upload the training and validation datasets separately. To upload the datasets, it is enough to provide the name of the project (in our case, polarity_detection), the type of split (that is, train or valid), the mapping of the column names (in our case, tweet and polarity, which are mapped to text and target respectively), and the dataset file. The following figure shows how the commands would look:

Figure 9. Uploading datasets | Image by author
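In text form, the two upload commands look roughly like this (the file names train.csv and valid.csv are placeholders for your own dataset files; flags per the AutoNLP beta CLI):

autonlp upload --project polarity_detection --split train --col_mapping tweet:text,polarity:target --files train.csv
autonlp upload --project polarity_detection --split valid --col_mapping tweet:text,polarity:target --files valid.csv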

As with the creation of the project, when uploading datasets the terminal shows us information relevant to the process; in this case, the information of our datasets is shown in the following figure:

Figure 10. Dataset information | Image by author

Once the datasets are uploaded, we are ready to start training. However, an important aspect to consider is the cost. AutoNLP provides a command to estimate the cost of our project based on the number of samples in our training dataset. To obtain the estimate, we use the command shown in the following figure:

Figure 11. Cost estimation | Image by author
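The estimate command takes the number of training samples and the project name; roughly (NUM_SAMPLES is a placeholder for the size of your training split, and the flag names follow the AutoNLP beta CLI):

autonlp estimate --num_train_samples NUM_SAMPLES --project_name polarity_detection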

As we can see, the cost is provided as a range; for this example, it is estimated to be between 7.5 and 12.5 USD. The final cost is provided once the training ends, which we will see below.

Training

To start with the training of the models, we only need to use the train argument as well as the project name, as shown in the following figure:

Figure 12. Training | Image by author
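In text form, this step is simply (flag name per the AutoNLP beta CLI):

autonlp train --project polarity_detection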

Once the training has been launched, we will be asked whether we agree with the estimated cost (that is, the cost range that we saw in the previous section). After accepting the estimated cost, we will observe the status of each model, as in the following figure:

Figure 13. Training status | Image by author

Depending on the number of models we have launched as well as the characteristics of our dataset, the training time will vary. However, we can monitor the status of our models with the project_info argument, as shown in the following figure:

Figure 14. Project information | Image by author
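The monitoring command looks roughly like this (flag name per the AutoNLP beta CLI):

autonlp project_info --name polarity_detection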

As can be seen in the previous image, each of the models launched has finished successfully (remember that we only launched 5 models). Also, the updated final cost is shown, which is 11.70 USD.

During the training of each model, we can monitor some metrics; for the practical purposes of this example, we only show the metrics obtained at the end of the training of all the models. To visualize the metrics, we use the metrics argument together with the name of the project, as shown below:

Figure 15. Metrics | Image by author
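In text form, roughly (flag name per the AutoNLP beta CLI):

autonlp metrics --project polarity_detection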

The reported metrics are loss, accuracy, precision, recall, AUC, and f1-score for each trained model. In our case, we see that on average the accuracy is 0.85, which could be an acceptable value given the characteristics of our dataset. We can also see that the total cost of our training is shown again.

Inferring

Once our models are trained, they are ready to make predictions. In order to make predictions, we are going to use the predict argument as well as the model identifier, the name of the project and the phrase to be predicted, as shown in the following figure:

Figure 16. Predictions | Image by author
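A prediction command looks roughly like this (MODEL_ID is a placeholder for the identifier of one of the trained models, and the sentence is a made-up example; flags per the AutoNLP beta CLI):

autonlp predict --project polarity_detection --model_id MODEL_ID --sentence "qué gran día"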

As we can see in the previous figure, the first sentence is intended to be positive and, indeed, the model yielded a score of 0.99 for the positive class. In the second example, the sentence is intended to be negative and, again, the model yielded a score of 0.99 for the negative class.

Likewise, AutoNLP allows making predictions through a cURL request and through the Python API, as shown in Figures 17 and 18, respectively.

Figure 17. Predicting by request | Image by author
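AutoNLP models are served through the standard Hugging Face Inference API, so the cURL request looks roughly like this (USERNAME/MODEL_NAME stand in for the actual model path, which you can find on your model page):

curl -X POST \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"inputs": "qué gran día"}' \
  https://api-inference.huggingface.co/models/USERNAME/MODEL_NAME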
Figure 18. Predicting by Python API | Image by author

Conclusion

In this tutorial blog, we saw what AutoNLP is, its components, and how it works.

It is important to mention that once we have used the Hugging Face infrastructure through AutoNLP, we will receive an invoice with the amount shown on the command line.

My experience testing AutoNLP was pleasant. I had some problems when training for the first time; however, the support they provide is efficient.

References

[1] AutoNLP

[2] Hugging Face

[3] Semantic Analysis at SEPLN

