
The Most Common Evaluation Metrics In NLP

Learn These Metrics

Natural Language Processing Notes

Photo by Fleur on Unsplash

Introduction

Whenever we build Machine Learning models, we need some form of metric to measure the goodness of the model. Bear in mind that the "goodness" of the model could have multiple interpretations, but generally, in a Machine Learning context, it refers to a measure of the model’s performance on new instances that weren’t part of the training data.

Determining whether the model being used for a specific task is successful depends on 2 key factors:

  1. Whether the evaluation metric we have selected is the correct one for our problem
  2. If we are following the correct evaluation process

In this article, I will focus only on the first factor – Selecting the correct evaluation metric.

Different Types of Evaluation Metrics

The evaluation metric we decide to use depends on the type of NLP task that we are doing. In addition, the stage the project is at also affects the evaluation metric we use. For instance, during the model building and deployment phases, we’d more often than not use a different evaluation metric than when the model is in production. During building and deployment, ML metrics suffice, but in production we care about business impact, therefore we’d rather use business metrics to measure the goodness of our model.

With that being said, we could categorize evaluation metrics into 2 buckets.

  • Intrinsic Evaluation – Focuses on intermediary objectives (i.e. the performance of an NLP component on a defined subtask)
  • Extrinsic Evaluation – Focuses on the performance of the final objective (i.e. the performance of the component on the complete application)

Stakeholders typically care about extrinsic evaluation since they’d want to know how good the model is at solving the business problem at hand. However, it’s still important to have intrinsic evaluation metrics in order for the AI team to measure how they are doing. We will be focusing more on intrinsic metrics for the remainder of this article.

Defining the Metrics

Some common intrinsic metrics to evaluate NLP systems are as follows:

Accuracy

Whenever the accuracy metric is used, we aim to learn how close the model’s predictions are to the known true values – in other words, the fraction of predictions the model got right. It’s therefore typically used in instances where the output variable is categorical or discrete – namely a classification task.

Precision

In instances where we are concerned with how exact the model’s predictions are, we would use Precision. The precision metric tells us what proportion of the instances that the classifier labeled as positive are actually positive.

Recall

Recall measures how well the model can recall the positive class (i.e. the proportion of actual positive instances that the model correctly identified as positive).

F1 Score

Precision and Recall are complementary metrics with an inverse relationship: improving one typically comes at the cost of the other. If both are of interest to us then we’d use the F1 score – the harmonic mean of precision and recall – to combine them into a single metric.
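
To make these four metrics concrete, here is a minimal sketch using scikit-learn (my choice of library for illustration – any correct implementation of the formulas would do) on a toy binary classification problem:

    # Toy example: accuracy, precision, recall, and F1 with scikit-learn.
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

    print("Accuracy :", accuracy_score(y_true, y_pred))   # fraction of predictions that are correct
    print("Precision:", precision_score(y_true, y_pred))  # of predicted positives, how many are truly positive
    print("Recall   :", recall_score(y_true, y_pred))     # of actual positives, how many were found
    print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall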

To delve deeper into these 4 metrics, have a read of the Confusion Matrix "Un-Confused".


Area Under the Curve (AUC)

The AUC helps us quantify our model’s ability to separate the classes: it captures the rate of correct positive predictions (the true positive rate) against the rate of incorrect positive predictions (the false positive rate) at different classification thresholds, and measures the area under the resulting curve.
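
A minimal sketch, again assuming scikit-learn – note that the score is computed from the model’s predicted probabilities rather than hard class labels:

    # Toy example: AUC-ROC from predicted probabilities of the positive class.
    from sklearn.metrics import roc_auc_score

    y_true = [0, 0, 1, 1, 0, 1]
    y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # predicted probability of the positive class

    print("AUC:", roc_auc_score(y_true, y_scores))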

For a more in-depth reading on this metric check out Comprehension of the AUC-ROC curve.


Mean Reciprocal Rank (MRR)

The Mean Reciprocal Rank (MRR) evaluates the ranked responses retrieved for a query: for each query, we take the reciprocal of the rank at which the first correct response appears, and then average this across all queries. This evaluation metric is commonly used in information retrieval tasks.
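
A hand-rolled sketch of MRR – the relevance lists below are made up purely for illustration (1 marks a correct result, 0 an incorrect one):

    # Mean Reciprocal Rank: average the reciprocal rank of the first correct
    # result across all queries.
    def mean_reciprocal_rank(ranked_relevance):
        reciprocal_ranks = []
        for results in ranked_relevance:
            rr = 0.0
            for rank, is_correct in enumerate(results, start=1):
                if is_correct:
                    rr = 1.0 / rank  # reciprocal rank of the first correct result
                    break
            reciprocal_ranks.append(rr)
        return sum(reciprocal_ranks) / len(reciprocal_ranks)

    # First correct answers at ranks 1, 3, and 2 -> MRR = (1 + 1/3 + 1/2) / 3
    print(mean_reciprocal_rank([[1, 0, 0], [0, 0, 1], [0, 1, 0]]))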

Read more about Mean Reciprocal Rank.

Mean Average Precision (MAP)

Similar to MRR, the Mean Average Precision (MAP) calculates the mean of the average precision scores across all queries, where average precision is computed over each query’s ranked list of retrieved results. It’s also used heavily in information retrieval tasks for ranked retrieval results.
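
A hand-rolled sketch of MAP over the same kind of made-up ranked results:

    # Mean Average Precision: average precision per query, then average across queries.
    def average_precision(relevance):
        hits, precisions = 0, []
        for rank, is_relevant in enumerate(relevance, start=1):
            if is_relevant:
                hits += 1
                precisions.append(hits / rank)  # precision at each relevant result
        return sum(precisions) / hits if hits else 0.0

    def mean_average_precision(all_queries):
        return sum(average_precision(r) for r in all_queries) / len(all_queries)

    # Query 1: relevant results at ranks 1 and 3; Query 2: relevant result at rank 2.
    print(mean_average_precision([[1, 0, 1], [0, 1, 0]]))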

Ren Jie Tan wrote a really good article titled Breaking Down Mean Average Precision (mAP) that I’d recommend reading.

Root Mean Squared Error (RMSE)

When the predicted outcome is a real value, we use the RMSE – the square root of the average squared difference between the predictions and the true values. It’s typically used in conjunction with MAPE – which we will cover next – for regression problems, from tasks such as temperature prediction to stock market price prediction.
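
A minimal sketch using NumPy (scikit-learn’s mean_squared_error would work equally well):

    # RMSE: square root of the mean squared difference between truth and prediction.
    import numpy as np

    y_true = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5, 0.0, 2.0, 8.0])

    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    print("RMSE:", rmse)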

Read more about Root Mean Squared Error.

Mean Absolute Percentage Error (MAPE)

The MAPE is the average absolute percentage error across all data points when the predicted outcome is continuous. Therefore, we use it to evaluate the performance of a regression model.
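
A minimal NumPy sketch – note that MAPE breaks down whenever a true value is zero, since it divides by the true values:

    # MAPE: mean of the absolute percentage errors.
    import numpy as np

    y_true = np.array([100.0, 200.0, 150.0])
    y_pred = np.array([110.0, 180.0, 150.0])

    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    print("MAPE:", mape, "%")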

Read more about MAPE.

Bilingual Evaluation Understudy (BLEU)

The BLEU score evaluates the quality of text that has been translated by a machine from one natural language to another. Therefore, it’s typically used for machine translation tasks; however, it’s also used in other tasks such as text generation, paraphrase generation, and text summarization.
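
A minimal sketch assuming NLTK’s implementation of sentence-level BLEU (the tokenized sentences are made up for illustration):

    # BLEU expects tokenized text: a list of reference token lists and a
    # single candidate (hypothesis) token list.
    from nltk.translate.bleu_score import sentence_bleu

    reference = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]
    candidate = ["the", "quick", "brown", "fox", "jumps", "over", "the", "dog"]

    print("BLEU:", sentence_bleu(reference, candidate))  # 1.0 only for a perfect match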

Jason Brownlee (Machine Learning Mastery) wrote a great article about the BLEU score titled A Gentle Introduction to Calculating the BLEU Score for Text in Python.

METEOR

The Metric for Evaluation of Translation with Explicit ORdering (METEOR) evaluates machine-translation output using the harmonic mean of unigram precision and recall. It overcomes some of the pitfalls of the BLEU score, such as its reliance on exact word matching – METEOR allows synonyms and stemmed words to be matched with a reference word. This metric is typically used for machine translation tasks.
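
A minimal sketch, again assuming NLTK – recent NLTK versions expect pre-tokenized input and require the WordNet data (nltk.download("wordnet")) to be available:

    # METEOR matches unigrams exactly, by stem, and by synonym against the reference.
    from nltk.translate.meteor_score import meteor_score

    reference = ["the", "cat", "sat", "on", "the", "mat"]
    hypothesis = ["the", "cat", "is", "sitting", "on", "the", "mat"]

    print("METEOR:", meteor_score([reference], hypothesis))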

Read more about METEOR.

ROUGE

As opposed to the BLEU score, the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric measures recall. It’s typically used for evaluating the quality of generated text and in machine translation tasks – however, since it measures recall, it’s mainly used in summarization tasks, where it’s more important to evaluate how many of the reference words the model can recall.
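
A minimal sketch assuming the third-party rouge_score package (one of several ROUGE implementations):

    # ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence)
    # between a reference summary and a generated summary.
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    scores = scorer.score("the cat sat on the mat", "the cat is on the mat")

    print(scores)  # each entry holds precision, recall, and F-measure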

Read more about ROUGE.

Perplexity

Sometimes our NLP models get confused. Perplexity is a probabilistic measure that tells us exactly how confused our model is. It’s typically used to evaluate language models, but it can also be used in dialog generation tasks.
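
A hand-rolled sketch: perplexity is the exponential of the average negative log-likelihood the model assigns to the tokens, so with some made-up per-token probabilities it looks like this:

    # Perplexity from hypothetical per-token probabilities produced by a language model.
    import numpy as np

    token_probs = [0.2, 0.1, 0.4, 0.25, 0.05]  # made-up values for illustration

    avg_neg_log_likelihood = -np.mean(np.log(token_probs))
    perplexity = np.exp(avg_neg_log_likelihood)
    print("Perplexity:", perplexity)  # lower is better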

Chiara Campagnola wrote a really good write-up about the perplexity evaluation metric that I’d recommend reading. It is titled Perplexity in Language Models.

Final Thoughts

In this article, I provided a number of common evaluation metrics used in Natural Language Processing tasks. This is in no way an exhaustive list of metrics as there are a few more metrics and visualizations that are used when solving NLP tasks. If you want me to delve deeper into any one of the metrics, please leave a comment and I will get to work on making that happen.

Thank you for reading! Connect with me on LinkedIn and Twitter to stay up to date with my posts about Data Science, Artificial Intelligence, and Freelancing.

Related Articles

Deep Learning May Not Be The Silver Bullet for All NLP Tasks Just Yet

Never Forget These 8 NLP Terms

5 Ideas For Your Next NLP Project

