
The Most Common Evaluation Metrics In NLP

Learn These Metrics

Natural Language Processing Notes

Photo by Fleur on Unsplash

Introduction

Whenever we build Machine Learning models, we need some form of metric to measure the goodness of the model. Bear in mind that the "goodness" of the model could have multiple interpretations, but generally, in a Machine Learning context, it refers to a measure of the model’s performance on new instances that weren’t part of the training data.

Determining whether the model being used for a specific task is successful depends on 2 key factors:

  1. Whether the evaluation metric we have selected is the correct one for our problem
  2. If we are following the correct evaluation process

In this article, I will focus only on the first factor – Selecting the correct evaluation metric.

Different Types of Evaluation Metrics

The evaluation metric we decide to use depends on the type of NLP task that we are doing. In addition, the stage the project is at also affects the evaluation metric we use. For instance, during the model building and deployment phases, we’d more often than not use a different evaluation metric than when the model is in production. During building and deployment, ML metrics suffice, but in production we care about business impact, therefore we’d rather use business metrics to measure the goodness of our model.

With that being said, we could categorize evaluation metrics into 2 buckets.

  • Intrinsic Evaluation – Focuses on intermediary objectives (i.e. the performance of an NLP component on a defined subtask)
  • Extrinsic Evaluation – Focuses on the performance of the final objective (i.e. the performance of the component on the complete application)

Stakeholders typically care about extrinsic evaluation since they’d want to know how good the model is at solving the business problem at hand. However, it’s still important to have intrinsic evaluation metrics in order for the AI team to measure how they are doing. We will be focusing more on intrinsic metrics for the remainder of this article.

Defining the Metrics

Some common intrinsic metrics to evaluate NLP systems are as follows:

Accuracy

Whenever the accuracy metric is used, we aim to learn how close the model’s predictions are to the known true values – in other words, the fraction of predictions the model got right. It’s therefore typically used in instances where the output variable is categorical or discrete – namely a classification task.

Precision

In instances where we are concerned with how exact the model’s predictions are, we would use Precision. The precision metric tells us what proportion of the instances that the classifier labeled as positive are actually positive.

Recall

Recall measures how well the model can recall the positive class (i.e. the proportion of actual positive instances that the model correctly identified as positive).

F1 Score

Precision and Recall are complementary metrics with an inverse relationship: improving one typically comes at the cost of the other. If both are of interest to us then we’d use the F1 score – the harmonic mean of precision and recall – to combine them into a single metric.
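
To make these four metrics concrete, here is a minimal sketch using scikit-learn (my choice of library for illustration – any correct implementation of the formulas would do) on a toy binary classification problem:

    # Toy example: accuracy, precision, recall, and F1 with scikit-learn.
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

    print("Accuracy :", accuracy_score(y_true, y_pred))   # fraction of predictions that are correct
    print("Precision:", precision_score(y_true, y_pred))  # of predicted positives, how many are truly positive
    print("Recall   :", recall_score(y_true, y_pred))     # of actual positives, how many were found
    print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall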

To delve deeper into these 4 metrics, have a read of the Confusion Matrix "Un-Confused".


Area Under the Curve (AUC)

The AUC helps us quantify our model’s ability to separate the classes: it captures the rate of correct positive predictions (the true positive rate) against the rate of incorrect positive predictions (the false positive rate) at different classification thresholds, and measures the area under the resulting curve.
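
A minimal sketch, again assuming scikit-learn – note that the score is computed from the model’s predicted probabilities rather than hard class labels:

    # Toy example: AUC-ROC from predicted probabilities of the positive class.
    from sklearn.metrics import roc_auc_score

    y_true = [0, 0, 1, 1, 0, 1]
    y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # predicted probability of the positive class

    print("AUC:", roc_auc_score(y_true, y_scores))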

For a more in-depth reading on this metric check out Comprehension of the AUC-ROC curve.


Mean Reciprocal Rank (MRR)

The Mean Reciprocal Rank (MRR) evaluates the ranked responses retrieved for a query: for each query, we take the reciprocal of the rank at which the first correct response appears, and then average this across all queries. This evaluation metric is commonly used in information retrieval tasks.
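
A hand-rolled sketch of MRR – the relevance lists below are made up purely for illustration (1 marks a correct result, 0 an incorrect one):

    # Mean Reciprocal Rank: average the reciprocal rank of the first correct
    # result across all queries.
    def mean_reciprocal_rank(ranked_relevance):
        reciprocal_ranks = []
        for results in ranked_relevance:
            rr = 0.0
            for rank, is_correct in enumerate(results, start=1):
                if is_correct:
                    rr = 1.0 / rank  # reciprocal rank of the first correct result
                    break
            reciprocal_ranks.append(rr)
        return sum(reciprocal_ranks) / len(reciprocal_ranks)

    # First correct answers at ranks 1, 3, and 2 -> MRR = (1 + 1/3 + 1/2) / 3
    print(mean_reciprocal_rank([[1, 0, 0], [0, 0, 1], [0, 1, 0]]))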

Read more about Mean Reciprocal Rank.

Mean Average Precision (MAP)

Similar to MRR, the Mean Average Precision (MAP) calculates the mean of the average precision scores across all queries, where average precision is computed over each query’s ranked list of retrieved results. It’s also used heavily in information retrieval tasks for ranked retrieval results.
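
A hand-rolled sketch of MAP over the same kind of made-up ranked results:

    # Mean Average Precision: average precision per query, then average across queries.
    def average_precision(relevance):
        hits, precisions = 0, []
        for rank, is_relevant in enumerate(relevance, start=1):
            if is_relevant:
                hits += 1
                precisions.append(hits / rank)  # precision at each relevant result
        return sum(precisions) / hits if hits else 0.0

    def mean_average_precision(all_queries):
        return sum(average_precision(r) for r in all_queries) / len(all_queries)

    # Query 1: relevant results at ranks 1 and 3; Query 2: relevant result at rank 2.
    print(mean_average_precision([[1, 0, 1], [0, 1, 0]]))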

Ren Jie Tan wrote a really good article titled Breaking Down Mean Average Precision (mAP) that I’d recommend reading.

Root Mean Squared Error (RMSE)

When the predicted outcome is a real value, we use the RMSE – the square root of the average squared difference between the predictions and the true values. It’s typically used in conjunction with MAPE – which we will cover next – for regression problems, from tasks such as temperature prediction to stock market price prediction.
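
A minimal sketch using NumPy (scikit-learn’s mean_squared_error would work equally well):

    # RMSE: square root of the mean squared difference between truth and prediction.
    import numpy as np

    y_true = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5, 0.0, 2.0, 8.0])

    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    print("RMSE:", rmse)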

Read more about Root Mean Squared Error.

Mean Absolute Percentage Error (MAPE)

The MAPE is the average absolute percentage error across all data points when the predicted outcome is continuous. Therefore, we use it to evaluate the performance of a regression model.
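
A minimal NumPy sketch – note that MAPE breaks down whenever a true value is zero, since it divides by the true values:

    # MAPE: mean of the absolute percentage errors.
    import numpy as np

    y_true = np.array([100.0, 200.0, 150.0])
    y_pred = np.array([110.0, 180.0, 150.0])

    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    print("MAPE:", mape, "%")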

Read more about MAPE.

Bilingual Evaluation Understudy (BLEU)

The BLEU score evaluates the quality of text that has been translated by a machine from one natural language to another. Therefore, it’s typically used for machine translation tasks; however, it’s also used in other tasks such as text generation, paraphrase generation, and text summarization.
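
A minimal sketch assuming NLTK’s implementation of sentence-level BLEU (the tokenized sentences are made up for illustration):

    # BLEU expects tokenized text: a list of reference token lists and a
    # single candidate (hypothesis) token list.
    from nltk.translate.bleu_score import sentence_bleu

    reference = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]
    candidate = ["the", "quick", "brown", "fox", "jumps", "over", "the", "dog"]

    print("BLEU:", sentence_bleu(reference, candidate))  # 1.0 only for a perfect match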

Jason Brownlee (Machine Learning Mastery) wrote a great article about the BLEU score titled A Gentle Introduction to Calculating the BLEU Score for Text in Python.

METEOR

The Metric for Evaluation of Translation with Explicit ORdering (METEOR) evaluates machine-translation output using the harmonic mean of unigram precision and recall. It overcomes some of the pitfalls of the BLEU score, such as its reliance on exact word matching – METEOR allows synonyms and stemmed words to be matched with a reference word. This metric is typically used for machine translation tasks.
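
A minimal sketch, again assuming NLTK – recent NLTK versions expect pre-tokenized input and require the WordNet data (nltk.download("wordnet")) to be available:

    # METEOR matches unigrams exactly, by stem, and by synonym against the reference.
    from nltk.translate.meteor_score import meteor_score

    reference = ["the", "cat", "sat", "on", "the", "mat"]
    hypothesis = ["the", "cat", "is", "sitting", "on", "the", "mat"]

    print("METEOR:", meteor_score([reference], hypothesis))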

Read more about METEOR.

ROUGE

As opposed to the BLEU score, the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric measures recall. It’s typically used for evaluating the quality of generated text and in machine translation tasks – however, since it measures recall, it’s mainly used in summarization tasks, where it’s more important to evaluate how many of the reference words the model can recall.
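
A minimal sketch assuming the third-party rouge_score package (one of several ROUGE implementations):

    # ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence)
    # between a reference summary and a generated summary.
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    scores = scorer.score("the cat sat on the mat", "the cat is on the mat")

    print(scores)  # each entry holds precision, recall, and F-measure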

Read more about ROUGE.

Perplexity

Sometimes our NLP models get confused. Perplexity is a probabilistic measure that tells us exactly how confused our model is. It’s typically used to evaluate language models, but it can also be used in dialog generation tasks.
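
A hand-rolled sketch: perplexity is the exponential of the average negative log-likelihood the model assigns to the tokens, so with some made-up per-token probabilities it looks like this:

    # Perplexity from hypothetical per-token probabilities produced by a language model.
    import numpy as np

    token_probs = [0.2, 0.1, 0.4, 0.25, 0.05]  # made-up values for illustration

    avg_neg_log_likelihood = -np.mean(np.log(token_probs))
    perplexity = np.exp(avg_neg_log_likelihood)
    print("Perplexity:", perplexity)  # lower is better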

Chiara Campagnola wrote a really good write-up about the perplexity evaluation metric that I’d recommend reading. It is titled Perplexity in Language Models.

Final Thoughts

In this article, I provided a number of common evaluation metrics used in Natural Language Processing tasks. This is in no way an exhaustive list of metrics as there are a few more metrics and visualizations that are used when solving NLP tasks. If you want me to delve deeper into any one of the metrics, please leave a comment and I will get to work on making that happen.

Thank you for reading! Connect with me on LinkedIn and Twitter to stay up to date with my posts about Data Science, Artificial Intelligence, and Freelancing.

Related Articles

Deep Learning May Not Be The Silver Bullet for All NLP Tasks Just Yet

Never Forget These 8 NLP Terms

5 Ideas For Your Next NLP Project

