
Evaluation Metrics for Machine Learning

Certain metrics measure model performance better than others.

Photo by Isaac Smith on Unsplash

There is a plethora of metrics for measuring the performance of machine learning models, and it is beneficial to know which evaluation metric will properly capture your model's performance. Certain metrics measure model performance better than others, depending on the use case. We will go over some of the common evaluation metrics for both regression and classification models.

Regression

Regression is used to predict a continuous output while minimizing error. For example, you could use a regression model to predict something like sales or revenue. Two of the fundamental evaluation metrics for regression models are Mean Absolute Error and Mean Squared Error.

Note: A trick to calculating regression metrics is to work backwards. Let’s use Mean Absolute Error (MAE) as an example. First calculate the Error, then take the Absolute value, and finally compute the Mean (a.k.a. the average). We essentially calculated MAE backwards, as "EAM".

MAE (Mean Absolute Error)

Mean Absolute Error calculates the average absolute difference between the actual values and the values predicted by your model. MAE is good to use as a baseline, since it looks directly at the absolute error. The problem with using MAE is that the metric is relative to the scale of your values and residuals. For example, what is considered a good or bad MAE? Obviously you’d like the MAE to be close to 0, but when your errors are large values it becomes more difficult to evaluate your model using MAE.

Note: The absolute value also underlies the L1 norm, which is used as the regularization penalty in Lasso Regression.

Formula to calculate MAE:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
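As a quick illustration, here is a minimal sketch of MAE in plain NumPy, following the "work backwards" trick from the note above. The arrays are made-up example values, not data from the article:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # hypothetical model predictions

# Work backwards: Error -> Absolute value -> Mean
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 0.75
```

scikit-learn's mean_absolute_error(y_true, y_pred) returns the same value.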

MSE (Mean Squared Error)

Mean Squared Error calculates the error, squares each difference, and then takes the mean or average. MSE is another good metric to use as a baseline, since it is a fundamental evaluation metric like MAE. However, MSE inflates errors since each value is squared, which again makes evaluating your model difficult.

Note: The squared values also underlie the L2 norm, which is used as the regularization penalty in Ridge Regression.

Formula to calculate MSE:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
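A minimal sketch with the same made-up example values as before, squaring each error before averaging:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Error -> Squared -> Mean
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.875
```

scikit-learn's mean_squared_error(y_true, y_pred) gives the same result.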

RMSE (Root Mean Squared Error)

Root Mean Squared Error is just like MSE, but takes the square root of the output. RMSE is another fundamental evaluation metric based on squaring the residuals, and it penalizes larger errors more.

Formula to calculate RMSE:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
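Since RMSE is just the square root of MSE, a minimal sketch (again with made-up example values) is:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# RMSE is the square root of MSE
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)  # ~0.935
```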

MAPE (Mean Absolute Percentage Error)

Mean Absolute Percentage Error attempts to solve the issue with MAE, in which MAE becomes relative to the scale of your residuals. MAPE transforms the errors into percentages, and you want your MAPE to be as close to 0 as possible. There are other metrics using absolute values and percentages as well (e.g. APE, Weighted MAPE, Symmetric MAPE, etc.).

Formula to calculate MAPE:

$$\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
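A minimal NumPy sketch, using made-up sales-like values so the percentages are easy to read (note that MAPE is undefined whenever an actual value is zero):

```python
import numpy as np

y_true = np.array([100.0, 250.0, 80.0, 400.0])  # hypothetical actual values
y_pred = np.array([110.0, 240.0, 100.0, 380.0]) # hypothetical predictions

# Error -> divide by actual value -> Absolute value -> Mean, expressed as a percentage
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(mape)  # 11.0
```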

MASE (Mean Absolute Scaled Error)

Mean Absolute Scaled Error is a metric that allows you to compare two models. Using the MAE for each model, put the MAE of the new model in the numerator and the MAE of the original model in the denominator. If the MASE value is less than 1, the new model performs better. If the MASE value equals 1, the models perform the same. If the MASE value is greater than 1, the original model performs better than the new model.

Formula to calculate MASE:

$$\text{MASE} = \frac{\text{MAE}_{\text{new model}}}{\text{MAE}_{\text{original model}}}$$
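Following the comparison described above, here is a minimal sketch; the helper function and the example predictions for the "new" and "original" models are hypothetical:

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

y_true = [3.0, 5.0, 2.5, 7.0]
pred_new = [2.8, 5.1, 2.9, 7.3]       # predictions from the new model
pred_original = [2.5, 5.0, 4.0, 8.0]  # predictions from the original model

# MASE < 1 means the new model has a lower MAE than the original model
mase = mae(y_true, pred_new) / mae(y_true, pred_original)
print(mase)  # ~0.33 -> the new model performs better here
```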

Classification

Classification is used to predict discrete outputs. A popular example is the concept of "Hotdog/Not Hotdog". Classifiers aim to maximize likelihood in order to determine how samples should be classified.

Accuracy

Accuracy is one of the most fundamental metrics, but it can often be misleading. The reason is imbalanced classes. If you have a classifier that reaches 98% accuracy, but 98% of your data belongs to the same label, then your model isn’t necessarily very good, because it may have simply labeled everything as that one class.

Example: You have 98 "Hot Dogs" and 2 "Not Hot Dogs". Your model labels everything as a "Hot Dog" no matter what it is and still gets 98% accuracy, even though it completely ignores the other class.

Therefore, you should add methods to balance out your data (e.g. up/down sampling, synthetic data, etc.).

$$\text{Accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n}$$
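A minimal sketch of the imbalanced "Hot Dog" example above, using made-up labels where 1 = "Hot Dog" and 0 = "Not Hot Dog":

```python
from sklearn.metrics import accuracy_score

y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # model predicts "Hot Dog" for everything

# High accuracy despite the model ignoring the minority class entirely
print(accuracy_score(y_true, y_pred))  # 0.8
```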

Precision

Precision evaluates how precise your model was at making predictions, i.e. how many of the samples it flagged as positive were actually positive. This is a good metric to pay attention to if you want your model to be conservative in its flagging of data. An example of using precision wisely would be when there is a high cost for each "positive" label that comes out of your model.

$$\text{Precision} = \frac{T_p}{T_p + F_p}$$
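A minimal sketch with made-up binary labels; here the model makes 5 positive predictions, of which 3 are correct:

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]

# TP = 3, FP = 2 -> precision = 3 / (3 + 2)
print(precision_score(y_true, y_pred))  # 0.6
```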

Recall

Recall evaluates the sensitivity of your model. Basically, it checks how successful your model was at flagging the relevant samples, i.e. how many of the actual positives it caught.

$$\text{Recall} = \frac{T_p}{T_p + F_n}$$
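Using the same made-up labels as in the precision sketch, where the data contains 4 actual positives and the model catches 3 of them:

```python
from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]

# TP = 3, FN = 1 -> recall = 3 / (3 + 1)
print(recall_score(y_true, y_pred))  # 0.75
```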

F-Score

The F-Score, also referred to as the F-Measure, is the harmonic mean of precision and recall. The F-Score is a great measure if you want to find a balance between precision and recall for your model, and should be used to find a generally good model.

Note: In the formulas above and below, the subscript "p" signifies predictions classified as "positive" and "n" signifies predictions classified as "negative".

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,T_p}{2\,T_p + F_p + F_n}$$
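Continuing the same made-up labels, the F-Score combines the precision (0.6) and recall (0.75) computed above:

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]

# Harmonic mean of precision (0.6) and recall (0.75)
print(f1_score(y_true, y_pred))  # ~0.667
```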

What are some tools I can use to help generate evaluation metrics?

In general, scikit-learn has great evaluation metric functions. For classification metrics in particular, the classification_report function handles a lot for you and shows model performance at a quick glance, specifically listing out Precision, Recall, and F-Score for each class.
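A minimal sketch of classification_report on the same made-up labels; the class names passed via target_names are illustrative:

```python
from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]

# Prints precision, recall, F-score, and support for each class
print(classification_report(y_true, y_pred, target_names=["Not Hot Dog", "Hot Dog"]))
```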

Summary

Choose an evaluation metric depending on your use case. Different metrics work better for different purposes. Selecting the appropriate metric also allows you to be more confident in your model when presenting your data and findings to others.

On the flip side, using the wrong evaluation metric can be detrimental to a machine learning use case. A common example is focusing on accuracy with an imbalanced dataset.

Need a place to start and want more experience with Python? Check out scikit-learn's evaluation metric functions.
