Differences between Bias and Variance in Machine Learning

Suhas Maddali
Towards Data Science
7 min read · Sep 13, 2022


Machine learning and data science have gained a lot of traction in the past decade. We see numerous applications in self-driving cars, spam filtering, defect detection in manufacturing, and face recognition. Companies that succeed in implementing machine learning and artificial intelligence stand to gain considerable value. However, there are times when practitioners do not fully understand how their models behave, which often leads to confusion and frustration. The two concepts we are going to discuss are bias and variance. These topics are covered in a large number of online courses, but it is worth laying out the differences between them and the right steps to take to overcome each. Without further ado, let's get started and understand these topics in greater detail.

What is Bias?


When we use machine learning models, the goal is to train them on our data so that they produce the expected results on data they have not seen before. To do this, we continuously train our models and update their weights and parameters until we get the intended results. While doing so, there can be instances where we have not fully trained the models and they are too simple for the prediction task at hand. In other words, we are using simple models rather than tuning their parameters or switching to models that reflect the complexity in the dataset. In such a case, we say that the model has high bias. A model suffering from high bias is not able to make good sense of the dataset.
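
To make this concrete, here is a minimal sketch (using scikit-learn and a purely illustrative synthetic dataset) of what high bias looks like: a straight-line model fitted to data that actually follows a curve scores poorly even on its own training data.

```python
# High bias in miniature: a linear model fitted to quadratic data.
# The synthetic dataset here is purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)  # true relation is quadratic

model = LinearRegression().fit(X, y)
# A high-bias model scores poorly even on the data it was trained on.
print("R^2 on training data:", model.score(X, y))
```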

How to Overcome High Bias?

Having high bias is a problem in machine learning because we are not fully exploiting the potential of these algorithms. Let us therefore take a look at various ways in which bias can be reduced to a large extent.

More Complex Model — The primary cause of high bias is that the model is not complex enough to capture the intricacy in the dataset and the relationship between the input and the output. The best way to overcome high bias is therefore to move to a more complex model for prediction, as sketched below.
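
As a minimal sketch of this remedy, assuming the same kind of synthetic quadratic data as in the earlier snippet, expanding the inputs with polynomial features gives a linear learner enough complexity to capture the curvature:

```python
# Adding complexity with polynomial features (scikit-learn sketch).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("linear R^2:", linear.score(X, y))   # low: high bias
print("poly   R^2:", poly.score(X, y))     # much higher: bias reduced
```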

Cross-validation Data — The second step is to use cross-validation data, which helps identify the high bias issue in the first place. When we divide the data into training and cross-validation sets before testing our models, we can tune the hyperparameters and alter them until the models perform well on the cross-validation data. A sketch of this diagnostic follows below.
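
One way to sketch this diagnostic, again with scikit-learn and illustrative synthetic data: if both the training and cross-validation scores are low, the model is underfitting (high bias); a high training score paired with a low validation score points to high variance instead.

```python
# Using cross-validation scores to distinguish bias from variance.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

scores = cross_validate(LinearRegression(), X, y, cv=5, return_train_score=True)
# Both low -> high bias. High train score, low validation score -> high variance.
print("mean train R^2:", scores["train_score"].mean())
print("mean valid R^2:", scores["test_score"].mean())
```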

Finding the Right Data — Another handy approach is to make sure we are using the right data for the models. Even when our models are complex enough to understand the intricacies in the data, they can still fail simply because the data contains little or no relationship between the input and the output. Hence, you might also want to check how related the input is to the output before concluding that the models have high bias. A quick check is sketched below.
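
As a quick sketch of such a check (the feature names and data here are hypothetical), a simple correlation matrix reveals whether any feature carries a linear signal about the target at all. Note that correlation only captures linear relationships, so a low value is a hint rather than proof.

```python
# Checking how related the inputs are to the output with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=500),
    "feature_b": rng.normal(size=500),  # pure noise, unrelated to the target
})
df["target"] = 2.0 * df["feature_a"] + rng.normal(size=500)

# feature_a shows a strong correlation; feature_b shows almost none.
print(df.corr()["target"].drop("target"))
```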

What is Variance?


There can be instances where a model does exceptionally well on the training data, but when we judge its performance on data it has not seen before, it fails or at least does not perform as expected on the test data. In this scenario, we say that the model has overfit and has high variance. In other words, the model has simply memorized the training data without being able to generalize to data it has not seen before. This is quite similar to the day-to-day situation of a student preparing for an examination. If the student learns only from the textbook instead of exploring alternate sources, it is highly likely that the student will not be able to relate well to new examples or information that is not actually present in the textbook. The student has put too much emphasis on the textbook without developing a general picture of the topics. This is what is known as overfitting, to put things in layman's terms.

Overfitting is a problem in machine learning because the practitioner is left with an inflated picture of the ML model's performance and assumes that the model will perform similarly on data it has not seen before. This is far from true in most cases. Since we optimize our algorithms against the training data, it is no surprise that they perform well on that data. When we feed new data (test data) into the model, however, there can be a large discrepancy between its performance on the training data and on the test data. Therefore, a practitioner must take the time to find out whether the model is overfitting, for example with a comparison like the one below. The following sections cover some ways to reduce overfitting to a large extent.
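
A minimal sketch of that comparison, assuming scikit-learn and illustrative synthetic data: an unconstrained decision tree memorizes the training set almost perfectly but does noticeably worse on held-out data.

```python
# High variance in miniature: a tree with no depth limit overfits.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print("train R^2:", tree.score(X_train, y_train))  # close to 1.0
print("test  R^2:", tree.score(X_test, y_test))    # noticeably lower
```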

How to Overcome High Variance?

There can be various steps that help reduce overfitting (high variance) for our ML models. Let us now explore each of these ways below.

Regularization — This is a technique that penalizes ML models when they are highly complex and their weights capture a large number of spurious trends from the training data. Regularization ensures that the learned weights are shrunk so that no single weight plays an outsized role in determining the output from the input. In other words, it causes the models to generalize well without focusing too much on the training data, so that they also give good results on data they have not seen before. A sketch follows below.
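
As a minimal sketch with scikit-learn's Ridge (L2 regularization), where the synthetic data is illustrative: the alpha parameter controls how strongly large weights are penalized, and the regularized fit ends up with visibly smaller weights.

```python
# L2 regularization shrinks the learned weights.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 20))              # few samples, many features
true_w = rng.normal(size=20)
y = X @ true_w + rng.normal(0, 1.0, size=50)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)        # alpha: regularization strength
print("unregularized weight norm:", np.linalg.norm(plain.coef_))
print("ridge weight norm:        ", np.linalg.norm(ridge.coef_))  # smaller
```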

Ensembles — When we rely too much on just one ML model without considering a group, there is a chance that it overfits the data. Taking the combined output from a group of models (an ensemble) ensures that different possible outputs are considered before deciding on the final outcome, which typically reduces variance, as sketched below.
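
A sketch of this effect, again with scikit-learn and illustrative data: averaging a couple of hundred randomized trees (a random forest) usually generalizes better than the single unconstrained tree from the earlier example.

```python
# An ensemble of trees versus a single overfit tree.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("single tree test R^2:", tree.score(X_test, y_test))
print("forest test R^2:     ", forest.score(X_test, y_test))  # usually higher
```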

Reducing Features — Having a large number of features (high dimensionality) in our data can also cause the models to learn too much from the training data. To make things worse, features that are readily available during training but not during the testing phase can make the outcomes less accurate on unseen data. Therefore, reducing the features so that the models can generalize well on the test data can be a handy approach; one way to do this is sketched below.
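
One possible sketch of feature reduction, using scikit-learn's univariate SelectKBest on hypothetical data where only two of fifty features actually matter:

```python
# Keeping only the features most related to the target.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))             # 50 features, mostly noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)
X_reduced = selector.transform(X)          # shape (200, 5)
print("kept feature indices:", selector.get_support(indices=True))
```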

Adding Training Data — If we have very little training data, a flexible model can simply memorize it rather than learn the underlying relationship between the input and the output. Adding more data makes memorization much harder: the model has to work to find the right relationship before making predictions on unseen data, which typically narrows the gap between training and test performance, as the learning-curve sketch below illustrates.
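
A sketch of this effect with scikit-learn's learning_curve on illustrative data: as the training set grows, the gap between training and validation scores, the signature of high variance, tends to shrink.

```python
# How the train/validation gap changes as training data is added.
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=1000)

sizes, train_scores, valid_scores = learning_curve(
    DecisionTreeRegressor(max_depth=6, random_state=0), X, y,
    train_sizes=[0.1, 0.3, 0.6, 1.0], cv=5)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"n={n:4d}  train R^2={tr:.2f}  valid R^2={va:.2f}")
```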

Performing Cross-validation — This is the process of dividing the data into two parts, namely the training data and the cross-validation data, before performance is tested on real-world data. The ML models are first trained on the training data, and their performance is then observed on the cross-validation data while the hyperparameters are tuned. Once there is no longer a large difference between the model's performance on the training and cross-validation data, the practitioner can halt training and use the tuned model to gauge performance on real-world data, reducing the problem of high variance. A sketch of this workflow follows below.
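
A sketch of this workflow with scikit-learn's GridSearchCV on illustrative data: hyperparameters are tuned against cross-validation folds, and only the chosen model is evaluated on the held-out test set.

```python
# Tuning on cross-validation folds, then checking held-out performance.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={"max_depth": [2, 4, 6, 8, None]}, cv=5)
search.fit(X_train, y_train)               # tunes max_depth on CV folds
print("best params:", search.best_params_)
print("test R^2:   ", search.score(X_test, y_test))
```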

Use Early Stopping — Some of the most popular applications of artificial intelligence are in deep learning, where networks offer a huge number of customizations along with popular mechanisms such as transfer learning. These networks can become complex enough to learn the training data too well without being able to generalize on the test data. Using early stopping during the training phase, which halts training once performance on a validation set stops improving, can therefore be handy and leads to a reduction in high variance. A sketch follows below.
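
A minimal sketch using scikit-learn's MLPRegressor, where early stopping is a built-in option (in Keras or PyTorch the same idea is implemented with a callback or a manual check on validation loss); the data is illustrative:

```python
# Early stopping: halt training when validation performance stalls.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=1000)

mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                   early_stopping=True,      # hold out part of the training data
                   validation_fraction=0.1,  # size of that internal split
                   n_iter_no_change=10,      # patience before stopping
                   random_state=0)
mlp.fit(X, y)
print("training stopped after", mlp.n_iter_, "iterations")
```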

Conclusion

I hope you found this article helpful and actionable for your ML-related tasks. Understanding the distinction between high bias and high variance is essential for diagnosing a model and applying the right solution. Furthermore, overcoming these challenges has a positive impact on the overall data science workflow.

If you would like to get more updates about my latest articles and have unlimited access to Medium articles for just 5 dollars per month, feel free to use the link below to support my work. Thanks.

https://suhas-maddali007.medium.com/membership

Below are the ways you can contact me or take a look at my work.

GitHub: https://github.com/suhasmaddali

YouTube: https://www.youtube.com/channel/UCymdyoyJBC_i7QVfbrIs-4Q

LinkedIn: Suhas Maddali, Northeastern University, Data Science

Medium: Suhas Maddali
