Suppose you have identified a machine learning model and the corresponding hyperparameters that give you the best performance, but the accuracy of the model is still below the baseline/expected accuracy. Is that the end of the road, or can you improve it further?
In this blog, I will take you through a 3-step process for improving your model performance beyond hyperparameter tuning. Let’s get started!

Step 1: Error Analysis
The objective of error analysis is to identify the room for improvement, using the following approach:
1. Identify where the model went wrong
The overall accuracy number is just a summary; it does not point to the specific areas where the model went wrong. So, it is important to identify those areas so that we can devise strategies for improvement.
Example: Pneumonia detection using X-ray images.

Imagine we are building a model for detecting Pneumonia using chest X-rays and we want to improve the accuracy of our model.
Analyze Samples
We can pick 200 samples where the model prediction went wrong and try to understand the reasons for the wrong predictions. Analyzing these cases, we might find that some images are blurry, some have dark/light backgrounds, some have reflections, etc. We can use these categories as tags and create a summary table (shown below) to capture the reasons for the wrong predictions.

Summarize the analysis
Now, we can create a summary table to find the distribution of errors across the tags (Error Count / Total Errors) and the error rate within each tag (Error Count / Total images with that tag).

To find ‘Total images with that tag’, you can either train a model to identify the tags, or pick a sample, find the distribution of tags within it, and scale that to the population.
Looking at the table above, we find that most of the errors are concentrated in the ‘Dark Background’ tag (40% of all errors), and the error rate is also highest within this tag (16%). Does that mean we should focus on this tag to improve our model performance?
Not necessarily; we first need to compare these error rates with the baseline performance.
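The two ratios above can be computed directly from the tag counts. Here is a minimal Python sketch; the counts are hypothetical, chosen so that ‘Dark Background’ reproduces the 40% error share and 16% error rate discussed above:

```python
# Hypothetical counts from an error analysis of 200 misclassified X-rays.
error_counts = {           # misclassified images carrying each tag
    "Dark Background": 80,
    "Blurry": 50,
    "Reflection": 40,
    "Light Background": 30,
}
images_with_tag = {        # total images (right or wrong) carrying each tag
    "Dark Background": 500,
    "Blurry": 600,
    "Reflection": 450,
    "Light Background": 400,
}

total_errors = sum(error_counts.values())
for tag, errors in error_counts.items():
    share = errors / total_errors           # Error Count / Total Errors
    rate = errors / images_with_tag[tag]    # Error Count / Total images with that tag
    print(f"{tag:16s} share={share:.0%} rate={rate:.0%}")
```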
2. How far away from the baseline?
To find where we should concentrate our efforts, we need to compare the model’s error rates with the baseline error rates.
Now, a baseline for you could be one of the following:
- Human-level performance for that task.
- State-of-the-art/open-source performance for your use case.
- A previously deployed model that you are planning to replace.
In the table below, we are adding the human-level performance (or error rates) as a baseline.

We will then compute the gap/difference between our model’s error rate and the human-level error rate for each reason tag.

3. Prioritize what to work on
To prioritize what to work on, we need to add one more dimension: the percentage of data carrying each tag.
Combining (multiplying) the ‘gap’ with the ‘percentage of data with that tag’ gives the potential uplift in overall accuracy if we reach human-level accuracy on that tag.

‘Dark Background’ and ‘Reflection’ have the maximum potential for improving the overall accuracy, so we can prioritize one of these based on ease of implementation.
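Putting the pieces together, the potential uplift per tag is simply gap × prevalence. A short sketch with made-up numbers (the error rates, baselines, and data fractions below are hypothetical, not from a real study):

```python
# tag: (model_error_rate, human_error_rate, fraction_of_data_with_tag)
tags = {
    "Dark Background": (0.16, 0.04, 0.25),
    "Blurry":          (0.08, 0.06, 0.30),
    "Reflection":      (0.12, 0.02, 0.20),
}

uplift = {}
for tag, (model_err, human_err, pct) in tags.items():
    gap = model_err - human_err      # how far the model is from the baseline
    uplift[tag] = gap * pct          # accuracy gained if we close the gap on this tag
    print(f"{tag:16s} gap={gap:.0%} potential uplift={uplift[tag]:.1%}")

# The tag with the highest uplift is the best candidate to work on.
best = max(uplift, key=uplift.get)
```

With these numbers, ‘Dark Background’ comes out on top even though ‘Blurry’ covers more of the data, because its gap to the baseline is much larger.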
Step 2: Improve the Input Data
Let’s say we are focusing on improving the accuracy on the data tagged ‘Dark Background’.
We will now try to generate more data points that capture realistic examples on which our algorithm performs poorly but humans (or other baselines) do well.
We can follow one of these strategies to generate more data points:
- Collect more X-ray images having a dark background.
- Create synthetic data points: We can select the X-ray images without the dark background and synthetically add a dark background to these images.
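For the synthetic route, one simple way to simulate a darker background is to scale down pixel intensities. A minimal NumPy sketch (the `darken` function and the factor of 0.5 are illustrative choices, not from the original pipeline):

```python
import numpy as np

def darken(image: np.ndarray, factor: float = 0.5) -> np.ndarray:
    """Scale pixel intensities to simulate a darker image.

    Assumes a grayscale uint8 array of shape (H, W); factor < 1 darkens.
    """
    return (image.astype(np.float32) * factor).astype(np.uint8)

# Toy example: a uniformly bright 4x4 "X-ray" becomes half as bright.
img = np.full((4, 4), 200, dtype=np.uint8)
dark = darken(img, 0.5)
```

In practice you would apply a transform like this (or richer brightness/contrast augmentations from an image-augmentation library) to the images that do not already have a dark background.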
Step 3: Model Training
Step 3 is to train your model on this updated input dataset (pre-existing data points + newly created data points).
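Concretely, retraining on the updated dataset is just a matter of concatenating the old and new examples before fitting. A minimal scikit-learn sketch with dummy data (all variable names, shapes, and the stand-in classifier are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pre-existing data points (flattened images) and newly created dark-background ones.
X_orig = rng.random((100, 16))
y_orig = rng.integers(0, 2, size=100)
X_new = rng.random((20, 16)) * 0.5          # darker pixels on average
y_new = rng.integers(0, 2, size=20)

# Updated input dataset = pre-existing + newly created data points.
X_train = np.concatenate([X_orig, X_new])
y_train = np.concatenate([y_orig, y_new])

# Stand-in classifier; in the blog's setting this would be the pneumonia model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```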
Repeat all three steps until you reach human-level (or baseline) performance.
We (my team and I) are currently following this 3-step approach on one of our text classification use cases, and we have been able to increase the classification accuracy from 64% to 75%.
Conclusion
To improve model performance beyond hyperparameter tuning, we can use error analysis to identify the categories/tags on which the model underperforms compared to the baseline. Once we identify the tags to focus on, we can improve the input data corresponding to those tags and re-train our model. We can repeat this cycle to get to the expected accuracy (or performance metric).