
Have you ever tried many machine learning methods and hyperparameter configurations and still not found good results? This post might be for you!
There are many ways to improve the performance of your model, from choosing state-of-the-art architectures and increasing the number of epochs to changing the batch size and choosing the right optimizer. However, the first thing that comes to mind before creating a machine learning model is data. Interestingly, data can also be your last resort when you don’t know what else to do with your model after several experiments.
The real question is, how do you use data to improve the performance of your models? The answer is simple: you need good data. I emphasize again, as in the previous post, that good data does not have to come in large amounts. Although many of the latest models use huge datasets, sometimes a small but high-quality dataset is sufficient.
In this post you will learn how to improve your data both qualitatively and quantitatively.
Diagnose the Problem
Once the data is obtained, you need to make sure there is nothing wrong with it. There are several reasons data becomes problematic. The first is inconsistent labeling, which can occur due to human error or differences in perspective (when more than one person labels the data).
To fix this, check your data and correct any labels that are not quite right. I know it’s a tough process. If you can’t check all the data yourself, ask other people for help. Or, if more than one person does the labeling, each person can check the others’ work.
If that is still not possible, check a small portion of the data. If you find no problems, you can move straight on to the other options below. However, if your data does have problems, you have no choice but to fix it.
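If two people labeled (parts of) the same data, a quick way to find labels worth re-checking is to flag the examples where the annotators disagree. Here is a minimal sketch; the document IDs and labels are made up for illustration:

```python
# Hypothetical labels from two annotators for the same documents.
annotator_a = {"doc1": "positive", "doc2": "negative", "doc3": "neutral"}
annotator_b = {"doc1": "positive", "doc2": "neutral", "doc3": "neutral"}

# Examples where the two annotators chose different labels.
disagreements = [doc for doc in annotator_a if annotator_a[doc] != annotator_b[doc]]
agreement_rate = 1 - len(disagreements) / len(annotator_a)

print(disagreements)   # ['doc2'] — review these labels first
print(agreement_rate)  # fraction of documents the annotators agree on
```

A low agreement rate suggests the labeling guidelines themselves are ambiguous, not just that individual labels are wrong.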
There is no need to check everything in one go; you can start by looking at the current model’s scores. Find the class whose score is lowest, then prioritize checking the data for that class first. After that, train your model again and see the results.
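Finding the lowest-scoring class can be as simple as computing per-class accuracy from your model’s predictions on a validation set. A small sketch with invented labels and predictions (in practice you could also use `sklearn.metrics.classification_report` for per-class precision and recall):

```python
from collections import defaultdict

# Hypothetical gold labels and model predictions on a validation set.
y_true = ["laptop", "laptop", "phone", "phone", "tablet", "tablet"]
y_pred = ["laptop", "laptop", "phone", "tablet", "phone", "phone"]

correct = defaultdict(int)
total = defaultdict(int)
for gold, pred in zip(y_true, y_pred):
    total[gold] += 1
    correct[gold] += gold == pred  # True counts as 1

per_class_acc = {c: correct[c] / total[c] for c in total}
worst_class = min(per_class_acc, key=per_class_acc.get)

print(per_class_acc)  # {'laptop': 1.0, 'phone': 0.5, 'tablet': 0.0}
print(worst_class)    # 'tablet' — check this class's labels first
```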
Everything Must Be in Balance
The second cause of problematic data is an uneven distribution. Imbalanced data is often the reason models score poorly.
To restore balance, we can oversample, undersample, or both. When certain classes have fewer examples than others, you can simply duplicate their examples or oversample them using one of several methods, which will be discussed in the next few points. Conversely, when a class has more examples than the others, you can reduce its size. One thing to note: be careful with these methods, because they can sometimes cause overfitting.
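The simplest form of oversampling is random duplication of minority-class examples until every class reaches the size of the largest one. A minimal sketch with a toy dataset (libraries such as imbalanced-learn offer more sophisticated options like SMOTE):

```python
import random

random.seed(0)  # for reproducibility

# Toy imbalanced dataset: (text, label) pairs with invented labels.
data = [("a", "pos")] * 10 + [("b", "neg")] * 3

# Group examples by label.
by_label = {}
for item in data:
    by_label.setdefault(item[1], []).append(item)

# Oversample every class up to the size of the majority class.
target = max(len(items) for items in by_label.values())
balanced = []
for label, items in by_label.items():
    extra = random.choices(items, k=target - len(items))  # duplicate at random
    balanced.extend(items + extra)

counts = {label: sum(1 for _, l in balanced if l == label) for label in by_label}
print(counts)  # {'pos': 10, 'neg': 10}
```

Because duplicated examples are exact copies, the model can memorize them; that is the overfitting risk mentioned above.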
Variety is the Key
Once you have a balanced dataset, you can list which cases are already included in the current dataset and which are not. This is useful for filling in cases that are not yet in the training data. If increasing the amount of data is difficult, then increasing the variety of cases is a smart choice. The more your training data reflects the variety of real-world cases, the better the model will generalize and the higher its scores will be.
It is the same when you are about to face a test. You want to study as many chapters as you can because you want to get good grades, right? As with your model, the more it learns from the complete training data, the better the results will be.
Besides that, you will most likely study the chapters that match your test material. Your model works the same way, depending on the domain. Suppose your dataset covers the laptop domain; then you only need to add new cases related to laptops. Adding cases from other domains can actually hurt performance.
Improvise a Little
Even though there is plenty of data on the internet, finding data that suits your needs is not always easy. If that is your case, you can work around it by paraphrasing sentences that contain the cases which are difficult to find.
This can be done manually by rewriting sentences without distorting their meaning: use your own words, change the structure, or substitute synonyms for particular words. A more advanced approach to paraphrasing is to use a language model.
To help you get a sense of how language models work for paraphrasing, below are some useful resources:
Besides paraphrasing, we can also generate synthetic data using various data augmentation methods, from the simplest one, called back translation, to more complex methods that you can find here.
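To make the synonym-substitution idea concrete, here is a minimal augmentation sketch. The synonym map is tiny and hand-made purely for illustration; in practice you might draw synonyms from WordNet or generate paraphrases with a language model:

```python
import random

random.seed(1)

# Illustrative synonym map (an assumption, not a real resource).
synonyms = {
    "good": ["great", "decent"],
    "laptop": ["notebook"],
    "fast": ["quick", "speedy"],
}

def augment(sentence, n=2):
    """Create n variants of a sentence by swapping in random synonyms."""
    words = sentence.split()
    variants = []
    for _ in range(n):
        new = [random.choice(synonyms[w]) if w in synonyms else w for w in words]
        variants.append(" ".join(new))
    return variants

print(augment("this laptop is good and fast"))
```

Each variant keeps the sentence structure and meaning while changing surface wording, which adds variety to cases that are hard to collect.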
Learn from Others
Many methods have been developed to ease the effort of collecting data. This one is a little different: transfer learning allows you to take advantage of models trained on other tasks and apply them to your small dataset. Here are some articles to help you better understand the definition and uses of transfer learning:
- Transfer Learning – Machine Learning’s Next Frontier
- The State of Transfer Learning in NLP
- Transfer Learning In NLP
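The core idea — learn a representation from plentiful data, then reuse it for a small labeled task — can be sketched in a few lines. Here PCA stands in for a pretrained model purely for illustration; real transfer learning in NLP usually means fine-tuning a pretrained language model (e.g., with Hugging Face Transformers). All data below is randomly generated:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# "Source" data: plenty of unlabeled examples from a related domain.
source_X = rng.normal(size=(1000, 50))

# "Target" task: only a handful of labeled examples.
target_X = rng.normal(size=(20, 50))
target_y = rng.integers(0, 2, size=20)

# Step 1: learn a representation on the large source set.
pca = PCA(n_components=5).fit(source_X)

# Step 2: reuse that representation to train on the small labeled set.
clf = LogisticRegression().fit(pca.transform(target_X), target_y)

score = clf.score(pca.transform(target_X), target_y)
print(score)
```

The small model never has to learn features from scratch; it only learns how to combine features that the source data already paid for.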
Those are some of the approaches I have tried when my model got stuck and couldn’t reach good results. This may not cover every method for taking your data to the next level, but I hope it helps!
If you enjoyed reading this post and would like to hear more from me and other writers here, join Medium and subscribe to my newsletter. Or simply follow the links below. Thank you!