What attracts aspiring data scientists the most is the world of machine learning algorithms. That was certainly the case for me when I stepped into the field of data science.
It was so cool to talk about how I used a random forest regressor to predict used car prices. I spent a great amount of time tuning the hyperparameters to achieve a 1% performance improvement.
The world of machine learning algorithms is fancy. It has the potential to amaze people with a passion for data science. Thus, a substantial part of learning data science consists of machine learning algorithms. It would be hard to do otherwise.
There is nothing wrong with learning these algorithms. I definitely do not take a stand against it. However, focusing too much on the machine learning part may cause us to forget the most important point: understanding the data.
Whatever you do with data science, the first priority has to be understanding the data. You should always know the data like the back of your hand. Otherwise, failure is inevitable.
The algorithms may seem to be performing some fancy work or magic operations.
Nothing about a machine learning algorithm is magic. All they do is show what is in your data.
We still need the algorithms to explore some of the underlying structure or relationships within the data. But they cannot go beyond what is in the data. Their performance is limited by the quality and suitability of the data.
We tend to spend less time on exploratory data analysis and rush to run algorithms to see some action. Believe me, this is not the right approach. You should spend long hours understanding your data. There are several advantages to this approach, and I will try to elaborate on them in the remainder of the article.
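To make that concrete, a first pass at exploratory data analysis might look like the sketch below. It assumes a hypothetical used_cars.csv file with columns such as brand, mileage, year, and price; the point is the kinds of questions to ask, not the exact commands.

```python
# A minimal EDA sketch with pandas, assuming a hypothetical "used_cars.csv"
# file with columns such as brand, mileage, year, and price.
import pandas as pd

df = pd.read_csv("used_cars.csv")

# Basic shape and column types
print(df.shape)
print(df.dtypes)

# Summary statistics for numerical columns
print(df.describe())

# Missing values per column
print(df.isna().sum())

# Distribution of a categorical column
print(df["brand"].value_counts())

# Correlation of numerical features with the (assumed) target column
print(df.select_dtypes("number").corr()["price"].sort_values(ascending=False))
```

None of this is fancy, but the answers to these questions shape every decision that follows, from which features to build to which algorithm to try.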
Feature generation
Machine learning algorithms take features (i.e. columns) from the data and use them to learn from the data. What I mean by learning includes the relationship between the features and the target (in the case of supervised learning), the underlying structure of the data, the correlation among features, and so on.
Features are of crucial importance for model performance. Consider a simple case first. You want to create a model to predict used car prices. The first features that come to mind are the brand, mileage, and year of the car.
If the model is not provided with the brand, it is likely to predict similar prices for a Porsche and a Toyota, assuming the other feature values (e.g. mileage and year) are similar.
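As a rough illustration, the sketch below compares a model trained with and without the brand column on the same hypothetical used_cars.csv data. The file name, column names, and scores are assumptions made for the sake of the example.

```python
# A sketch comparing a model with and without the brand feature.
# The dataset and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("used_cars.csv")

# One-hot encode the categorical brand column for the full feature set
X_full = pd.get_dummies(df[["brand", "mileage", "year"]], columns=["brand"])
X_reduced = df[["mileage", "year"]]
y = df["price"]

model = RandomForestRegressor(random_state=42)

score_full = cross_val_score(model, X_full, y, cv=5, scoring="r2").mean()
score_reduced = cross_val_score(model, X_reduced, y, cv=5, scoring="r2").mean()

print(f"R^2 with brand:    {score_full:.3f}")
print(f"R^2 without brand: {score_reduced:.3f}")
```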
This is an obvious case. We all know what influences the price of a car. However, real life cases are much more complicated. Nobody hires a data scientist to create such a simple model.
We can only detect valuable or informative features by knowing the data. It requires an extensive exploratory data analysis process. In some cases, domain knowledge is also critical for feature generation. It is a part of "knowing your data".
Most algorithms evaluate features based on a feature importance metric. However, a model can only provide this information for the features fed into it. It cannot tell us what kind of features it expects or desires. That would be a cool science-fiction movie.
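For reference, this is roughly how a feature importance metric is read off a fitted model in scikit-learn, continuing the hypothetical used-car example from above.

```python
# A sketch of inspecting feature importances from a fitted random forest,
# continuing the hypothetical used-car example.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("used_cars.csv")
X = pd.get_dummies(df[["brand", "mileage", "year"]], columns=["brand"])
y = df["price"]

model = RandomForestRegressor(random_state=42).fit(X, y)

# Importance is reported only for the columns that were fed into the model
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Notice that the output is limited to the columns in X. A feature you never generated never appears here, which is exactly why the generation step depends on knowing the data rather than on the model.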
Which algorithm to choose
"Knowing your data" can also help you choose an algorithm wisely. There are many machine learning algorithms for both supervised and unsupervised tasks. They all have their pros and cons.
How an algorithm performs also depends on the characteristics of your data. You cannot just apply a single algorithm to any dataset.
Consider you are working on a clustering problem. The K-means algorithm is a popular choice in this domain and performs well in most cases. However, if the data points are grouped in a way that cannot be captured with circular (spherical) clusters, a Gaussian mixture model can be a better option.
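Here is a small sketch of that situation on synthetic data: the clusters are stretched so they are no longer circular, and the Gaussian mixture model typically recovers the true grouping better than K-means. The data generation and the exact scores are only illustrative.

```python
# A sketch contrasting K-means and a Gaussian mixture model on
# synthetic, elongated (non-circular) clusters.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# Generate circular blobs, then stretch them into elongated clusters
X, y_true = make_blobs(n_samples=600, centers=3, random_state=42)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)

# Agreement with the true grouping (1.0 is a perfect match)
print("K-means ARI:", adjusted_rand_score(y_true, kmeans_labels))
print("GMM ARI:    ", adjusted_rand_score(y_true, gmm_labels))
```

Without looking at the data first, you would have no reason to suspect that the popular default is the wrong tool here.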
Once you have a comprehensive understanding of the data at hand, it becomes relatively easy to choose and tune an algorithm. Most machine learning algorithms have knobs that can be used to tune them. These knobs are known as hyperparameters.
Hyperparameter tuning is usually done by trying different values for each hyperparameter. This process, of course, should not be totally random; you focus on a particular range. Knowing your data well helps you better estimate how the model will react to tuning different hyperparameters.
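As an example, a focused search might look like the sketch below, again on the hypothetical used-car data. The specific ranges are assumptions; the idea is that they come from what you already know about the data (its size, noise, and number of features) rather than from a blind, exhaustive grid.

```python
# A sketch of searching over a focused hyperparameter range
# on the hypothetical used-car data.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

df = pd.read_csv("used_cars.csv")
X = pd.get_dummies(df[["brand", "mileage", "year"]], columns=["brand"])
y = df["price"]

# Ranges narrowed by what EDA suggested about the data, not every possible value
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=5, scoring="r2")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```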
Better evaluation
Model creation is an iterative process. It usually takes several iterations to obtain satisfying results. Each iteration should be an improved version of the previous one so that we can keep on increasing the performance of the model.
"Knowing your data" provides highly valuable insight into what to change at each iteration. You have an idea about how the model can react to a particular change.
There will be cases where you have a hard time figuring out the source of a problem. It can be challenging to determine why your model performs badly. In such cases, the data is where you should be looking first.
Conclusion
Everything about data science starts with data. Your model is only as good as the data you feed into it. The success of your data product or app largely depends on the quality and suitability of the data.
In any case or task, the first step is the data. Taking the first step well has a positive effect on the remaining parts. Thus, it is of crucial importance to have a comprehensive understanding of the data, or "know the data".
Thank you for reading. Please let me know if you have any feedback.