Take a moment away from the fancy algorithms

When I first started to learn about data science, the support vector machine was my favorite algorithm. It was, of course, not the best one out there, but it had a cool name. Besides, the way it dynamically changes its classification strategy with the C and gamma parameters amazed me.
Machine learning sounds appealing and charming, and I think that plays a key role in driving a lot of people to the field of data science. All those algorithms with fancy names thrill newcomers.
I see the algorithms as the shining surface of the machine learning box. They are carefully designed and structured to solve problems with data. The success of the algorithms sometimes makes us forget about what is in the box.
In order to solve problems with machine learning, we need to have a comprehensive understanding of what lies inside the box: the basic principles and concepts that are essential to implementing machine learning algorithms successfully.
They might be considered the basics, but they are of crucial importance for the performance and accuracy of machine learning algorithms.
Your model is only as good as the data
What machine learning algorithms do may sound like magic, but they only capture what is in the data. There are critical pieces of information in the data that are hard or impossible for the human eye to catch.
Machine learning algorithms, if applied correctly, capture this embedded information and let us use it to solve problems. What they cannot do is work magic and go beyond what is in the data.
This brings us to the most important piece of a data science product: the data. We need to spend much more time on the data than we do on the models. I’m not just talking about cleaning and reformatting the data.
The informative power comes from the data, and the features should provide valuable insight. That is why a substantial amount of time and effort goes into preprocessing the raw data.
We do not have to limit ourselves to the features in the raw data. Feature engineering techniques help us create more informative features from the existing ones.
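Here is a minimal sketch of what that can look like with pandas. The dataset and column names are made up for illustration; the point is that the derived columns often carry more signal than the raw ones.

```python
import pandas as pd

# Hypothetical raw data: one row per order (column names are assumptions for illustration).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "order_time": pd.to_datetime(
        ["2023-01-05 09:12", "2023-01-05 21:40", "2023-01-06 14:03", "2023-01-07 08:55"]
    ),
    "total_price": [42.0, 13.5, 88.0, 27.0],
    "n_items": [3, 1, 4, 2],
})

# Feature engineering: derive more informative features from the existing columns.
orders["price_per_item"] = orders["total_price"] / orders["n_items"]  # ratio feature
orders["order_hour"] = orders["order_time"].dt.hour                   # time-of-day feature
orders["is_weekend"] = orders["order_time"].dt.dayofweek >= 5         # calendar feature

print(orders[["price_per_item", "order_hour", "is_weekend"]])
```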
Data leakage
Data leakage occurs when the training data contains information about the target that will not be available at prediction time. It sounds like "cheating", but since we are usually not aware of it, it is better to call it "leakage".
Data leakage is a serious and widespread problem in data mining and machine learning, and it needs to be handled carefully to obtain a robust, generalizable model.
The most obvious cause of data leakage is including the target variable as a feature, which completely defeats the purpose of "prediction". Using test data in the training process is another example. These obvious cases usually happen by mistake.
Let's also talk about a less obvious example. Consider a model that predicts whether a user will stay on a website. Including features that expose information about future visits will cause data leakage.
We should only use features about the current session, because information about future sessions is not available once the model is deployed.
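As a small sketch of what this means in practice, assume a hypothetical session-level DataFrame (the column names below are invented for illustration). Anything that refers to the future has to be excluded from the features:

```python
import pandas as pd

# Hypothetical session-level data; the column names are assumptions for illustration.
sessions = pd.DataFrame({
    "session_length_sec": [120, 45, 300],
    "pages_viewed": [5, 2, 9],
    "visits_next_week": [3, 0, 1],   # future information -> not available at prediction time
    "stayed": [1, 0, 1],             # target: did the user stay on the website?
})

# Keep only the features that are known during the current session.
leaky_columns = ["visits_next_week"]
X = sessions.drop(columns=leaky_columns + ["stayed"])
y = sessions["stayed"]

print(X.columns.tolist())  # ['session_length_sec', 'pages_viewed']
```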
Data leakage may also occur during data cleaning and preprocessing. Below is a list of common operations for cleaning and preprocessing the data.
- Finding parameters for normalizing or rescaling the data
- Computing the min/max values of a feature
- Estimating missing values from the distribution of a feature
- Removing outliers
These steps should be done using the training set only. If we use the entire dataset to perform them, data leakage occurs: the model learns not only from the training set but also from the test set, which totally defeats the purpose of prediction.
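As a sketch of how to avoid this with scikit-learn (on synthetic data, just for illustration): split first, fit the preprocessing on the training split only, and ideally wrap everything in a Pipeline so it is hard to get wrong.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Synthetic data, just for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Split first, then learn the scaling parameters from the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)      # mean/std come from the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # the test set is transformed, never fitted on

# A Pipeline bundles preprocessing and model, so cross-validation refits the scaler per fold.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```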
Bias and variance
Bias occurs when we try to approximate a complex relationship with a relatively simple model. In such cases, we end up with a model that fails to capture the structure and relationships in the data.
Simple is good, but too simple is dangerous. The performance of a high-bias model is limited: even if we have millions of training samples, we will not be able to build an accurate model.
The errors of a biased model tend to lean in a consistent direction. For instance, in a regression task, a biased model might systematically under-predict in one part of the input range and over-predict in another. Models with high bias are likely to underfit the data.
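To make this concrete, here is a small sketch on synthetic data: a straight line fitted to a quadratic relationship over-predicts in the middle of the input range and under-predicts at the edges.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic non-linear data: y depends quadratically on x.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
y = x ** 2 + rng.normal(scale=0.3, size=500)

# A straight line is too simple for this relationship (high bias / underfitting).
model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - model.predict(x.reshape(-1, 1))

# The errors lean in a consistent direction depending on where we are in the input space.
print("mean error for |x| < 1:", residuals[np.abs(x) < 1].mean())  # negative -> over-predicting
print("mean error for |x| > 2:", residuals[np.abs(x) > 2].mean())  # positive -> under-predicting
```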
Contrary to bias, variance occurs when a model is too complex with respect to the structure in the data. The model tries to pick up each and every detail; it even learns the noise in the data, which may be purely random.
Models with high variance are sensitive to small changes in feature values, so they do not generalize well. This is known as overfitting: the model fits the training data very closely but fails to capture the actual relationships within the dataset.
Neither high bias nor high variance is good. The ideal model has both low bias and low variance, but such a model is very hard to find, if it exists at all.
There is a trade-off between bias and variance, and a key skill for a machine learning engineer is finding the right balance between them.
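A quick sketch on synthetic data (degrees and noise level chosen arbitrarily for illustration): as we increase the complexity of a polynomial model, training error keeps falling, while test error usually falls and then rises again.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a smooth non-linear signal plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=200).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=200)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Model complexity is controlled by the polynomial degree.
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:>2}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")

# degree 1  -> high bias: too simple, both errors stay high (underfitting)
# degree 4  -> a reasonable balance between bias and variance
# degree 15 -> high variance: fits the training data more closely but tends to generalize worse
```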
Conclusion
In order to create successful machine learning products, our focus should go beyond the algorithms. The basic principles described above are essential to the success of the algorithms, and we need to keep them in mind when solving problems with machine learning. Otherwise, even a state-of-the-art model will not produce satisfying results.
Thank you for reading. Please let me know if you have any feedback.