
Roadmap to Becoming a Data Scientist, Part 3: Machine Learning

From beginner to pro: key machine learning skills for data science aspirants

Introduction

Data Science is undoubtedly one of the most fascinating fields today. Following significant breakthroughs in machine learning about a decade ago, data science has surged in popularity within the tech community. Each year, we witness increasingly powerful tools that once seemed unimaginable. Innovations such as the Transformer architecture, ChatGPT, the Retrieval-Augmented Generation (RAG) framework, and state-of-the-art computer vision models – including GANs – have had a profound impact on our world.

However, with the abundance of tools and the ongoing hype surrounding AI, it can be overwhelming – especially for beginners – to determine which skills to prioritize when aiming for a career in data science. Moreover, this field is highly demanding, requiring substantial dedication and perseverance.

The first two parts of this series focused on the essential math and software engineering skills needed to become a data scientist. In this part, we will dive into what is probably the most exciting topic of all: the necessary machine learning skills!

This article will focus solely on the machine learning skills necessary to start a career in Data Science. Whether pursuing this path is a worthwhile choice based on your background and other factors will be discussed in a separate article.

Roadmap to Becoming a Data Scientist, Part 1: Maths

Roadmap to Becoming a Data Scientist, Part 2: Software Engineering

Maths + Engineering → Machine learning

Machine learning is a very diverse domain, but to be successful in it, it is essential to have solid skills in both math and software engineering.

  • Math knowledge reinforces a deep understanding of the logic behind algorithms, which is useful for choosing better solutions, easier debugging, and grasping more complex ideas.
  • Software engineering allows for the efficient implementation of algorithms and pipelines in code using the best development practices.
Artificial Intelligence vs Machine Learning vs Deep Learning

01. Introduction

Before diving directly into algorithms, it is necessary to understand several fundamental building blocks. The first is the definition of machine learning itself: how it differs from artificial intelligence, and what makes a machine learning algorithm so distinct from a conventional algorithm.

Due to the variety of machine learning methods, it is essential to distinguish the high-level differences between the most important methods:

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning
Roadmap for getting started in machine learning

After that, learners should understand the main types of problems, which include classification, regression, ranking, clustering, dimensionality reduction, recommendation, etc. In most courses, the initial focus is typically on supervised learning and how it is used to solve classification and regression problems. Other learning methods and problem types are usually considered more advanced topics and are studied later.

Apart from that, before studying concrete algorithms, learners should understand how the input data for those algorithms can be represented. In particular, this applies to the tabular format, which is frequently used. Terms like dataset, target, features, and objects should be clear from the beginning.
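As a quick illustration of this terminology, here is a minimal sketch using a toy pandas DataFrame; the column names and values are hypothetical:

```python
import pandas as pd

# A toy tabular dataset: each row is an object, each column is a feature,
# and "price" is the target we want to predict
df = pd.DataFrame({
    "area":  [45, 60, 80, 120],    # feature
    "rooms": [1, 2, 3, 4],         # feature
    "price": [90, 130, 180, 260],  # target
})

X = df[["area", "rooms"]]  # feature matrix (objects x features)
y = df["price"]            # target vector
print(X.shape, y.shape)    # (4, 2) (4,)
```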

Finally, the last important topic in this chapter involves the evaluation of algorithms. It is necessary to study the main evaluation metrics and techniques to be comfortable later when estimating how good or bad a given algorithm is or when comparing several of them.
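For instance, a minimal sketch of the standard evaluation workflow in scikit-learn (holding out a test set and computing metrics on it) might look like this; the model and metric choices are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Hold out part of the data so the metrics reflect unseen objects
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print("f1:", f1_score(y_test, y_pred))
```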

02. Classical machine learning

After building a solid foundation in machine learning, it is time to learn the main algorithms that work on tabular data. Not only are these algorithms widely used on tabular data, but they also play an important role in introducing smart concepts and ideas that can be reused in more complex algorithms and domains.

Classical machine learning roadmap

The first essential algorithm to study is linear regression. Linear regression is typically trained with stochastic gradient descent (SGD), whose goal is to find the algorithm parameters that minimize a given loss function. It is hard to imagine other optimization algorithms, or the AI domain as a whole, without SGD, as a significant number of algorithms rely on it to find optimal weights. Studying linear regression is also an excellent opportunity to familiarize yourself with the most commonly used loss functions in machine learning.
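To make this concrete, here is a minimal from-scratch sketch of SGD fitting a one-dimensional linear regression under the mean squared error (MSE) loss; the synthetic data and the learning rate are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 2 plus a little noise
X = rng.uniform(-1, 1, size=200)
y = 3 * X + 2 + rng.normal(0, 0.1, size=200)

w, b = 0.0, 0.0  # parameters to learn
lr = 0.1         # learning rate

for epoch in range(100):
    for i in rng.permutation(len(X)):       # one object at a time -> "stochastic"
        error = (w * X[i] + b) - y[i]       # d(MSE)/d(prediction), up to a constant factor
        w -= lr * error * X[i]              # gradient step for the weight
        b -= lr * error                     # gradient step for the bias

print(w, b)  # should end up close to 3 and 2
```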

Next comes the support vector machine (SVM). Although SVM is rarely used in practice due to its slow performance on large datasets, it still introduces the interesting concept of the kernel trick. This allows the transformation of initially linearly inseparable data into a new space where the data points can be easily separated.

Kernel Trick. The initial inseparable data in the 1D space (on the left) is transformed into a 2D space, where it gains a new dimension y = x² and becomes separable.
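To see the kernel trick in action, here is a small scikit-learn sketch on synthetically generated concentric circles; the dataset and kernel choices are illustrative:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)  # kernel trick: implicit higher-dimensional space

print("linear kernel accuracy:", linear_svm.score(X, y))  # around chance level
print("RBF kernel accuracy:", rbf_svm.score(X, y))        # close to 1.0
```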

The next family of algorithms worth exploring are tree-based algorithms, starting with the decision tree. A decision tree can recursively split data into two subsets based on a chosen binary condition at each tree node. As a result, when a new object is given for prediction, it is run through the entire structure of the decision tree to ultimately reach the leaf node corresponding to the predicted class.
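A short scikit-learn sketch makes this structure visible; the dataset and the depth limit below are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each node holds a binary condition on one feature;
# a new object follows these conditions down to a leaf
print(export_text(tree))
```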

Traditionally, after decision trees, the next topic is random forests, which consist of a set of decision trees. Given that a single decision tree can make errors in its predictions, a random forest improves the overall system by constructing several trees that can "vote" for the best prediction, thus reducing the overall error probability. The concepts of voting and bagging introduced in random forests can also be applied to any base algorithm (not just decision trees) to make the entire system more resilient to errors.

Voting Example: The input vector is sent to several models, each of which individually makes predictions. The most frequent prediction is selected as the final output.
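As a rough sketch of both ideas, the snippet below trains a random forest and, separately, a voting ensemble over three unrelated base algorithms; all model choices are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# A random forest: many decision trees voting together
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The same voting idea applied to arbitrary base algorithms
voter = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=0)),
]).fit(X, y)

print(forest.score(X, y), voter.score(X, y))
```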

One of the most, if not the most, powerful algorithms for tabular data is gradient boosting. Like random forests, it combines several base algorithms, but it does so differently. Instead of aggregating predictions from several strong algorithms run independently, gradient boosting creates a sequential structure of weak algorithms. Each subsequent algorithm learns to reduce the accumulated error produced by the previous algorithms. The most popular variation of gradient boosting uses decision trees as the base algorithm.
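A minimal scikit-learn sketch of gradient boosting, with shallow trees as the weak learners, might look as follows (the hyperparameter values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Shallow trees are the weak learners; each new tree fits
# the residual error left by the previous ones
gb = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
gb.fit(X_train, y_train)
print("R^2 on test data:", gb.score(X_test, y_test))
```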

Finally, I would recommend looking at the k-nearest neighbours (kNN) algorithm as one of the best and simplest examples demonstrating the fundamental difference between parametric and non-parametric algorithms. Unlike the parametric algorithms discussed previously, kNN does not learn any parameters. Instead, it relies on certain assumptions about the data and predicts the class of a new object based on the classes of the most similar objects from the training dataset.

kNN algorithm
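In scikit-learn, the non-parametric nature of kNN is easy to observe: fitting essentially amounts to storing the training set. A minimal sketch, with an arbitrary choice of k:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# No parameters are learned: fit() merely stores the training data,
# and predictions come from the k most similar stored objects
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))
```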

At the same time, it is also necessary to learn several techniques for performing data analysis and processing, such as exploratory data analysis (EDA), feature engineering, one-hot encoding, and addressing issues related to class imbalance.
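As a small illustration of two of these techniques, the sketch below one-hot encodes a made-up categorical column with pandas and then reweights classes to counter imbalance; the data and model choice are purely for demonstration:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One-hot encoding turns a categorical column into binary columns
df = pd.DataFrame({"city": ["Paris", "London", "Paris", "Berlin"]})
print(pd.get_dummies(df, columns=["city"]))

# One common answer to class imbalance: reweight the loss
# so that errors on the rare class cost more
model = LogisticRegression(class_weight="balanced")
```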

Finally, another important concept is hyperparameter tuning. At this stage, it is sufficient to understand what it is and to be able to implement the grid search strategy in code. Grid search is one of the simplest methods to adjust algorithms and improve their performance.
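A minimal grid search sketch with scikit-learn's GridSearchCV might look like this; the model and the parameter grid are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Try every combination of the listed hyperparameter values
# and keep the one with the best cross-validated score
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 4, None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```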

Personal advice

While learning all of these algorithms, it is also important to understand how the same algorithm can be adjusted to be used for both classification and regression tasks.

These algorithms might seem challenging at first for beginners, as they are very different from the classic algorithms used in computer science. One of the best pieces of advice for gaining a deep understanding of their workflow is to implement them manually in code without relying on any libraries.

Libraries and frameworks

In the machine learning industry, most of the time developers use pre-implemented algorithms from standard Python libraries. Given this, it is important to know how to use them in practice. Luckily, the majority of libraries provide a very easy-to-understand interface, so anyone with even basic coding skills can train and use machine learning models.

Python libraries: Scikit-learn (machine learning), Pandas (data analysis), NumPy (linear algebra)

All of the algorithms discussed in this section are implemented in the Scikit-learn package in Python. Additionally, to perform basic data manipulations and run data analysis, it is necessary to know Pandas. Finally, it might also be worth exploring NumPy, a well-known Python package used for linear algebra tasks.

03. Deep learning

Deep learning is a subset of machine learning that focuses on solving problems with neural networks. A neural network is typically represented as a stack of fully connected layers of perceptrons: the input layer receives the dataset features, the intermediate layers transform them, and the output layer produces the predicted result.
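For concreteness, here is a minimal sketch of such a fully connected network in PyTorch (one of the frameworks discussed later in this article); the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

# A minimal fully connected network: input layer -> hidden layer -> output
model = nn.Sequential(
    nn.Linear(4, 16),   # 4 input features -> 16 hidden units
    nn.ReLU(),          # non-linear activation
    nn.Linear(16, 3),   # 16 hidden units -> 3 output classes
)

x = torch.randn(8, 4)   # a batch of 8 objects with 4 features each
print(model(x).shape)   # torch.Size([8, 3])
```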

To be confident when working with neural networks, it is essential to fully understand their learning process. In fact, a neural network can be viewed as a very complex mathematical function with a large number of parameters. As was done with linear regression, we can apply the SGD algorithm to perform model updates and ultimately find the best neural network parameters. Easy, right?

However, the reality is quite different, and there is a whole theory dedicated to training neural networks because the simple SGD algorithm is usually not enough.
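Still, the canonical training loop is worth internalizing early. Below is a minimal PyTorch sketch on random data; the model, loss, and learning rate are illustrative:

```python
import torch
import torch.nn as nn

# Toy data and a one-layer model
X = torch.randn(100, 4)
y = torch.randn(100, 1)
model = nn.Linear(4, 1)

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(50):
    optimizer.zero_grad()        # reset gradients from the previous step
    loss = loss_fn(model(X), y)  # forward pass
    loss.backward()              # backward pass: compute gradients
    optimizer.step()             # SGD update of the parameters
```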

Deep learning roadmap

First, it is necessary to understand the details of the forward and backward propagation algorithms for training a neural network, along with their vectorization techniques, which can be applied to accelerate the training process. Computational graphs and the chain rule of differentiation (which should be learned earlier in calculus theory) play a crucial role in backpropagation.
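A tiny PyTorch example shows how the computational graph and the chain rule come together in backpropagation; the function here is made up for illustration:

```python
import torch

# PyTorch builds a computational graph during the forward pass
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2       # y = x^2
z = 3 * y + 1    # z = 3x^2 + 1

z.backward()     # backpropagation applies the chain rule: dz/dx = 6x
print(x.grad)    # tensor(12.)
```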

Next, learners should study different types and properties of activation functions, which transform the original linear neural network into a more complex set of mathematical operations, enabling the solution of more sophisticated tasks.
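As a quick sketch, here are three common activation functions applied to the same tensor (the input values are arbitrary):

```python
import torch

x = torch.linspace(-2, 2, 5)

print(torch.relu(x))     # max(0, x): zeroes out negative inputs
print(torch.sigmoid(x))  # squashes values into (0, 1)
print(torch.tanh(x))     # squashes values into (-1, 1)
```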

Deep learning optimizers and learning rate schedulers also play a key role in modern neural networks, allowing them to converge to the optimum much faster. The most important optimizers to study are Momentum, RMSProp, AdaGrad, and Adam.
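As a minimal illustration, the sketch below pairs the Adam optimizer with a step-based learning rate scheduler in PyTorch; all hyperparameter values and the toy data are arbitrary:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()

# Adam adapts the step size per parameter; the scheduler then
# halves the base learning rate every 10 epochs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

X, y = torch.randn(32, 4), torch.randn(32, 1)
for epoch in range(30):
    optimizer.zero_grad()
    loss_fn(model(X), y).backward()
    optimizer.step()
    scheduler.step()

print(scheduler.get_last_lr())  # 1e-3 halved three times -> 1.25e-4
```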

Due to the complex structures of neural networks, vanishing and exploding gradients can become significant problems that prevent the network from learning. Therefore, it is essential to know how to handle such situations. One such method involves using skip connections.
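Here is a minimal sketch of a residual block with a skip connection in PyTorch; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # The skip connection adds the input to the block's output,
        # giving gradients a direct path around the layers
        return x + self.layers(x)

block = ResidualBlock(16)
print(block(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```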

Finally, to reduce the chance of overfitting, it is necessary to apply standard regularization techniques, which generally include batch normalization, weight decay, and dropout.
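The sketch below combines all three techniques in a small PyTorch model; the architecture and hyperparameter values are illustrative only:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 16),
    nn.BatchNorm1d(16),  # normalizes activations across the batch
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes half the units during training
    nn.Linear(16, 1),
)

# Weight decay penalizes large weights directly in the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```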

Personal advice

In contrast to classic machine learning algorithms, I would generally advise beginners against implementing neural networks from scratch.

Admittedly, doing so is an excellent way to gain a deeper understanding of how neural networks function in practice. The problem is that deep learning algorithms are much harder to implement than the algorithms we discussed earlier, and they may require a significant amount of time that could be better spent on other theoretical aspects.

Regardless of the decision you ultimately make, I believe it is vital to understand the theoretical concepts of deep learning described above.

Libraries and frameworks

The three best-known state-of-the-art Python frameworks for working with neural networks are PyTorch, TensorFlow, and Keras. Learners often ask which of the three they should choose for implementing their own networks.

The most popular deep learning frameworks: PyTorch, TensorFlow, Keras

In reality, when building a neural network, there is not much difference in terms of which of these three frameworks is used, as they all provide essentially the same functionality for basic tasks. Moreover, the code often looks almost identical in all of them.
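As a rough illustration, here is the same tiny two-layer network expressed in both PyTorch and Keras; the layer sizes are arbitrary, and both libraries must be installed for the sketch to run:

```python
import torch.nn as nn
from tensorflow import keras

# The same two-layer network expressed in PyTorch...
torch_model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)

# ...and in Keras: the structure maps almost one-to-one
keras_model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),
])
```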

The choice of framework might matter later if you work on a project that uses a particular one, or if you are an advanced researcher who needs very specific functionality not implemented elsewhere. For beginners, however, any framework is a good choice as long as the developer understands what happens behind the scenes when constructing and later training a neural network architecture.

Keras is built on top of TensorFlow and provides the simplest functionality for those taking their first steps in deep learning. However, in the long term, I would encourage learners to choose between TensorFlow and PyTorch.

Conclusion

In this article, we have covered the necessary machine learning theoretical blocks that every data scientist or machine learning engineer should know.

If you have gained solid knowledge of the math, software engineering, and machine learning skills described in the first three articles of this series, then you should be confident enough to consider yourself at least a Junior data scientist. Despite this significant achievement, it may still be challenging to find a job in today's highly competitive data science market.

If you have reached this point, you should now be able to pick up and study more advanced machine learning topics that will expand your expertise and contribute to your professional growth.

These topics and new domains will be discussed in the fourth part of this series.

Resources

All images are by the author unless noted otherwise.

