If you’re just starting out with Data Science, you’ll quickly come into contact with Machine Learning as well. Below we have listed eight basic Machine Learning algorithms that every data scientist should know and understand. If you realize while reading this post that you don’t know an algorithm at all, or not well enough yet, feel free to use the posts linked in each section. There you will find detailed explanations, applications, as well as advantages and disadvantages of the individual algorithms.
1. Linear Regression
Regression is used to establish a mathematical relationship between two variables x and y. Both statistics and the fundamentals of machine learning deal with how a variable y can be described by one or more variables x. Here are some examples:
- What is the influence that learning time (= x) has on a final exam grade (= y)?
- Does the actual harvest in agriculture (= y) depend on the amount of plant fertilizer used (= x)?
- Does the change in the number of police officers (= x) explain the increase in the city’s crime rate (= y)?
The variable y is also called the dependent variable or regressand since its change depends on outside factors.
The variable x, on the other hand, is known as the independent variable or regressor since it is the possible cause for a change in the dependent variable.
For linear regression, we try to find a linear mathematical relationship describing the influence x has on y. In doing so, the algorithm fits a straight line to the training data set by minimizing the overall distance between the line and all data points.
In case we have only one independent variable, these are the parameters that the linear regression needs to fit:
- β0: Intercept with the y-axis, e.g. the exam grade that would be achieved without studying.
- β1: Slope of the regression line, e.g. the influence that an additional hour of studying has on the exam grade.
- u: Error term, e.g. all influences that have an effect on the exam grade but are not captured by the learning time, e.g. prior knowledge.
- y: Variable, which one would like to predict with the help of linear regression, in our case the final exam grade.
- x: Variable that is used as the basis of the prediction and has an effect on y, in this case, the hours learned for the exam.
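Putting these pieces together, the simple linear regression equation is commonly written as y = β0 + β1 · x + u. As a minimal illustration (not from the original post, with made-up numbers for study hours and grades), fitting such a line with scikit-learn could look like this:

```python
# Minimal sketch: fit a straight line to made-up study-time / exam-grade data.
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1], [2], [3], [5], [8]])    # x: hours studied
grades = np.array([3.7, 3.3, 2.7, 2.0, 1.3])   # y: final grade (lower = better on the German scale)

model = LinearRegression().fit(hours, grades)
print(model.intercept_)      # estimate of beta_0 (grade without studying)
print(model.coef_[0])        # estimate of beta_1 (effect of one extra hour)
print(model.predict([[4]]))  # predicted grade after 4 hours of studying
```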
2. Logistic Regression
Logistic Regression is a subcategory of regression analysis that makes it possible to use categorically scaled dependent variables. That means the regressand can only take on a limited number of values. The result of a Logistic Regression is then interpreted as the probability that a data point belongs to a certain class.
In Logistic Regression, the basis is still to fit a regression equation to a given set of data points. In this case, however, we cannot use a linear equation anymore since the regressand is categorical. Let’s see why by using a real-life example:
Our model should predict the likelihood of a given individual purchasing an e-bike. The independent variable that it uses is the individual’s age. The data set that we get by interviewing random people on the streets is as follows:
From this graph, we can infer that young adults are very unlikely to buy an e-bike (bottom-left corner), whereas most seniors own one (top-right corner). Even though there are outliers in both age groups, we would expect a model to infer that, with growing age, one becomes more likely to buy an e-bike. But how can we define that in a mathematical way?
First, we need to find a mathematical function that represents the point distribution in the diagram, and that is obviously not a linear one. What we need is a function whose values lie between 0 and 1 on the y-axis, since there are two groups that can be represented by 0 and 1. The mathematical function we are looking for is the sigmoid function:
Its functional equation is as follows:
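f(x) = 1 / (1 + e^(-x))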
In the case of the e-bike example, this would transform to:
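P(e-bike purchase) = 1 / (1 + e^(-(a + b · age)))
Here, a and b are the parameters that still need to be fitted, as described below.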
Now we have the basis for our Logistic Regression, and we can start fitting the parameters a and b iteratively using our data set. That way, we end up with a function that fits the given data as well as possible:
In practice, the notation is usually different from the one you have seen so far: the equation is rearranged into a form similar to the linear regression:
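ln( p / (1 - p) ) = a + b · x
Here, p is the predicted probability, and the left-hand side is the so-called log-odds. As a minimal illustration (with made-up data, not from the original post), fitting such a model with scikit-learn could look like this:

```python
# Minimal sketch: logistic regression on made-up age / e-bike-purchase data.
import numpy as np
from sklearn.linear_model import LogisticRegression

ages = np.array([[18], [22], [25], [40], [55], [63], [70], [75]])  # x: age
buys_ebike = np.array([0, 0, 0, 0, 1, 1, 1, 1])                    # y: 0 = no, 1 = yes

model = LogisticRegression().fit(ages, buys_ebike)
# Estimated probability that a 60-year-old buys an e-bike:
print(model.predict_proba([[60]])[0, 1])
```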
3. Support Vector Machine
Support Vector Machines (SVMs) are Machine Learning algorithms with a strong mathematical foundation. They are used for classification tasks, such as image classification. There, they have advantages over neural networks since they can be trained quickly and do not need a lot of training data to achieve good results.
We can use the SVM for the following data set in which we have two classes (blue and yellow) as well as two features (x-axis and y-axis). Our algorithm should learn the color classification of our training data set. In the two-dimensional space the data set could look as follows:
The Support Vector Machine tries to fit a so-called hyperplane that separates the two groups as well as possible. Depending on the number of dimensions, the hyperplane can be a line, as in our two-dimensional case, or a plane (or its higher-dimensional equivalent) in more dimensions.
A well-fitted hyperplane can then be used to separate the objects into classes. In our case, all data points lying to the left of the plane belong to the class "yellow" and all points to the right to the class "blue".
The SVM uses different training runs to fit the hyperplane as optimally as possible. The goal is to maximize the margin, which is the distance from the nearest points of each group to the hyperplane, meaning that the plane should lie exactly in the middle between the two groups.
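As a minimal illustration (with made-up two-dimensional points, not from the original post), fitting a linear SVM with scikit-learn could look like this:

```python
# Minimal sketch: a linear SVM separating two classes with two features.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 3.0],   # "yellow" points
              [6.0, 5.5], [7.0, 7.0], [6.5, 6.0]])  # "blue" points
y = np.array(["yellow", "yellow", "yellow", "blue", "blue", "blue"])

clf = SVC(kernel="linear")  # with a linear kernel, the hyperplane is a straight line in 2D
clf.fit(X, y)
print(clf.predict([[2.0, 2.5], [6.8, 6.2]]))  # expected: ['yellow' 'blue']
```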
4. Decision Tree
The Decision Tree is a well-known tool for decision-making that has found its way into the Machine Learning world. Its tree-like structure learns multiple decision stages and their possible response paths. Decision Trees can be used for regression analyses or classifications.
The tree has three major components which are the Root, Branches, and Nodes. The following example is a Decision Tree for whether we should do sports outside or not. It helps to better understand the structure:
The node "Weather" is the root node in our example. It is used as the starting point for our decision-making process. A tree usually has just one root node so that all decisions share a common starting point.
From the root node, there are in total three branches (sunny, cloudy, rainy). Two of them are followed by new nodes (sunny and rainy). At these nodes, one needs to make another decision. Only when the weather is cloudy, one ends up at a result or so-called leaf. This means that when the weather is cloudy, one should always go outside for exercising, at least that’s what the Decision Tree tells us.
During sunny or rainy weather, there is another decision to make. If the weather is sunny, the decision depends on the humidity, which can take on the values "high" or "normal": sunny weather paired with high humidity is not good weather for doing sports outside, while with normal humidity one can go out and exercise.
During rainy weather, there is another important branch for the decision. In this case, the wind becomes more important in decision-making. From there, we can infer two rules. If the weather is rainy and the wind is strong, one should not do sports outside. However, with rainy weather and weak wind, one can go outside to do some sport.
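As a minimal illustration (not from the original post), the decision logic described above can be written down as plain Python rules:

```python
# Minimal sketch: the example tree (weather -> humidity / wind -> decision) as code.
def do_sports_outside(weather: str, humidity: str = "", wind: str = "") -> bool:
    if weather == "cloudy":
        return True                   # leaf: always go outside
    if weather == "sunny":
        return humidity == "normal"   # high humidity -> stay inside
    if weather == "rainy":
        return wind == "weak"         # strong wind -> stay inside
    raise ValueError("unknown weather value")

print(do_sports_outside("sunny", humidity="high"))  # False
print(do_sports_outside("rainy", wind="weak"))      # True
```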
5. Random Forest
The Random Forest is a Machine Learning algorithm from the field of Supervised Learning. It is a so-called Ensemble model since it is composed of multiple Decision Trees.
Each individual tree makes a prediction for the task at hand, for example, a classification. The Random Forest then combines all the individual decisions and takes the result that is supported by the majority of the trees. But why is it better to have multiple trees predicting the same use case compared to one?
The intention behind that is the so-called wisdom of the crowds. Its idea is that the combined decision of many is usually better than the decision of a single individual. This was famously observed in 1906 at a country fair.
Back then, an ox was shown to a crowd of 800 people who were asked to estimate the ox’s weight. Afterward, the weight was officially determined by weighing the animal. The statistician Francis Galton, analyzing the estimates, found that the median of the crowd’s estimates was only about 1 % away from the actual weight, while no single individual had submitted an estimate that was closer to the correct result. This led him to formulate the wisdom of the crowds, stating that the estimate of the many is better than the estimate of one.
We can apply this to Random Forests by stating that a collection of different Decision Trees will typically outperform any of the single trees.
There is just one condition that needs to be fulfilled: the errors of the individual trees must not be correlated with each other. Only then can the error of a single tree be compensated by the others.
In our weight-estimation example, this means that all estimates should be made individually and without prior discussion. Otherwise, one person could be influenced by another participant’s estimate, their errors would become correlated, and the wisdom of the crowds would no longer apply.
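As a minimal illustration (using a toy data set from scikit-learn, not from the original post), a Random Forest as an ensemble of Decision Trees could look like this:

```python
# Minimal sketch: 100 Decision Trees whose majority vote forms the prediction.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy of the majority vote on unseen data
```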
6. Naive Bayes
With Naive Bayes, we have another basic classification method, which is based on the mathematical theorem known as the Bayes Theorem. However, it can only be used if we assume that all features are independent of each other. This is what makes the algorithm "naive", since this assumption rarely holds in practice.
The Bayes Theorem describes a formula for the conditional probability P(A|B) meaning the probability that A occurs when event B has already occurred (What is the probability that I have Corona (= event A) if my rapid test is positive (= event B)?).
The formula for the Bayes Theorem is:
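P(A|B) = ( P(B|A) · P(A) ) / P(B)
where: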
- P(B|A) = probability that event B occurs if event A has already occurred
- P(A) = probability that event A occurs
- P(B) = probability that event B occurs
In reality, the probability P(A|B) is very hard to find. Using the Corona example, we need a more sophisticated test to find out whether the person is really positive. The inverse probability P(B|A) is a lot easier to find. For our example, it measures the likelihood of a person infected with Corona having a positive rapid test.
We can find this probability by taking persons that have a confirmed infection and letting them perform rapid tests. Afterward, we can calculate the ratio of how many of these tests were actually positive compared to how many tests were taken. Now, we can use the probabilities P(B|A), P(A), and P(B) to calculate the conditional probability P(A|B).
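As a minimal worked example with assumed (hypothetical) numbers, this calculation could look like this:

```python
# Minimal sketch: Bayes Theorem with assumed, illustrative probabilities.
p_positive_given_corona = 0.95  # P(B|A): assumed share of infected people whose rapid test is positive
p_corona = 0.02                 # P(A): assumed share of infected people in the population
p_positive = 0.05               # P(B): assumed share of positive rapid tests overall

p_corona_given_positive = p_positive_given_corona * p_corona / p_positive
print(p_corona_given_positive)  # P(A|B) = 0.38 under these assumptions
```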
For a data set with more than one feature, the approach stays the same: we compute a conditional probability for each combination of feature x and class K. Afterward, all the feature probabilities for one class are multiplied. The class with the highest overall probability is then the prediction.
7. Perceptron
A perceptron is the basic building block of a Neural Network and is part of Supervised Learning. It contains a single neuron calculating the output by using an activation function and the weighted input values.
The underlying concept comes from mathematics and was later adopted in Computer Science and Machine Learning. It tries to mimic the structure of the human brain and is able to learn complex relationships.
A perceptron can have multiple inputs that process numerical values. Each input has a weight stating how important that input is for the final output. The learning process consists of changing these weights so that the outputs are as close as possible to the values from the training set.
The neuron calculates the weighted sum of the input values and their weights. This sum is then passed to the activation function, a special function whose output lies between 0 and 1. In its simplest form, the neuron can only produce a binary output (Yes/No, True/False, Active/Inactive).
In most cases, the so-called sigmoid function is used as the activation function since it ranges from 0 to 1 and has its steepest increase at x = 0. That is why it is a good choice for binary classification.
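As a minimal illustration (with made-up inputs and weights, not from the original post), a single perceptron’s forward pass could look like this:

```python
# Minimal sketch: weighted sum of inputs plus bias, passed through a sigmoid.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))   # maps any value into the range (0, 1)

inputs = [0.5, 1.0, 0.2]      # x_1, x_2, x_3
weights = [0.4, -0.6, 0.9]    # importance of each input
bias = 0.1

weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
output = sigmoid(weighted_sum)
print(output)  # e.g. interpret as "Yes" if the output is above 0.5, otherwise "No"
```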
8. Artificial Neural Network
Artificial Neural Networks are the advancement of single perceptrons and come even closer to the human brain. They are used to solve more difficult classification or regression problems.
The basic neural network uses a multi-layer structure of perceptrons where the output of one perceptron is just the input of another one in the following layer. The basic concepts of the perceptron, however, are still used:
- The weight factors decide the importance of the input for the neuron. High weight means that the input is very important to solve the problem at hand.
- Then, the weighted sum is calculated by multiplying the input values with their weights. Additionally, a bias b is added: z = w1 · x1 + w2 · x2 + … + wn · xn + b.
- Subsequently, the result is given into a so-called activation function.
The activation function is chosen depending on the use case. It can either be the sigmoid function for binary classification or other step functions.
The information passes through the network in different layers:
- Input layer: This contains the inputs from the data set that should be processed by the network.
- Hidden layer(s): One or more so-called hidden layers take the inputs and calculate an output in the way described above. Depending on the architecture, their outputs are passed on to another layer of neurons. These layers are called hidden since they do not produce any visible results but instead pass their outputs on to other neurons.
- Output layer: This layer follows the last hidden layer and calculates the final results of the network.
The architecture is then trained so that the network learns to produce good results for a specific use case. One uses a training data set containing data points and their corresponding results. For each data point, the network calculates its result and compares it to the correct outcome from the data set.
If the calculated result is incorrect, the weights of the individual neurons are adjusted using the backpropagation algorithm. By iteratively repeating this process with the whole data set, the performance of the neural network improves.
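As a minimal illustration (using a toy data set, not from the original post), a small feed-forward network with one hidden layer can be trained via backpropagation with scikit-learn’s MLPClassifier:

```python
# Minimal sketch: one hidden layer, sigmoid activation, trained with backpropagation.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(16,),  # one hidden layer with 16 neurons
                    activation="logistic",     # sigmoid activation, as discussed above
                    max_iter=2000, random_state=0)
net.fit(X_train, y_train)                      # weights are adjusted via backpropagation
print(net.score(X_test, y_test))               # accuracy on unseen data
```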
This is what you should take with you
- Solid knowledge of basic Machine Learning algorithms is essential for every Data Scientist.
- In this article, we introduced frequently used algorithms and briefly explained how they work.
- They form the basis for the more advanced models used today.
_If you like my work, please subscribe here or check out my website Data Basecamp! Also, medium permits you to read 3 articles per month for free. If you wish to have unlimited access to my articles and thousands of great articles, don’t hesitate to get a membership for $5 per month by clicking my referral link:_ https://medium.com/@niklas_lang/membership