Introduction to supervised and unsupervised learning, with an example of polynomial regression implemented in Python

Contents
This post is part of a series of posts that I will be making. You can read a more detailed version of this post on my Substack by clicking [here](https://cookieblues.substack.com/p/what-is-machine-learning). Underneath you can see an overview of the series.
1. Introduction to machine learning
- (a) What is Machine Learning?
- (b) Model selection in machine learning
- (c) The curse of dimensionality
- (d) What is Bayesian inference?
2. Regression
- (a) How linear regression actually works
- (b) How to improve your linear regression with basis functions and regularization
3. Classification
- (a) Overview of Classifiers
- (b) Quadratic Discriminant Analysis (QDA)
- (c) Linear Discriminant Analysis (LDA)
- (d) (Gaussian) Naive Bayes
- (e) Multiclass Logistic Regression using Gradient Descent
A little bit of history
It seems most people derive their definition of machine learning from a 1959 quote by Arthur Lee Samuel: "Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort." The interpretation to take away from this is that "machine learning is the field of study that gives computers the ability to learn without being explicitly programmed."
Machine learning draws a lot of its methods from statistics, but there is a distinctive difference between the two areas: statistics is mainly concerned with estimation, whereas machine learning is mainly concerned with prediction. This distinction makes for great differences, as we will see soon enough.
Categories of machine learning
There are many different machine learning methods that solve different tasks, and putting them all into rigid categories can be quite a task on its own. My posts will cover two fundamental ones: supervised learning and unsupervised learning, which can be further divided into smaller categories as shown in the image above.
It’s important to note that these categories are not strict, e.g. dimensionality reduction isn’t always unsupervised, and you can use density estimation for clustering and classification.
Supervised learning
Supervised learning refers to a subset of machine learning tasks, where we’re given a dataset of N input-output pairs, and our goal is to come up with a function h from the inputs to the outputs. Each input variable is a D-dimensional vector (or a scalar), representing the observation with numerical values. The different dimensions of the input variable are commonly called features or attributes. Likewise, each target variable is most often a scalar.
In classification the possible values for the target variables form a finite number of discrete categories commonly called classes. A classic example is recognizing handwritten digits [1]. Given an image of 28×28 pixels, we can represent each image as a 784-dimensional vector, which will be our input variable, and our target variables will be scalars from 0 to 9 each representing a distinct digit.
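To make the representation concrete, here is a minimal sketch (using NumPy and a made-up random image, since the actual digit data isn’t loaded in this post) of how a 28×28 image becomes a 784-dimensional input vector:

```python
import numpy as np

# A hypothetical 28x28 grayscale digit image (pixel values between 0 and 1)
image = np.random.rand(28, 28)

# Flatten the image into a single 784-dimensional input vector
input_vector = image.reshape(-1)
print(input_vector.shape)  # (784,)
```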
You might’ve heard of regression before. Like classification, we are given a target variable, but in regression it is continuous instead of discrete. An example of regression could be predicting how much a house will be sold for. In this case, the features could be any measurements about the house, the location, and/or what other similar houses have been sold for recently – the target variable is the selling price of the house.
Unsupervised learning
Another subset of machine learning tasks fall under unsupervised learning, where we’re only given a dataset of N input variables. In contrast to supervised learning, we’re not told what we want to predict, i.e., we’re not given any target variables. The goal of unsupervised learning is then to find patterns in the data.
The image of categories above divides unsupervised learning into three subtasks, the first one being clustering, which, as the name suggests, refers to the task of discovering ‘clusters’ in the data. We can define a cluster to be a group of observations that are more similar to each other than to observations in other clusters. Let’s say we had to come up with clusters for a basketball, a carrot, and an apple. Firstly, we could create clusters based on shape, in which case the basketball and the apple are both round, but the carrot isn’t. Secondly, we could cluster by use, in which case the carrot and apple are foods, but the basketball isn’t. Finally, we might cluster by colour, in which case the basketball and the carrot are both orange, but the apple isn’t. All three examples are valid clusterings, but they group the objects by different criteria.
Then we have density estimation, which is the task of fitting probability density functions to the data. It’s important to note that density estimation is often used in conjunction with other tasks like classification, e.g. based on the given classes of our observations, we can use density estimation to find the distribution of each class and thereby (based on the class distributions) classify new observations. An example of density estimation could be finding extreme outliers in data, i.e., finding observations that are highly unlikely to have been generated by the density function fitted to the data.
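As an illustration of the outlier idea, here is a minimal sketch assuming SciPy and a small made-up dataset: a normal density is fitted to the data, and observations with very low density under it are flagged:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-dimensional measurements with one extreme value at the end
data = np.array([4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 9.7])

# Density estimation: fit a normal distribution to the data
mu, sigma = norm.fit(data)

# Observations with very low density under the fitted distribution are
# candidate outliers (the threshold is picked by hand for this toy data)
densities = norm.pdf(data, mu, sigma)
print(data[densities < 0.05])  # flags the extreme value 9.7
```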
Finally, dimensionality reduction, as the name suggests, reduces the number of features of the data that we’re dealing with. Just like density estimation, this is often done in conjunction with other tasks. Let’s say we were going to do a classification task, and our input variables have 50 features – if we could do the same task equally well after reducing the number of features to 5, we could save a lot of time on computation.
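As a sketch of what that could look like in practice (assuming scikit-learn and random placeholder data; PCA is just one common dimensionality reduction method, not one this post prescribes):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 200 observations with 50 features each
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))

# Reduce the 50 features to 5 components before, say, running a classifier
pca = PCA(n_components=5)
reduced = pca.fit_transform(data)
print(reduced.shape)  # (200, 5)
```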
Example: polynomial regression
Let’s go through an example of machine learning, which will also help us get familiar with the terminology. We’re going to implement a model called polynomial regression, where we try to fit a polynomial to our data.
Given a training dataset of N 1-dimensional input variables x with corresponding target variables t, our objective is to fit a polynomial that yields values for future target variables given new input variables. We’ll do this by estimating the coefficients of the polynomial
$$h(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{m=0}^{M} w_m x^m, \tag{1}$$
which we refer to as the parameters or weights of our model. M is the order of our polynomial, and w denotes all our parameters, i.e., we have M+1 parameters for our _M_th order polynomial.
Now, the objective is to estimate the ‘best’ values for our parameters. To do this, we define what is called an objective function (also sometimes called error or loss function). We construct our objective function such that it outputs a value that tells us how our model is performing. For this task, we define the objective function as the sum of the squared differences between the predictions of our polynomial and the corresponding target variables, i.e.
$$E(\mathbf{w}) = \sum_{n=1}^{N} \big( h(x_n, \mathbf{w}) - t_n \big)^2,$$
and if we substitute h with the right-hand side of (1), we get
$$E(\mathbf{w}) = \sum_{n=1}^{N} \left( \sum_{m=0}^{M} w_m x_n^m - t_n \right)^2. \tag{2}$$
Let’s take a minute to understand what (2) is saying. The term in the parentheses on the right-hand side is commonly called the _n_th residual. It’s the difference between the output of our polynomial for some input variable and its corresponding target variable. The difference can be both negative and positive, depending on whether the output of our polynomial is lower or higher than the target. We therefore square these differences and add them all up in order to get a single value that tells us how our polynomial is performing.
This objective function is called the residual sum of squares or sum of squared residuals and is often used as a way to measure the performance of regression models in machine learning. The image below shows the differences between the polynomial that we’re estimating and the data we’re given. These differences are the errors (or residuals) that the objective function squares and sums.

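To make the objective function concrete, here is a minimal sketch with made-up predictions and targets (not the dataset used later in the post):

```python
import numpy as np

# Hypothetical target values and polynomial predictions for four points
targets = np.array([1.0, 2.0, 3.0, 4.0])
predictions = np.array([1.1, 1.8, 3.3, 3.9])

# Residual sum of squares: square each residual and add them all up
residuals = predictions - targets
rss = np.sum(residuals**2)
print(rss)  # approximately 0.15
```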
So far, so good! Since the objective function tells us how well we’re doing (the lower it is, the better), we will try to find its minimum. To find the minimum of a function, we take the derivative, set it equal to 0, and solve for our parameters. Since we have several parameters, we’ll take the partial derivative of E with respect to the _i_th parameter, set it equal to 0, and solve for it. This gives us a linear system of M+1 equations with M+1 unknowns (our parameters w). We’ll go over the derivation of the solution in the next post; for now we’ll just take it as given (a sketch of the condition each partial derivative yields is shown below equation (4)). The solution to the system of equations is
$$\hat{\mathbf{w}} = \left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{t}, \tag{3}$$
where t denotes all our target variables as a column vector, and X is called the design matrix and is defined as
$$\mathbf{X} = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^M \\ 1 & x_2 & x_2^2 & \cdots & x_2^M \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^M \end{pmatrix}. \tag{4}$$
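For reference, here is a sketch of the condition we get from setting the _i_th partial derivative of (2) to zero (the full derivation is deferred to the next post):

$$\frac{\partial E}{\partial w_i} = 2 \sum_{n=1}^{N} \left( \sum_{m=0}^{M} w_m x_n^m - t_n \right) x_n^i = 0, \qquad i = 0, 1, \dots, M,$$

and stacking these M+1 linear equations gives exactly the system that (3) solves in matrix form.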
To sum up: we’re given N pairs of input and target variables, and we want to fit a polynomial of the form (1) to the data such that the value of our polynomial h is as close to the targets as possible. We do this by finding values for the parameters w that minimize the objective function defined in (2), and the solution to this is given in (3).
Python implementation of polynomial regression
Let’s try and implement our model! We’ll start with the dataset shown underneath, where x contains our input variables and t our target variables.
import numpy as np
x = np.array([-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1])
t = np.array([-4.9, -3.5, -2.8, 0.8, 0.3, -1.6, -1.3, 0.5, 2.1, 2.9, 5.6])
To begin with, we can define the order of our polynomial, find the number of data points, and then set up our design matrix.
M = 4
N = len(x)
X = np.zeros((N, M+1))
If we look at the definition of the design matrix in (4), we can fill out the columns of our design matrix with the following for-loop.
for m in range(M+1):
    X[:, m] = x**m
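As a side note, NumPy has a built-in helper for Vandermonde matrices that should produce the same design matrix in one line (just a sketch of an alternative, not what the post uses):

```python
# Columns are x**0, x**1, ..., x**M, matching the definition in (4)
X = np.vander(x, M+1, increasing=True)
```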
Now we can find the parameters with the solution in (3).
w = np.linalg.inv(X.T @ X) @ X.T @ t
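A brief design note: explicitly inverting X.T @ X works fine here, but it can become numerically fragile for higher polynomial orders. A least-squares solver is a common, more stable alternative (shown only as a sketch):

```python
# Solve the same least-squares problem without forming the explicit inverse
w, *_ = np.linalg.lstsq(X, t, rcond=None)
```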
Using NumPy’s poly1d function, we can generate outputs for our polynomial.
h = np.poly1d(np.flip(w, 0))
x_ = np.linspace(-1, 1, 100)  # evaluate over the range of our data
t_ = h(x_)
Now we can plot our estimated polynomial with our data points. I’ve also plotted the true function that the points were generated from.
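A minimal plotting sketch with Matplotlib could look like the following (the true generating function isn’t reproduced here, so only the data and the fitted polynomial are drawn):

```python
import matplotlib.pyplot as plt

# Scatter the training data and draw the fitted polynomial on top
plt.scatter(x, t, label="data")
plt.plot(x_, t_, color="tab:orange", label="fitted polynomial")
plt.xlabel("x")
plt.ylabel("t")
plt.legend()
plt.show()
```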

Summary
- Machine learning studies how to make computers learn without being explicitly programmed, with the main goal of making predictions.
- Supervised learning refers to machine learning tasks, where we are given labeled data, and we want to predict those labels.
- Unsupervised learning, as it suggests, refers to tasks, where we are not provided with labels for our data.
- Features refer to the attributes (usually columns) of our data e.g. height, weight, shoe size, etc., if our observations are humans.
- Classification and regression are supervised tasks; clustering, density estimation, and dimensionality reduction are unsupervised tasks.
- Parameters refer to the values we want to estimate in a machine learning model.
- The process of estimating the values of the parameters is called the training or learning process.
- An objective function is a measure of the performance of our model.
References
[1] Y. LeCun et al., "Gradient-based learning applied to document recognition," 1998.