Learn AI Today

Learn AI Today 01: Getting started with PyTorch

Defining and training a PyTorch model and visualizing the results dynamically

Miguel Pinto
Towards Data Science
11 min read · Jul 19, 2020


Photo by Jukan Tateisi on Unsplash.

This is the first story in the Learn AI Today series I’m creating! These stories, or at least the first few, are based on a series of Jupyter notebooks I’ve created while studying/learning PyTorch and Deep Learning. I hope you find them as useful as I did!

What you will learn in this story:

  • How to Create a PyTorch Model
  • How to Train Your Model
  • Visualize the Training Progress Dynamically
  • How the Learning Rate Affects the Training

1. Linear Regression in PyTorch

Linear regression is a problem that you are probably familiar with. In its most basic form it is no more than fitting a line to a set of points.

1.1 Introducing the Concepts

Consider the mathematical expression of a line:

y = wx + b

w and b are the two parameters or weights of this linear model. In machine learning, it is common to use w to refer to the weights and b to refer to the bias parameter.

In machine learning, when we are training a model we are basically finding the optimal parameters w and b for a given set of input/target (x, y) pairs. After the model is trained we can compute the model estimates. The expression will now look like

ye = wx + b

where I changed the name of y to ye (y estimate) because the solution will not be exact.

The Mean Square Error (MSE) is simply mean((ye-y)²) — the mean of the squared deviations between targets and estimates. For a regression problem, you can indeed minimize the MSE in order to find the best w and b.

The idea of linear regression can be generalized using matrix notation to allow for multiple inputs and targets. If you want to learn more about the exact mathematical solution to the regression problem, you can search for the Normal Equation.

1.2 Defining the Model

The PyTorch nn.Linear class is all you need to define a linear model with any number of inputs and outputs. For our basic example of fitting a line to a set of points, consider the following model:
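A minimal pure-PyTorch sketch of such a model (using nn.Module and super().__init__(), as described in the note below) might look like this:

```python
from torch import nn

class LinearRegression(nn.Module):
    def __init__(self, n_inputs, n_outputs):
        super().__init__()  # not needed if you use fastai's Module instead
        self.linear = nn.Linear(n_inputs, n_outputs)  # a single linear layer

    def forward(self, x):
        return self.linear(x)
```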

Note: I’m using Module from the fastai library as it makes the code cleaner. If you want to use pure PyTorch you should use nn.Module instead and add super().__init__() in the __init__ method; fastai’s Module does that for you.

If you are familiar with Python classes, the code is self-explanatory. If not, consider doing some study before diving into PyTorch. There are many online tutorials and lessons covering the topic.

Back to the code. In the __init__ method, you define the layers of the model; in this case, it is just one linear layer. Then, the forward method is the one that is called when you call the model, similar to the __call__ method in normal Python classes.

Now you can define an instance of your LinearRegression model as model = LinearRegression(1, 1) indicating the number of inputs and outputs.

Maybe you are now asking why I don’t simply do model = nn.Linear(1, 1), and you are absolutely right. The reason I go to the trouble of defining a LinearRegression class is just so it can serve as a template for future improvements, as you will see later.

1.3 How to Train Your Model

The training process is based on a sequence of 4 steps that repeat iteratively:

  • Forward pass: The input data is given to the model and the model outputs are obtained — outputs = model(inputs)
  • The loss function is computed: For the purpose of the linear regression problem, the loss function we are using is the mean squared error (MSE). We often refer to this function as the criterion — loss = criterion(outputs, targets)
  • Backward pass: The gradients of the loss function with respect to each learnable parameter are computed. Remember that we want to reduce the loss function to make the outputs close to the targets. The gradients tell you how the loss changes if you increase or decrease each parameter — loss.backward()
  • Update parameters: Update the value of the parameters by a small amount in the direction that reduces the loss. The method to update the parameters can be as simple as subtracting the value of the gradient multiplied by a small number. This number is referred to as the learning rate, and the optimizer I just described is the Stochastic Gradient Descent (SGD) optimizer — optimizer.step()

I haven’t defined the criterion and the optimizer exactly yet, but I will in a minute. This is just to give you a general overview and understanding of the steps in a training iteration or, as it is usually called, a training epoch.

Let’s define our fit function that will do all the required steps:
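A minimal sketch of such a fit function, assuming the model, criterion, and optimizer are defined as shown further below, could be:

```python
def fit(epochs, model, criterion, optimizer, inputs, targets):
    losses = []
    for epoch in range(epochs):
        optimizer.zero_grad()               # reset the gradients from the previous epoch
        outputs = model(inputs)             # forward pass
        loss = criterion(outputs, targets)  # compute the loss
        loss.backward()                     # backward pass: compute the gradients
        optimizer.step()                    # update the parameters
        losses.append(loss.item())          # save the loss value for this epoch
    return losses
```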

Notice that there’s an extra step I didn’t mention before — optimizer.zero_grad(). This is because, by default in PyTorch, the gradients are accumulated (added up) every time you call loss.backward(). If you don’t set them to zero at each epoch they will keep adding up, and that’s not desirable — unless you are doing gradient accumulation, but that’s a more advanced topic. Besides that, as you can see in the code above, I’m saving the value of the loss at each epoch. We should expect it to drop steadily, meaning that the model is getting better at predicting the targets.

As I mentioned above, for linear regression the criterion usually used is the MSE. As for the optimizer, nowadays I always use Adam as my first choice. It’s fast and it should work well for most problems. I won’t go into details about how Adam works for now but the idea is always to find the best solution in the least amount of time.

Let’s now move on to creating an instance of our LinearRegression model, defining our criterion and our optimizer:
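As a sketch (the learning rate value here is just an illustrative choice):

```python
from torch import nn, optim

model = LinearRegression(1, 1)
criterion = nn.MSELoss()                            # mean squared error
optimizer = optim.Adam(model.parameters(), lr=0.1)  # lr=0.1 is an example value
```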

model.parameters() is the way to give the optimizer the list of trainable parameters and lr is the learning rate.

Now let’s create some data and train the model!
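Something along these lines generates the data described below (the exact noise scaling and the random seed are my own choices):

```python
torch.manual_seed(42)                        # just for reproducibility
x = torch.rand(10000)                        # 10000 random points
noise = 0.3 * x * torch.randn(10000)         # noise that grows with x (exact scaling is a guess)
x_train = x.unsqueeze(-1)                    # shape [10000] -> [10000, 1]
y_train = (2 * x + 1 + noise).unsqueeze(-1)  # targets following y = 2x + 1 + noise
```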

The data is simply a set of points following the model y = 2x + 1 + noise. To make it a little more interesting, I make the noise larger for larger values of x. The unsqueeze(-1) in lines 4 and 5 just adds an extra dimension at the end of the tensor (from [10000] to [10000, 1]). The data is the same, but the tensor needs to have this shape, meaning that we have 10000 samples and 1 feature per sample.

Plotting the data, the result is the image below, where you can see the true model and the input data + noise.

Input data for the linear regression model. Image by the author.

And now to train the model we just run our fit function!
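For instance, running the fit function sketched above for 100 epochs and plotting the resulting loss values:

```python
import matplotlib.pyplot as plt

losses = fit(100, model, criterion, optimizer, x_train, y_train)

plt.plot(losses)            # loss value at each epoch
plt.xlabel('epoch')
plt.ylabel('MSE loss')
plt.show()
```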

After training, we can plot the evolution of the loss during the 100 epochs. As you can see in the image below, the loss starts at about 2.0 and then drops steeply down to nearly zero. This is to be expected: when we start, the model parameters are randomly initialized, and as training progresses they converge to the solution.

Evolution of the loss (MSE) for the 100 epochs of training. Image by the author.

Note: Try playing with the learning rate value to see how it affects the training!

To check the parameters of the trained model, you can run list(model.parameters()) after training the model. You will see that they are very close to 2.0 and 1.0 for this example, since the true model is y = 2x + 1.

You can now compute the model estimates — ye = model(x_train). (Notice that before computing the estimates you should always run model.eval() to set the model to evaluation mode. It won’t make a difference for this simple model but later it will, when we start using Batch Normalization and Dropout.)
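For example (the no_grad context is standard PyTorch practice for inference, not something the article shows explicitly):

```python
model.eval()              # set the model to evaluation mode
with torch.no_grad():     # no need to track gradients for inference
    ye = model(x_train)
```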

Plotting the prediction, you can see that it matches the true data almost perfectly, despite the fact that the model could only see the noisy data.

Visualizing the model estimates. Image by the author.

2. Stepping Up to Polynomial Regression

Now that we made it work for the simple case, moving to a more complex linear model is remarkably simple. The first step is of course to generate such input data. For this example, I considered the model y = 3x² + 2x + 1 + noise as follows:
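A possible sketch of this data generation (the noise level is again a guess on my part):

```python
x = torch.rand(1000)
noise = 0.3 * torch.randn(1000)                          # exact noise level is a guess
y_train = (3 * x**2 + 2 * x + 1 + noise).unsqueeze(-1)   # targets, shape [1000, 1]
x_train = torch.stack([x, x**2], dim=-1)                 # two features, x and x²: shape [1000, 2]
```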

Input data for the polynomial model. Image by the author.

Notice that this time the input shape is [1000, 2] since we have 2 features corresponding to x and x². That’s how you fit a polynomial using linear regression!

The only difference now, compared to the previous example, is that the model needs to have two inputs — model = LinearRegression(2,1) . That’s it! You can now follow the exact same steps to train the model.

Let’s, however, make things a little more fun with some dynamical visualizations!

2.1 Visualize the Training Progress Dynamically

To animate the evolution of training, we need to update the fit function so that it also stores the values of the model estimates at each epoch.
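A sketch of this updated function, here called fit2, might look like this (it returns both the loss history and the predictions saved at every epoch):

```python
def fit2(epochs, model, criterion, optimizer, inputs, targets):
    losses, preds_history = [], []
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
        # detach the predictions from the computation graph before converting to NumPy
        preds_history.append(outputs.detach().numpy())
    return losses, preds_history
```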

You may have noticed a ‘new word’ in the code — detach(). This tells PyTorch to detach the variable from the gradient computation graph (it will no longer compute the gradients for that detached variable). If you try to convert the tensor to NumPy before detaching, it will give you an error.

Moving on, you can repeat the same process to train the model as before. The only difference is that the fit2 function will also return the model estimates for each epoch of training.
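For example (the learning rate and number of epochs here are illustrative choices):

```python
model = LinearRegression(2, 1)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)   # illustrative learning rate

losses, preds_history = fit2(100, model, criterion, optimizer, x_train, y_train)
```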

To create a video/gif of the training take a look at the following code:
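A sketch of such an animation cell, assuming x, y_train and preds_history come from the code above (to_html5_video needs ffmpeg installed; to_jshtml is an alternative):

```python
%%capture
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation

x_np = x.numpy()
order = np.argsort(x_np)      # sort by x so the predictions draw as a line

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x_np, y_train.squeeze().numpy(), s=2, alpha=0.3, color='0.6', label='noisy data')
pred_line, = ax.plot([], [], color='tab:blue', label='model prediction')  # starts empty, updated each frame
ax.legend()

def animate(epoch):
    pred_line.set_data(x_np[order], preds_history[epoch].squeeze()[order])
    ax.set_title(f'epoch {epoch}')
    return pred_line,

anim = animation.FuncAnimation(fig, animate, frames=len(preds_history), interval=50)
```

And in the next cell the animation is rendered:

```python
from IPython.display import HTML

HTML(anim.to_html5_video())
```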

The %%capture tells Jupyter to suppress the output of the cell, as we will be displaying the video in the next cell. Then I set up the plot as usual. The difference is in the line for the model predictions: I initialize it as empty and then iteratively update it using matplotlib.animation to generate the animation. Finally, the video can be rendered using HTML from IPython.display. Look at the result below!

Visualizing model predictions during training. Animation by the author.

It’s interesting that the blue line initially curves very fast to the correct shape and then converges more slowly to the final solution!

Note: Try playing with the learning rate, different optimizers and anything else you can think of, and see the effect on the optimization. It’s a good way to get some intuition for how the optimization works!

3. Neural Network Model

The examples above are interesting for learning and experimenting. However, in practice your data is often not generated from a polynomial, or at least you don’t know what the terms of the polynomial are. A nice thing about neural networks is that you don’t need to worry about that!

Let’s start by defining the model, which I named GeneralFit:
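A pure-PyTorch sketch of the GeneralFit model (the hidden size is a free choice, left here as a constructor argument):

```python
import torch.nn.functional as F
from torch import nn

class GeneralFit(nn.Module):
    def __init__(self, n_inputs, n_hidden, n_outputs):
        super().__init__()
        self.linear1 = nn.Linear(n_inputs, n_hidden)
        self.linear2 = nn.Linear(n_hidden, n_hidden)
        self.linear3 = nn.Linear(n_hidden, n_outputs)

    def forward(self, x):
        x = F.relu(self.linear1(x))   # ReLU after the first linear layer
        x = F.relu(self.linear2(x))   # ReLU after the second linear layer
        return self.linear3(x)        # no activation on the output layer
```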

There are some new aspects to consider in this model. There are 3 linear layers and, as you can see in the forward method, after the first two linear layers a ReLU activation function — F.relu() — is applied. ReLU stands for Rectified Linear Unit and it simply sets all negative values to zero. This apparently trivial operation is, however, enough to make the model non-linear.

Notice that a linear layer is essentially just a matrix multiplication. If you have 100 linear layers one after the other, linear algebra tells you that there is a single linear layer that performs the same operation: that single layer is simply the product of the 100 matrices. However, when you introduce a non-linear activation function this changes completely. Now you can keep adding more linear layers interlaced with non-linear activations such as ReLU (the most common in recent models).

A Deep Neural Network is no more than a Neural Network with several ‘hidden’ layers. Looking back at the code above, you can, for example, try adding more ‘hidden’ layers and training the model. And indeed, you can call that Deep Learning. (Note that ‘hidden layer’ is just the traditional name for any layer between the input and output layers.)

Using the above model and a new set of generated data I obtained the following training animation:

Visualizing model predictions for GeneralFit model during training with a learning rate of 0.01. Animation by the author.

For this example, I trained for 200 epochs with a learning rate of 0.01. Let’s try to set the learning rate to 1.

Visualizing model predictions for GeneralFit model during training with a learning rate of 1. Animation by the author.

Clearly this is not good! When the learning rate is too high the model may not converge properly to a good solution or may even diverge. If you set the learning rate to 10 or 100 it won’t go anywhere.

Homework

I can show you a thousand examples, but you will learn more if you make one or two experiments by yourself! The complete code for the experiments I showed you is available in this notebook.

  • Try to play with the learning rate, number of epochs, number of hidden layers and the size of the hidden layers;
  • Also try the SGD optimizer and play with the learning rate and maybe also with the momentum (I didn’t cover it in this story, but now that you know about it you can do some research);

If you create interesting notebooks with nice animations as a result of your experiments, go ahead and share them on GitHub or Kaggle, or write a Medium story about them!

Final remarks

This ends the first story in the Learn AI Today series!

Feel free to give me some feedback in the comments. What did you find most useful or what could be explained better? Let me know!

Next story in this series:

You can read more about my journey on the following stories!

Thanks for reading! Have a great day!
