
Cars Efficiency Predictions with PyTorch

Learn how to build an entire deep learning pipeline in PyTorch

Marcello Politi
Towards Data Science
5 min read · Oct 18, 2022


Introduction

It is no secret that the price of petrol has skyrocketed in the last few months. People are filling up with only as much gasoline as necessary, both for cost reasons and for environmental ones. But have you ever noticed that when you look up on the Internet how much gas your car should use to get from point A to point B, the numbers almost never match reality?

In this article, I want to develop a simple model that can predict the efficiency (or consumption) of a car, measured as the miles it travels on a single gallon (MPG).

The goal is to touch on all the main steps of the pipeline: data processing, feature engineering, training, and evaluation.

All of this will be done in Python using PyTorch. In particular, I will be relying on a Colab Notebook which I always find very convenient for these small projects! 😄

Dataset

The dataset we will use in this project is by now a classic: the Auto MPG dataset from the UCI repository. It consists of 9 features and contains 398 records. Specifically, the variable names and their types are as follows:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

Let’s code!

First, we load the dataset and rename the columns appropriately. The na_values argument is used to make pandas recognize that values of type ‘?’ should be treated as null.
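A minimal sketch of this loading step, assuming the raw auto-mpg.data file from the UCI repository (the column names are an illustrative choice):

```python
import pandas as pd

# The UCI file is whitespace-separated with no header row; the car name
# comes after a tab, so comment="\t" drops it. na_values="?" makes pandas
# treat '?' entries (the missing horsepower values) as NaN.
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
column_names = ["MPG", "Cylinders", "Displacement", "Horsepower",
                "Weight", "Acceleration", "Model_Year", "Origin"]

df = pd.read_csv(url, names=column_names, na_values="?",
                 comment="\t", sep=" ", skipinitialspace=True)
```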

Now use df.head() to display the dataset.

df.head()

With the df.describe() function we can display some basic statistics of the dataset to begin to understand what values we will find.

df.describe()

Alternatively, we can use df.info() to check whether there are any null values and to find out the types of our variables.

df.info() (Image By Author)

We first see that the Horsepower feature contains null values, so we delete the records containing those values from the dataset and reset the dataframe index, as follows.
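A sketch of this cleanup, working on the dataframe df from above:

```python
# Drop the rows containing NaN (the missing Horsepower values)
# and re-index the dataframe so the index is contiguous again
df = df.dropna()
df = df.reset_index(drop=True)
```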

If we now print len(df) we get 392, because we have eliminated some rows.

The next thing to do is split the dataset into a train set and a test set. We use a very handy sklearn function to do this. Let’s also save the df_train.describe().transpose() table, since we will need some of its statistics to preprocess certain features.
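A sketch of the split (the 80/20 ratio and the seed are illustrative choices):

```python
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, train_size=0.8, random_state=1)

# Keep the training statistics around: we will need the mean and std
# of each column to normalize the numerical features later
train_stats = df_train.describe().transpose()
```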

train_stats:

Numerical Features

We are now going to process some features. Numeric variables are often treated differently from categorical ones, so we start by defining the numerical variables that we are going to normalize.

To normalize a feature, all we need to do is subtract the feature’s mean and divide by its standard deviation, which is where the statistics extracted earlier come in.
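A sketch of the normalization, using the train_stats table saved earlier (the exact list of numeric columns is an assumption):

```python
numeric_cols = ["Cylinders", "Displacement", "Horsepower",
                "Weight", "Acceleration"]

df_train_norm, df_test_norm = df_train.copy(), df_test.copy()
for col in numeric_cols:
    mean = train_stats.loc[col, "mean"]
    std = train_stats.loc[col, "std"]
    # Use the *training* statistics for both splits to avoid data leakage
    df_train_norm[col] = (df_train_norm[col] - mean) / std
    df_test_norm[col] = (df_test_norm[col] - mean) / std
```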

If we now plot the normalized features against those in the original dataset, you will notice how the values have changed as a result of standardization: they now have a mean of zero and a standard deviation of one.

Standardization (Image By Author)

Now, regarding the Model_Year feature, we are not really interested in the exact year in which a particular car model was made; we are more interested in intervals, or bins. For example, a car is of type 1 if the model was made between ’73 and ’76. These ranges are a bit arbitrary: you might try different ones to see which work best, as in the sketch below.

Bins ranges (Image By Author)
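One way to build such bins is torch.bucketize; the boundaries below are the illustrative ones mentioned above:

```python
import torch

# Hypothetical bin edges: Model_Year is mapped to one of four buckets (0-3)
boundaries = torch.tensor([73, 76, 79])

v = torch.tensor(df_train_norm["Model_Year"].values)
df_train_norm["Model_Year_Bucketed"] = torch.bucketize(v, boundaries, right=True)

v = torch.tensor(df_test_norm["Model_Year"].values)
df_test_norm["Model_Year_Bucketed"] = torch.bucketize(v, boundaries, right=True)
```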

Categorical Features

As far as categorical features are concerned, we have basically two main approaches. The first is to use one-hot vectors to transform the categories (strings) into binary vectors containing a single 1. For example, category 0 will be encoded as [1,0,0,0], category 1 as [0,1,0,0], and so on.

Alternatively, we can use an embedding layer, which maps each category to a ‘random’ trainable vector, so that we obtain a vector representation of the categories that preserves a lot of information.

When the number of categories is large, using embeddings of limited size can bring great advantages.

In this case, we use one-hot encoding.
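A sketch of the encoding for the Origin column (which takes the values 1, 2, 3), followed by the concatenation of all the features into the input tensors:

```python
from torch.nn.functional import one_hot

total_origin = len(set(df_train_norm["Origin"]))

# Origin is in {1, 2, 3}: the modulo maps it to {1, 2, 0} so the values
# can be used directly as class indices by one_hot
origin_train = one_hot(torch.from_numpy(df_train_norm["Origin"].values) % total_origin)
origin_test = one_hot(torch.from_numpy(df_test_norm["Origin"].values) % total_origin)

feature_cols = numeric_cols + ["Model_Year_Bucketed"]
x_train = torch.cat([torch.tensor(df_train_norm[feature_cols].values).float(),
                     origin_train.float()], 1)
x_test = torch.cat([torch.tensor(df_test_norm[feature_cols].values).float(),
                    origin_test.float()], 1)
```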

And let’s also extract the labels we have to predict.
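For example:

```python
# The target is the MPG column itself
y_train = torch.tensor(df_train_norm["MPG"].values).float()
y_test = torch.tensor(df_test_norm["MPG"].values).float()
```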

PyTorch Dataset & DataLoader

Now that our data are ready, we create a dataset to better manage our batches during training.
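A minimal version using the built-in TensorDataset and DataLoader (the batch size and seed are arbitrary):

```python
from torch.utils.data import DataLoader, TensorDataset

train_ds = TensorDataset(x_train, y_train)

torch.manual_seed(1)
train_dl = DataLoader(train_ds, batch_size=8, shuffle=True)
```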

Model Creation

We construct a small network with two hidden layers, one of 8 and one of 4 neurons.

Model (Image By Author)
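A sketch of such a model as an nn.Sequential stack (the ReLU activations are an assumption):

```python
import torch.nn as nn

hidden_units = [8, 4]
input_size = x_train.shape[1]

# Build input -> Linear(8) -> ReLU -> Linear(4) -> ReLU -> Linear(1)
all_layers = []
for hidden_unit in hidden_units:
    all_layers.append(nn.Linear(input_size, hidden_unit))
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
all_layers.append(nn.Linear(hidden_units[-1], 1))

model = nn.Sequential(*all_layers)
```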

Training

Now we define the loss function: we will use MSE, with stochastic gradient descent as the optimizer.
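A sketch of the training loop (the learning rate and epoch count are illustrative choices):

```python
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

num_epochs = 200
for epoch in range(num_epochs):
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]   # flatten the (batch, 1) output
        loss = loss_fn(pred, y_batch)
        loss.backward()               # compute gradients
        optimizer.step()              # update the weights
        optimizer.zero_grad()         # reset gradients for the next batch
```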

To predict new data points, we can feed the test data to the model.
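For instance:

```python
# No gradients are needed at inference time
with torch.no_grad():
    pred = model(x_test)[:, 0]
    print(f"Test MSE: {loss_fn(pred, y_test).item():.4f}")
```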

Final Thoughts

In this short article, we saw how we can use PyTorch to tackle a real-life problem. We started by doing some EDA to understand what kind of dataset we had on our hands. Then I showed you how to treat numeric variables differently from categorical variables in the preprocessing phase; the technique of splitting column values into bins is widely used. We also saw how PyTorch allows us to create, in very few steps, a custom dataset that we can iterate over batch by batch. The model we created is a very simple one with few layers; however, using the right loss function and a proper optimizer allowed us to train our network quickly. I hope you found this article useful for discovering (or reviewing) some PyTorch features.

The End

Marcello Politi

Linkedin, Twitter, CV
