Deep Learning Tabular Data with PyTorch

Offir Inbar
Towards Data Science
6 min read · Feb 7, 2020


This post is a detailed end-to-end guide to using PyTorch for tabular data, built around a realistic example. By the end of this post, you will be able to build your own PyTorch model.

A few things before we start:

Courses: I started with both fast.ai courses and the DeepLearning.ai specialization (Coursera). They gave me the basics of deep learning. The great Stanford cs231n is also highly recommended.

It’s very easy to keep watching more and more courses. I think that the most important thing is to be “hands-on”: write the code! Start a project or try to tackle a Kaggle competition.

  • Use Python’s set_trace() to fully understand each step.
  • The full code is available here.

The Data

I chose to work on the New York City Taxi Fare Prediction competition from Kaggle, where the mission is to predict a rider’s taxi fare. Note that it’s a regression problem. You can find more details and the full dataset here.

The training data contains more than 2 million samples (5.31 GB). To minimize training time, I took a random subset of 100k training samples.

import pandas
import random

filename = r"C:\Users\User\Desktop\offir\study\computer learning\Kaggle_comp\NYC Taxi\train.csv"
n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
s = 100000  # desired sample size
skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be in the skip list
df = pandas.read_csv(filename, skiprows=skip)
df.to_csv("temp.csv")

GPU

I wrote my code using the free Google Colab.

To use the GPU: Runtime -> Change runtime settings -> Hardware accelerator -> GPU.

Code

Import relevant libraries
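
The repo pins down the exact imports; a typical set for this walkthrough (the names below are what the later snippets assume) is:

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Use the Colab GPU when it is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")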

After running the following command, you need to upload the CSV file from your computer. Check that the CSV file you are uploading is named sub_train.

Also upload the test set.
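
In Colab, the upload itself looks roughly like this (the file names here are assumptions; match them to your own files):

from google.colab import files

files.upload()  # choose sub_train.csv from your computer
files.upload()  # run again and choose test.csv

df_train = pd.read_csv("sub_train.csv")
df_test = pd.read_csv("test.csv")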

Data preprocessing

The next step is to delete all fares that are less than 0 (they don’t make sense).
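
A one-liner does it, assuming the fare column keeps its original Kaggle name, fare_amount:

# Keep only rows with a non-negative fare.
df_train = df_train[df_train["fare_amount"] >= 0]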

The length of df_train is now 99,990. It’s very important to keep track of the types and lengths of your different datasets at every step.

Stack the train and test sets so that they undergo the same preprocessing.
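
A sketch of the stacking step; n_train remembers where to split the combined frame back apart later:

# The target is dropped from the features before stacking (see below).
n_train = len(df_train)
df = pd.concat([df_train.drop(columns=["fare_amount"]), df_test],
               sort=False, ignore_index=True)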

The goal is to predict the fare amount, so it was dropped from the train_X data frame.

Moreover, I chose to predict the log of the price while training; the explanation is out of the scope of this blog post.
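
Something along these lines (np.log assumes strictly positive fares; with zero fares you would switch to np.log1p):

# Train on log(fare); predictions are mapped back with np.exp later.
Y = np.log(df_train["fare_amount"].values)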

Feature engineering

Let’s do some feature engineering.

Define the haversine_distance function and add a DateTime column to derive useful statistics. You can see the full process in the GitHub repo.
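
The repo has the full version; here is a condensed sketch of both steps (the fixed -4 hour shift to New York time is a simplification that ignores daylight saving):

def haversine_distance(df, lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) column pairs."""
    r = 6371  # average radius of Earth in km
    phi1, phi2 = np.radians(df[lat1]), np.radians(df[lat2])
    delta_phi = np.radians(df[lat2] - df[lat1])
    delta_lambda = np.radians(df[lon2] - df[lon1])
    a = (np.sin(delta_phi / 2) ** 2
         + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

df["dist_km"] = haversine_distance(df, "pickup_latitude", "pickup_longitude",
                                   "dropoff_latitude", "dropoff_longitude")

# Parse the timestamp (dropping the trailing " UTC") and derive features.
df["EDTdate"] = pd.to_datetime(df["pickup_datetime"].str[:19]) - pd.Timedelta(hours=4)
df["Hour"] = df["EDTdate"].dt.hour
df["AMorPM"] = np.where(df["Hour"] < 12, "am", "pm")
df["Weekday"] = df["EDTdate"].dt.strftime("%a")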

After this stage, the DataFrame contains the new distance and date/time columns alongside the original ones.

Prepare the model

Define the categorical and continuous columns and take only the relevant ones.

Cast the categorical columns to type “category” and label-encode them.
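
For example (the column lists are my assumptions about what is “relevant”; .cat.codes does the label encoding in one step):

cat_cols = ["Hour", "AMorPM", "Weekday"]
cont_cols = ["pickup_latitude", "pickup_longitude", "dropoff_latitude",
             "dropoff_longitude", "passenger_count", "dist_km"]

# Cast to "category" dtype; .cat.codes label-encodes each column to integers.
for col in cat_cols:
    df[col] = df[col].astype("category").cat.codes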

Define the embedding size for the categorical columns. The rule of thumb for determining the embedding size is to divide the number of unique entries in each column by 2, but not to exceed 50.
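
In code:

# Half the cardinality, capped at 50.
cat_dims = [int(df[col].nunique()) for col in cat_cols]
emb_szs = [(size, min(50, (size + 1) // 2)) for size in cat_dims]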

Now let’s deal with the continuous variables. Before normalizing them, it’s important to split the train and test sets apart again to prevent data leakage.
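
A sketch, reusing n_train from the stacking step; the variable names X and test_X are my own:

X, test_X = df[:n_train].copy(), df[n_train:].copy()

# Normalize both sets with statistics computed on the training set only.
means, stds = X[cont_cols].mean(), X[cont_cols].std()
X[cont_cols] = (X[cont_cols] - means) / stds
test_X[cont_cols] = (test_X[cont_cols] - means) / stds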

train-valid split

Split into training and validation sets. In this case, the validation set is 20% of the total training set.

X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.20, random_state=42, shuffle=True)

After this step, it’s important to take a look at the different shapes.

The Model

Currently, our data is stored in pandas DataFrames, but PyTorch works with tensors. The following steps will convert our data into the right type. Keep track of your data type in each step; I added comments with the current data type.
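
Roughly (categoricals become int64 for the embedding lookups, everything else float32):

# DataFrame -> NumPy array -> tensor
cats_train = torch.tensor(X_train[cat_cols].values, dtype=torch.int64)
conts_train = torch.tensor(X_train[cont_cols].values, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)

cats_val = torch.tensor(X_val[cat_cols].values, dtype=torch.int64)
conts_val = torch.tensor(X_val[cont_cols].values, dtype=torch.float32)
y_val_t = torch.tensor(y_val, dtype=torch.float32).reshape(-1, 1)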

It’s time to use the PyTorch DataLoader. I chose a batch size of 128; feel free to play with it.
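
For example:

batch_size = 128
train_ds = TensorDataset(cats_train, conts_train, y_train_t)
valid_ds = TensorDataset(cats_val, conts_val, y_val_t)
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size)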

Define a TabularModel

The goal is to define a model based on the number of continuous columns plus the number of categorical columns and their embeddings. The output is a single float value because it’s a regression task. Its parameters are listed below, with a sketch of the model after the list.

  • ps: dropout probability for each layer
  • emb_drop: provide embedding dropout
  • emb_szs: list of tuples: each categorical variable size is paired with an embedding size
  • n_cont: number of continuous variables
  • out_sz: output size
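
Here is a minimal sketch of such a model, modeled on fastai’s TabularModel; the version in the repo may differ in details:

class TabularModel(nn.Module):
    def __init__(self, emb_szs, n_cont, out_sz, layers, ps, emb_drop, y_range=None):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(c, s) for c, s in emb_szs])
        self.emb_drop = nn.Dropout(emb_drop)
        self.bn_cont = nn.BatchNorm1d(n_cont)
        self.y_range = y_range

        layerlist, n_in = [], sum(s for _, s in emb_szs) + n_cont
        for n_out, p in zip(layers, ps):
            layerlist += [nn.Linear(n_in, n_out), nn.ReLU(inplace=True),
                          nn.BatchNorm1d(n_out), nn.Dropout(p)]
            n_in = n_out
        layerlist.append(nn.Linear(n_in, out_sz))
        self.layers = nn.Sequential(*layerlist)

    def forward(self, x_cat, x_cont):
        # One embedding lookup per categorical column, then concatenate.
        x = torch.cat([e(x_cat[:, i]) for i, e in enumerate(self.embeds)], dim=1)
        x = self.emb_drop(x)
        x = self.layers(torch.cat([x, self.bn_cont(x_cont)], dim=1))
        if self.y_range is not None:
            # Squash the output into the given (low, high) range.
            low, high = self.y_range
            x = torch.sigmoid(x) * (high - low) + low
        return x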

Set a y_range for prediction (optional), and call the model. Feel free to play with the inputs.
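
For example (the layer sizes, dropout values, and y_range bounds below are arbitrary choices to play with; the upper bound is the log of an assumed fare ceiling):

y_range = (0.0, float(np.log(250)))  # assumed fare ceiling of ~$250
model = TabularModel(emb_szs, n_cont=len(cont_cols), out_sz=1,
                     layers=[200, 100], ps=[0.2, 0.2],
                     emb_drop=0.04, y_range=y_range).to(device)
print(model)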

Printing the model shows the embedding, dropout, batch-norm, and linear layers, in order.

Define an optimizer. I chose Adam with a learning rate of 1e-2. The learning rate is the first hyperparameter you should tune. Moreover, there are different strategies for scheduling the learning rate (one-cycle, cosine, etc.). Here I use a constant learning rate.
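
In code:

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)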

Train and Fit

Train your model. Try to track and understand each step; it’s very helpful to use the set_trace() command. The evaluation metric is RMSE.

Pass the inputs to the fit function. The loss function, in this case, is MSELoss.
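
A minimal version of such a fit function (the repo’s version may differ; RMSE is just the square root of the MSE loss, here reported in log-fare space):

def fit(model, train_dl, valid_dl, optimizer, epochs=10):
    criterion = nn.MSELoss()
    train_losses, valid_losses = [], []
    for epoch in range(epochs):
        # Training pass
        model.train()
        total = 0.0
        for x_cat, x_cont, y in train_dl:
            x_cat, x_cont, y = x_cat.to(device), x_cont.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x_cat, x_cont), y)
            loss.backward()
            optimizer.step()
            total += loss.item() * y.size(0)
        train_losses.append(total / len(train_dl.dataset))

        # Validation pass
        model.eval()
        total = 0.0
        with torch.no_grad():
            for x_cat, x_cont, y in valid_dl:
                x_cat, x_cont, y = x_cat.to(device), x_cont.to(device), y.to(device)
                total += criterion(model(x_cat, x_cont), y).item() * y.size(0)
        valid_losses.append(total / len(valid_dl.dataset))

        print(f"epoch {epoch + 1}: train RMSE {train_losses[-1] ** 0.5:.4f}, "
              f"valid RMSE {valid_losses[-1] ** 0.5:.4f}")
    return train_losses, valid_losses

train_losses, valid_losses = fit(model, train_dl, valid_dl, optimizer, epochs=10)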

Plot the train vs. validation loss.
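
Something like:

plt.plot(train_losses, label="train")
plt.plot(valid_losses, label="validation")
plt.xlabel("epoch")
plt.ylabel("MSE loss (log fare)")
plt.legend()
plt.show()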

That finishes the training section.

After playing with the model and tuning the hyperparameters, you reach the point where you are satisfied with it. Only then can you go to the next step: testing your model on the test set.

The Test set

Remember: your test set has to go through the same process as the training set (we already did that). The next steps “prepare” it to be evaluated.

Divide into categorical and continuous columns and turn them into tensors.
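
Reusing test_X from the normalization step:

cats_test = torch.tensor(test_X[cat_cols].values, dtype=torch.int64).to(device)
conts_test = torch.tensor(test_X[cont_cols].values, dtype=torch.float32).to(device)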

Make a prediction
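
In evaluation mode and with gradients disabled:

model.eval()
with torch.no_grad():
    preds = model(cats_test, conts_test)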

Ok, you finally made a prediction. Congrats!

Note that the prediction is still a tensor. If you want to change it to a pandas DataFrame, walk through the steps in the repo. Next, you can export it to a CSV file.
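
A sketch of those steps (np.exp undoes the log transform from training; “key” is the id column Kaggle expects in a submission):

fares = np.exp(preds.cpu().numpy().ravel())
submission = pd.DataFrame({"key": df_test["key"], "fare_amount": fares})
submission.to_csv("submission.csv", index=False)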

If you are doing a Kaggle competition, upload it to Kaggle to see your score.

Conclusion

In summary, you learned how to build a PyTorch model for tabular data from scratch. Be sure to go through the full code and try to understand each line.

Don’t forget to connect with me on LinkedIn if you have any questions, comments or concerns.

Start working!
