Progressively approaching Kaggle

Getting started with Kaggle competitions using the Tabular Playground Series

Rohan Rao
Towards Data Science


Photo by Brett Jordan on Unsplash

The Titanic Competition is most people’s first attempt at getting started on Kaggle. It has a wonderful archive of resources, but if you’re looking for something newer, quicker and more progressive to get acquainted with Kaggle competitions, then the Tabular Playground Series is a fantastic place to start.

Tabular Playground Series (TPS)

TPS is a series of monthly competitions with simple tabular datasets. It has a beginner-friendly setup to help Kagglers get comfortable with Kaggle competitions.

It gives an end-to-end experience of how competitions work and quickly builds the confidence to explore the mainstream competitions.

It’s new! Most of the discussions, code and models are highly relevant to today’s Machine Learning. The outdated and unimportant resources often found in old competitions are automatically filtered out.

It’s quick! 30 days. A great setup for working on it as a project with a deadline and a deliverable. It gives the experience of a real-life competition, and even of industry work to some extent.

That’s how Data Science projects should be.

If you have ever postponed, procrastinated or felt intimidated by trying a Kaggle competition, now is a good time to leave that behind and start. And I literally mean NOW.

Work Progressively

Learn. Do. Iterate. (Image by Author)

Keep practising what you know.
Keep trying new things to learn.

Below is a sample framework for working progressively through the TPS competitions. Feel free to choose your own projects or tweak the approach to suit your interests and skills. The competitions are ordered chronologically for convenience, but since they are all independent, you can take them in any order.

Project 1: TPS - January 2021

1: Read the competition information
Read the Description, Evaluation, Timeline, Prizes and Rules.

Some competitions may have more details so make it a habit to read all the information and tabs provided.

The points / tiers criteria are mentioned at the bottom of the Overview page. Beginner competitions like TPS generally do not award any points since they are primarily for learning purposes.

2: Verify the data files
Read the data description and take a quick look at the actual data: train, test, sample_submission. Check that all the fields match their descriptions.

Understand the format of the submission file. It’s also a good idea to revisit how the submission file will be used to calculate the evaluation metric.
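These consistency checks take only a few lines of pandas. Here is a minimal sketch using tiny synthetic stand-ins for the competition files; the column names (`id`, `cont1`, `target`) are illustrative, not the actual TPS schema:

```python
import pandas as pd

# Tiny synthetic stand-ins for train.csv, test.csv and sample_submission.csv
train = pd.DataFrame({"id": [0, 1, 2], "cont1": [0.1, 0.2, 0.3], "target": [7.2, 8.1, 6.9]})
test = pd.DataFrame({"id": [3, 4], "cont1": [0.4, 0.5]})
sample_submission = pd.DataFrame({"id": [3, 4], "target": [0.0, 0.0]})

# The test set should have every train column except the target
assert set(test.columns) == set(train.columns) - {"target"}

# The submission must cover exactly the test ids, in the expected columns
assert list(sample_submission.columns) == ["id", "target"]
assert sample_submission["id"].tolist() == test["id"].tolist()
print("files consistent")
```

Running checks like these against the real files takes seconds and saves you from submitting in the wrong format later.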

3: Set up your environment
Download the datasets to your local machine or use free code resources like Kaggle Notebooks and Google Colab.

Most competitions will directly allow you to launch a Notebook from the Code tab of the competition.

4: Explore the data
Explore the data. Understand the data.

Spend time on this.

It will be an ongoing process throughout the lifetime of any project so prepare yourself to continuously analyze data and learn more each time.

It’s generally a good idea to explore the data by yourself first before diving into publicly shared notebooks and discussions.
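A first pass over the data might look like the sketch below, with a tiny made-up frame standing in for the real training file (`cont1`, `cat1`, `target` are assumed names):

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the training data
train = pd.DataFrame({
    "cont1": [0.1, 0.5, np.nan, 0.9],
    "cat1": ["a", "b", "a", "a"],
    "target": [3.0, 7.0, 5.0, 6.0],
})

# First-pass EDA: shape, column types, missing values, basic statistics
print(train.shape)
print(train.dtypes)
print(train.isna().sum())
print(train.describe())
```

These few calls already answer the basic questions: how big is the data, which columns are numeric versus categorical, and where the gaps are.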

5: Read the forum
Read through the competition forum.

There is plenty of useful information, and interesting discussions take place there. Stay updated and aware of them.

If you don’t mind getting updates via email, you should follow the forum. Otherwise, check new posts and comments from time to time.

6: Read notebooks
Go through and understand public notebooks.

They are among the best resources you could get. There is usually a Starter Notebook, which is a good place to begin your own code and to improve and update as you progress.

The two most popular types of notebooks are EDA / Analysis / Informational notebooks and Modelling / Benchmarking / Submission notebooks. Don’t hesitate to fork and copy them. It’s good practice to upvote any contributions you like or find useful.

7: Build a baseline model
Build a simple base model.

Having a very basic end-to-end model, whether from your own code pipeline, the Starter Notebook or a public notebook, gives you a starting point and score from which you can work to improve quantitatively.

A baseline model is often a simple heuristic or aggregate, such as the mean value of the target variable; it does not need to be a machine learning model.
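A mean-of-target baseline is only a few lines. The sketch below uses synthetic stand-ins for the competition files; `id` and `target` are assumed column names:

```python
import pandas as pd

# Synthetic stand-ins for the competition files
train = pd.DataFrame({"id": [0, 1, 2, 3], "target": [7.0, 8.0, 6.0, 9.0]})
test = pd.DataFrame({"id": [4, 5]})
sample_submission = pd.DataFrame({"id": [4, 5], "target": [0.0, 0.0]})

# Simplest possible baseline: predict the training-target mean for every test row
submission = sample_submission.copy()
submission["target"] = train["target"].mean()
print(submission)
```

Writing `submission.to_csv("submission.csv", index=False)` then produces a file in the format the competition expects.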

8: Make a submission
The proof of the pudding is in the eating: finally use a model to predict on the test data and submit the predictions to the Kaggle leaderboard.

Make that submission! Become a Kaggler.

9: Ask questions
You are your biggest asset. You are your biggest liability. The choice is yours.

If there is anything you are unsure of or don’t understand, all you need to do is ask. The Kaggle community is active around the clock and someone will help you out.

Project 1 Plan (Image by Author)

Project 2: TPS - February 2021

1: Iterate Project 1
Do everything you did in the previous project. Skip anything that isn’t relevant or interesting to you.

2: Validation framework
Validate everything you try.

Setting up a strong validation framework is the most common theme among plenty of winning solutions on Kaggle across the years.

Spend time building the validation pipeline, and test it both locally and against the public leaderboard to get a sense of how reliable it is. Sometimes you might even need multiple validation strategies, but that differs from competition to competition.
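One common shape for such a pipeline is k-fold cross-validation. The sketch below rolls a 5-fold split by hand on synthetic data, with ordinary least squares standing in for whatever model you actually use:

```python
import numpy as np

# Synthetic regression data: linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation by hand: shuffle the indices, split into folds
n_folds = 5
idx = rng.permutation(len(X))
folds = np.array_split(idx, n_folds)

rmses = []
for i in range(n_folds):
    valid_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
    # Ordinary least squares as a stand-in model for each fold
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    pred = X[valid_idx] @ w
    rmses.append(np.sqrt(np.mean((y[valid_idx] - pred) ** 2)))

print(f"CV RMSE: {np.mean(rmses):.3f}")
```

The mean of the fold scores is your local estimate of performance; if it moves in the same direction as the public leaderboard across a few submissions, the validation setup can be trusted.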

3: Data Cleaning
Clean the data. Remember GIGO: Garbage In, Garbage Out.

Go back to the raw datasets and prepare them in their cleanest form. Different kinds of pre-processing and transformations are required for different datasets, and sometimes for different models.

Test and verify the cleaning transformations using validation scores.
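A couple of typical cleaning steps, sketched on a hypothetical frame with missing values and a string category (`cont1` and `cat1` are made-up column names):

```python
import pandas as pd
import numpy as np

# Synthetic raw data with common problems: gaps and a string category
raw = pd.DataFrame({
    "cont1": [1.0, np.nan, 3.0, 4.0],
    "cat1": ["a", "b", "a", None],
})

clean = raw.copy()
# Fill numeric gaps with the column median, categorical gaps with a sentinel
clean["cont1"] = clean["cont1"].fillna(clean["cont1"].median())
clean["cat1"] = clean["cat1"].fillna("missing")
# Encode the category as integer codes for models that need numeric input
clean["cat1"] = clean["cat1"].astype("category").cat.codes

assert clean.isna().sum().sum() == 0
print(clean)
```

Whether median imputation or integer encoding is the right choice depends on the dataset and the model, which is exactly why each transformation should be checked against validation scores.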

4: Feature engineering
Create features. It’s one of the fun parts of data science.

New features can significantly help in improving model performance. Different types of features might work for different models.

Go wild. Experiment hard. Try out as many features and ideas as you can and continuously test them using the validation scores. Usually the feature space that gives you the best performance will be a mix of some raw features and some engineered features.

Push yourself to get the best performance from a single model.
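Group-statistic features are one common pattern. A sketch on made-up data (`cat1` and `cont1` are illustrative names):

```python
import pandas as pd

# Synthetic data: a categorical column and a numeric column
df = pd.DataFrame({
    "cat1": ["a", "a", "b", "b"],
    "cont1": [1.0, 3.0, 10.0, 14.0],
})

# Group-statistic features: each row's group mean, and its deviation from it
df["cat1_mean"] = df.groupby("cat1")["cont1"].transform("mean")
df["cont1_dev"] = df["cont1"] - df["cat1_mean"]
print(df)
```

As with cleaning, each candidate feature earns its place only if it improves the validation score.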

5: Error analysis
Identify the validation observations where the model is able to predict well and where the model fails. And think about why and what you can do about it. Investigate.

Analyzing model errors is an often-neglected part of the machine learning workflow, but it can be crucial for generating ideas for new features.
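A minimal way to start is to rank validation rows by absolute error and inspect the worst ones. The predictions below are made up for illustration:

```python
import pandas as pd

# Hypothetical validation rows with true targets and model predictions
valid = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "target": [10.0, 5.0, 8.0, 3.0],
    "pred": [9.5, 5.2, 12.0, 3.1],
})

valid["abs_error"] = (valid["target"] - valid["pred"]).abs()
# Inspect the worst predictions first: these rows often hint at missing features
worst = valid.sort_values("abs_error", ascending=False).head(2)
print(worst[["id", "abs_error"]])
```

Looking at the raw feature values of the worst rows, and what they have in common, is where the new-feature ideas usually come from.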

Project 2 Plan (Image by Author)

Project 3: TPS - March 2021

1: Iterate Project 2
Do everything you did in the previous project. Skip anything that isn’t relevant or interesting to you.

2: Improve visualizations
Enhance your visualizations and presentation on Kaggle.

While you are not directly evaluated on them on Kaggle, they become useful when working on industry problems where you may be required to present your work to business stakeholders.

And of course it will help to showcase your contributions and get feedback from the community. So don’t shy away from it.

The best way to up-skill is to study highly voted EDA notebooks and learn how to build great visualizations.

3: Publish your EDA notebook
What makes Kaggle such a wonderful platform is the Kagglers and the community. Be a part of it. Contribute.

Try to experience what it takes to write an EDA notebook and to get feedback on it from others. It doesn’t matter how many votes you get. Everyone has to start somewhere, and everyone starts by publishing their first notebook.

4: Share insights
If you find something interesting in the data or if you want to share an intriguing insight or report an issue or write about anything, you will always have an audience of readers. Contribute.

Post on the forum. Or write some comments. Start interacting with the community. The more you share your work, the more you learn and the more people will help you out.

Project 3 Plan (Image by Author)

Project 4: TPS - April 2021

1: Iterate Project 3
Do everything you did in the previous project. Skip anything that isn’t relevant or interesting to you.

2: Explore models
Here’s your chance to experiment and build a plethora of models to figure out which ones work best. This is the fancy part of the field, but if you thought this was all Machine Learning is about, think again: you only get to it after completing three full projects.

It’s important to read about and understand the internal workings of different models so that you can optimize them on the datasets and use them to the fullest. Competitions are the best way to get practical experience implementing them on real datasets, and most of the latest models are discussed on Kaggle.

3: Ensembling
Combine models. No single model is perfect.

There are many ways to combine multiple, diverse models, and doing so almost always leads to more stable predictions and better performance.

Learn and build many different models, and optimize each one with its best set of features and hyper-parameters before starting to ensemble.

Start with blending models and later move to exploring stacking models.
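A blend can be as simple as a weighted average of predictions. The weights and predictions below are made up; in practice you would tune the weights on validation scores:

```python
import numpy as np

# Hypothetical predictions from two diverse models on the same rows
pred_a = np.array([0.2, 0.8, 0.5])
pred_b = np.array([0.4, 0.6, 0.7])

# Simple blend: weighted average, with weights chosen via validation
blend = 0.6 * pred_a + 0.4 * pred_b
print(blend)  # [0.28 0.72 0.58]
```

Stacking goes one step further: the out-of-fold predictions of the base models become input features for a second-level model, which learns the combination instead of using fixed weights.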

Project 4 Plan (Image by Author)

Learn. Do. Iterate.

While some of these might seem simple, they take time to become skilled at. So practise a lot. Many of the terms and tasks require reading and research, which is best done as part of implementing and experimenting with them.

Reading without coding is bad.
Coding without understanding is bad.

They both need to go hand-in-hand.

Nothing is written in stone. Every data scientist will experience their own unique journey. Make it enjoyable.

“Learn. Do. Iterate.” - Data Science Nightly

Find me on Twitter @vopani
