
How I ranked in the top 25% on my first Kaggle competition

Motivation for writing this post

Photo by Boris Stefanik on Unsplash

The learnings shared in this article are gleaned from ranking in the top 25% (rank #447 out of 1,728 participants) in my first ever Kaggle competition, Tabular Playground Series – Jan 2021. My performance was far better than I had initially expected. This came as a pleasant surprise because a) I stopped writing code for a living 6 years ago and b) it’s been a mere 5 months since I started my journey of learning Data Science.

This article is the result of me reflecting on what worked well. I want to share these learnings with other early Kagglers so that you can improve your performance in Kaggle competitions.

About you

I believe that you will get the most out of this if

  1. You are beginning your Data Science journey and have already spent some time on online courses such as [this](https://course.fast.ai/) and this
  2. You have enrolled in a few Kaggle competitions and are looking for specific tips to improve your rank

On the other hand, if you are looking for the basics of getting started with Kaggle competitions, I found useful resources [here](https://elitedatascience.com/beginner-kaggle) and here.

Four suggestions to perform well on Kaggle competitions

  1. Understand the dataset
  2. Start with a simple model
  3. Learn from everywhere. Be open
  4. Apply the Scientific method

The rest of this article contains detailed explanations of each of these four suggestions along with screenshots and code snippets from my work on the Kaggle competition Tabular Playground Series – Jan 2021. It would be instructive to join the competition, download the dataset and try out the code snippets as you read along.

Suggestion 1: Understand the dataset

The first step to be successful in Kaggle competitions is to gain a good understanding of the dataset. It is important to note that learning the ins and outs of the dataset is a continuous process and your knowledge will increase with time as you try out different models. However, it is still vital to perform some form of preliminary data analysis to get a sense of the data that you’re dealing with, even before training your first model.

The quickest way to understand your dataset, without writing a single line of code, is by navigating to the Data tab. Here, you can see which variables are continuous or categorical, view distributions of each of the variables and study the data descriptions (mean, std deviations, min, max etc).

The Data tab in the Kaggle competition space. Image is a screenshot from kaggle.com


The following three insights jump out from this dataset

  1. All the features in this dataset are continuous
  2. All the features seem to be min-max normalised, i.e., the min values are close to 0 and the max values are close to 1. Since the features are already scaled, we can skip that preprocessing step
  3. The distribution of the feature cont5 is heavily skewed to the left. It might be worth checking whether performance improves after applying a power transform such as the Box-Cox method to this feature
All the features in this dataset are continuous. Image is a screenshot from kaggle.com
The distribution of the feature cont5 is skewed to the left. Image is a screenshot from kaggle.com

A more comprehensive approach to understanding the dataset is to perform an Exploratory Data Analysis (EDA). There are plenty of incredible resources such as [this](https://www.activestate.com/blog/exploratory-data-analysis-using-python/), [this](https://www.analyticsvidhya.com/blog/2020/08/exploratory-data-analysiseda-from-scratch-in-python/) and this that delve deeper into this topic. However, the approach of leveraging the Data section in the Kaggle competition to gain a preliminary understanding of the dataset serves as an equally good, if not better, first step for beginners like us. This is because the Data section already contains very rich information about the data.
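If you would rather reproduce the Data tab's summary in a notebook, a minimal sketch with pandas might look like the following. The helper name `summarize_features`, the column names and the toy data are assumptions made for illustration and are not taken from the competition notebook; in practice you would pass in the competition's training DataFrame.

```python
import numpy as np
import pandas as pd

def summarize_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column summary statistics plus skewness."""
    summary = df.describe().T      # count, mean, std, min, quartiles, max
    summary["skew"] = df.skew()    # negative = longer left tail, positive = longer right tail
    return summary

# Toy data standing in for a real training set (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.uniform(0, 1, 1000),    # roughly symmetric between 0 and 1
    "left_skewed": rng.beta(5, 1, 1000),     # mass near 1, long tail to the left
})

print(summarize_features(toy)[["min", "max", "skew"]])
```

A clearly negative `skew` value is the numeric counterpart of a histogram whose tail stretches to the left.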

Suggestion 2: Start with a simple model

Now that we have a preliminary understanding of the dataset, let’s train a simple model. Starting with a simple model has the following benefits

  1. Simpler models are easier to comprehend and to explain
  2. Simpler models can serve as great baseline models. Click here to read more about baseline models. When training more complex models during later stages, we can ascertain if the added layer of sophistication results in improved accuracy vis-a-vis the baseline model
  3. Simpler models spare us from tuning hyper-parameters before we fully appreciate the nuances of the dataset and how the model interacts with it. More sophisticated models tend to come with several hyper-parameter choices. A big challenge with jumping straight into these complex models is that it becomes difficult, if not impossible, to ascertain whether the observed results are caused by poor model selection or by sub-optimal hyper-parameter choices

Simpler models are easier to comprehend and serve as good baseline models to better orient yourself.

The section below contains the details of two simple preliminary models, Decision Trees and Random Forest, trained on the Tabular Playground Series – Jan 2021 dataset.
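The snippets that follow assume that `X_train`, `X_test`, `y_train` and `y_test` already exist. A minimal sketch of how they might be created, mirroring the split used later in this article (80% training data, `id` column dropped), is shown here. The helper `make_split`, the fixed `random_state` and the toy DataFrame are assumptions added for illustration; with the real data, `df` would be the competition's training file loaded into pandas.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def make_split(df: pd.DataFrame, seed: int = 42):
    """Separate the target, drop the id column and hold out 20% for testing."""
    features = df.drop(columns=["id", "target"])
    target = df["target"]
    # A fixed random_state keeps the split identical across experiments
    return train_test_split(features, target, train_size=0.80, random_state=seed)

# Toy frame mimicking an id / cont* / target layout (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.uniform(0, 1, size=(100, 3)), columns=["cont1", "cont2", "cont3"])
toy.insert(0, "id", range(100))
toy["target"] = rng.normal(8, 0.7, size=100)

X_train, X_test, y_train, y_test = make_split(toy)
print(X_train.shape, X_test.shape)
```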

Simple model #1: Decision trees

The section below trains a Decision tree with a stopping criterion – maximum leaf nodes = 4. It calculates the error and visualises the tree.

import math
from sklearn.tree import DecisionTreeRegressor
#Creating a Decision tree -- with stopping criterion (max leaves = 4)
m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(X_train, y_train);
#Creating a function to check the root mean squared error of the model (m_rmse)
#Rounded to 6 decimal places to match the errors reported in this article
def r_mse(pred,y): 
    return round(math.sqrt(((pred-y)**2).mean()), 6)
def m_rmse(m, xs, y): 
    return r_mse(m.predict(xs), y)

Let’s print the Root Mean Squared error

print ("training error", m_rmse(m, X_train, y_train))
print ("test error", m_rmse(m, X_test, y_test))
Image by the author

The RMSE error on the training set is 0.728172. The error on the test set is 0.725077. We can treat this as the performance of our baseline model.

Let’s visualize the Decision tree (our baseline model) to understand which columns/features are important

Decision tree with max leaf nodes = 4. Image by the author

From the above tree diagram, it is clear that cont3, cont2 and cont7 are the top 3 most important features.
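The tree diagram above is a screenshot. One way to inspect a small tree in text form is scikit-learn's `export_text`, sketched here on synthetic data. The feature names and the data are made up for illustration, and this is not necessarily how the figure above was produced.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: the target depends mostly on the first feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=500)
names = ["feat_a", "feat_b", "feat_c"]   # hypothetical feature names

tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(X, y)

# Text rendering of the split rules, one line per node
print(export_text(tree, feature_names=names))

# Impurity-based importances; features never used in a split get 0
for name, imp in zip(names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

Keep in mind that a tree with four leaves has only three splits, so at most three features can appear in it. The diagram tells us which features the first few splits rely on rather than giving a full ranking.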

Simple model #2: Random Forest

Now that we have trained a baseline model and have established a baseline performance of 0.725077 on the test set, let’s progress to a slightly more complex model. In this section, we will train a Random Forest model. Random Forest algorithms are an ensemble or a grouping of smaller and less accurate Decision Trees. Random Forest models use a technique called Bagging to combine these less accurate Decision Trees (also known as weak learners). Lesson 7 of the fast.ai MOOC offers an excellent and a practical deep dive into the concept behind and the implementation of Random Forest algorithms.
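To make the idea of combining many trees concrete, the sketch below checks on synthetic data that a fitted `RandomForestRegressor` predicts the plain average of its individual trees, and compares the error of the average with the typical error of a single tree. The data, the helper `make_data` and the parameter values are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_data(n, rng):
    """Generate a small synthetic regression problem."""
    X = rng.uniform(0, 1, size=(n, 4))
    y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(0, 0.1, size=n)
    return X, y

rng = np.random.default_rng(0)
X_fit, y_fit = make_data(400, rng)
X_new, y_new = make_data(200, rng)   # held-out rows for scoring

forest = RandomForestRegressor(n_estimators=25, min_samples_leaf=5,
                               random_state=0).fit(X_fit, y_fit)

# Each fitted tree was trained on its own bootstrap sample of the rows
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])

# The forest's prediction is the average over its trees
averaged = per_tree.mean(axis=0)
print("matches forest.predict:", np.allclose(averaged, forest.predict(X_new)))

# Compare the averaged prediction with the individual trees
tree_rmse = np.sqrt(((per_tree - y_new) ** 2).mean(axis=1))
forest_rmse = np.sqrt(((averaged - y_new) ** 2).mean())
print("mean single-tree RMSE:", round(float(tree_rmse.mean()), 4))
print("forest RMSE:", round(float(forest_rmse), 4))
```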

Expected behaviour of implementing a Random Forest model on the dataset: Since Random Forests are more sophisticated models, we should expect to see a lower error, i.e., an improved performance vis-a-vis the Decision Tree model.

from sklearn.ensemble import RandomForestRegressor
def rf(xs, y, n_estimators=40, max_samples=50000,
       max_features='sqrt', min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)
mrf = rf(X_train, y_train)

Let’s print the Root Mean Squared error for the Random Forest model now

Image by the author

Observed behaviour: We notice that the test error of this model is 0.706288 which is lower than the test error of the Decision Tree algorithm (0.725077). Therefore, our observed behaviour is in-line with the expected behaviour.

Let’s visualize the features that the Random Forest model deems important.

import pandas as pd
#visualising the importance of the features 
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
                       ).sort_values('imp', ascending=False)
def plot_fi(fi):
    return fi.plot(x='cols', y='imp', kind='barh', figsize=(12,7), legend=False)
#compute the importances of the trained Random Forest before plotting them
fi = rf_feat_importance(mrf, X_train)
plot_fi(fi[:14]);
Features sorted by importance - Random Forest algorithm. Image by the author

It is interesting to note that the Random Forest and Decision Tree algorithms agreed that cont3 and cont2 are the top two most important features.

Suggestion 3: Learn from everywhere, be open

Having invested the time to understand the dataset and to train simple models, we are now ready to assess how more sophisticated models might perform on our dataset.

One of the most rewarding aspects of participating in Kaggle competitions is the opportunity to learn from other participants. The Code and Discussion sections (screenshot below) offer code walkthroughs of possible solutions and high-level advice on what approaches to follow. They are incredible resources for generating new ideas on which models to select and what hyper-parameter choices are available.

The Code and Discussion sections in the Kaggle competition space. Image is a screenshot from kaggle.com

Many of the posts in the Discussion section are authored by experienced Kagglers and seasoned Data Scientists. Therefore, reading these posts and trying out the solutions will give you a good ROI on your time. The two posts that I found to be very helpful were Techniques to improve your leaderboard position by Gabriel Preda and Detailed EDA With LightGBM by Gaurav Rajesh Sahani. The latter is a Python Notebook and it can be found in the Code section. It inspired me to try out the LightGBM model on the dataset.

The Code and Discussion sections in the Kaggle competition space are excellent resources to generate new ideas to help decide which models to try next

Before trying out any new model, it is instructive to gain a preliminary understanding of how it works. An approach that has served me well is to start with the official documentation and read it till it stops making sense. At this point I’d pause reading the official documentation and read as many articles as I can that offer good explanations of the concept and the implementation. I would then circle back to the official documentation to round off my understanding.

Now, we will implement a LightGBM model on our dataset.

Expected behaviour: Since LightGBM models are more sophisticated than our baseline Decision Trees, we should expect to see an improved performance as measured by the RMSE error

import lightgbm as lgb
LGB = lgb.LGBMRegressor(random_state=33, n_estimators=5000, min_data_per_group=5, boosting_type='gbdt',
 num_leaves=246, max_depth=-1, learning_rate=0.005, subsample_for_bin=200000,
 lambda_l1= 1.07e-05, lambda_l2= 2.05e-06, n_jobs=-1, cat_smooth=1.0, 
 importance_type='split', metric='rmse', min_child_samples=20, min_gain_to_split=0.0, feature_fraction=0.5, 
 bagging_freq=6, min_sum_hessian_in_leaf=0.001, min_data_in_leaf=100, bagging_fraction=0.80)
m_LGB = LGB.fit(X_train, y_train)

Let’s print the Root Mean Squared error of the LightGBM model

print ("training error", m_rmse(m_LGB, X_train, y_train))
print ("test error", m_rmse(m_LGB, X_test, y_test))
Image by the author

Observed behaviour: We notice that the test error of the LightGBM model is 0.694172, which is lower than the test errors of both the Decision Tree (0.725077) and the Random Forest (0.706288). Therefore, our observed behaviour is in line with the expected behaviour.

Let’s visualize the features that the LightGBM model deems important.

#view the importance of the features
lgb.plot_importance(m_LGB, ax=None, height=0.2, xlim=None, ylim=None, 
                      title='Feature importance', xlabel='Feature importance', ylabel='Features', 
                      importance_type='split', max_num_features=None, 
                      ignore_zero=True, figsize=None, dpi=None, grid=True, precision=7)
Image by the author

It is interesting to note that cont3 and cont2 are not among the most important features of the LightGBM model.

The key takeaway from this section is that by being open and by learning from different sources such as Kaggle forums, official documentation and blog posts, you can improve your rank on the competition Leaderboard.

Suggestion 4: Apply the Scientific method

The field of Data Science is constantly evolving. There are potentially infinite approaches to achieve the desired outcome. In situations such as this, when the possibilities are limitless, it is easy to get lost in the details and lose sight of the main objective, which in our case is to improve our rank on the competition's leaderboard while enhancing our knowledge. A useful approach to tackle ambiguity such as this is to apply the Scientific method to our work. We start with a hypothesis, test it on our model to prove or disprove it, draw conclusions and record the results. An important point is to test one hypothesis at a time. This helps us to assess the impact of each change more clearly.

The following section further elaborates on this point. We will apply the Scientific method to a hypothesis based on the observation that we made about the column cont5.

Hypothesis: Might applying a transformation to the feature cont5 improve the performance of our model?

Earlier in this post, we observed that the distribution of the feature cont5 is very much skewed to the left. One possibility is that the underlying distribution of the data may be normal, but it may require a transform in order to help expose it.

This is an area where the Box-Cox method might come in handy. It is a data transform method which can perform a range of power transforms such as taking the log or the square root of the observations in order to make the distribution more normal. This is an excellent resource for further reading about transforming data to fit the normal distribution.
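To make the power transform concrete: for strictly positive x, Box-Cox maps x to (x^lambda - 1) / lambda when lambda is not 0, and to ln(x) when lambda is 0. The sketch below checks this against `scipy.stats.boxcox` on a handful of made-up positive values.

```python
import numpy as np
from scipy.stats import boxcox

# Box-Cox requires strictly positive inputs (values here are made up)
x = np.array([0.05, 0.2, 0.5, 0.9, 1.0])

# lambda = 0 reduces to the natural logarithm
log_version = boxcox(x, 0)
print(np.allclose(log_version, np.log(x)))

# lambda = 0.5 is a shifted and scaled square root: (sqrt(x) - 1) / 0.5
sqrt_version = boxcox(x, 0.5)
print(np.allclose(sqrt_version, (np.sqrt(x) - 1) / 0.5))

# With no lambda given, SciPy estimates one by maximum likelihood
transformed, fitted_lambda = boxcox(x)
print(fitted_lambda)
```

The second argument `0` passed to `boxcox` in the code further down therefore amounts to a log transform of cont5.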

Now, we will transform the column cont5 using the box-cox method and use it as a feature to the LightGBM model.

Expected behaviour: Since cont5 is the 2nd most important column in the LightGBM model, transforming it to make its distribution more normal might lead to an improvement in the performance.

from scipy.stats import boxcox
from sklearn.model_selection import train_test_split
train_df['cont5'] = boxcox(train_df['cont5'], 0)
target = train_df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(train_df, target, train_size=0.80)
#remove the id columns
X_train.pop('id')
X_test.pop('id')
import lightgbm as lgb
LGB = lgb.LGBMRegressor(random_state=33, n_estimators=5000, min_data_per_group=5, boosting_type='gbdt',
 num_leaves=246, max_depth=-1, learning_rate=0.005, subsample_for_bin=200000,
 lambda_l1= 1.07e-05, lambda_l2= 2.05e-06, n_jobs=-1, cat_smooth=1.0, 
 importance_type='split', metric='rmse', min_child_samples=20, min_gain_to_split=0.0, feature_fraction=0.5, 
 bagging_freq=6, min_sum_hessian_in_leaf=0.001, min_data_in_leaf=100, bagging_fraction=0.80)
m_LGB_box_cox = LGB.fit(X_train, y_train)

Let’s print the Root Mean Squared error of the model

Image by the author

The error after transforming the feature cont5 is 0.695321. This is slightly higher than the error of the LightGBM model without transforming cont5 (0.694172). Therefore, we can conclude that using the box-cox method on the column cont5 does not result in an improvement in the performance of our model. While this approach did not improve our model’s performance, we did learn something new that was valuable.
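When two scores differ by as little as these, it is worth ruling out the train/test split as the cause, since `train_test_split` reshuffles the rows on every call unless `random_state` is fixed. A minimal sketch of comparing a baseline and a single change on an identical split follows. The helper `compare_on_same_split`, the synthetic data and the use of scikit-learn's `RandomForestRegressor` in place of LightGBM are all assumptions made to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def rmse(pred, y):
    return float(np.sqrt(((pred - y) ** 2).mean()))

def compare_on_same_split(df, target_col, column, transform, seed=0):
    """Score a model with and without one transformed column on an identical split."""
    X, y = df.drop(columns=[target_col]), df[target_col]
    scores = {}
    for label, apply_change in [("baseline", False), ("transformed", True)]:
        X_variant = X.copy()
        if apply_change:
            X_variant[column] = transform(X_variant[column])
        # Same seed -> same rows in train and test for both variants
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_variant, y, train_size=0.80, random_state=seed)
        model = RandomForestRegressor(n_estimators=30, random_state=seed).fit(X_tr, y_tr)
        scores[label] = rmse(model.predict(X_te), y_te)
    return scores

# Toy data with strictly positive features so that a log transform is defined
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.uniform(0.01, 1, size=(300, 3)), columns=["a", "b", "c"])
toy["target"] = 2 * toy["a"] + toy["b"] + rng.normal(0, 0.1, size=300)

print(compare_on_same_split(toy, "target", "c", np.log))
```

Tree-based models split on the ordering of feature values rather than on their scale, so a monotonic transform such as the log often changes the score very little. That may be part of why the Box-Cox experiment above made no real difference.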

The key takeaway from this section is that following a Scientific approach helps keep us disciplined, which can increase our odds of success. The key tenet of this approach is to test one change at a time so that we can clearly assess the impact of each hypothesis on the performance and consequently on our position on the Leaderboard.

Below is a screenshot of how I tracked progress during the competition. I maintained a Spreadsheet in which I logged all the changes made along with the performance of the model on Training and Test datasets. Furthermore, I tracked the score that I received after making a submission. This helped me stay on track and clearly assess what worked and what didn’t.

Image by the author
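I used a spreadsheet, but the same log can also be kept from code. Below is a minimal sketch using only the Python standard library. The `Experiment` fields, the helper `log_experiment` and the file name are hypothetical and simply mirror the columns described above; the numbers in the example row are the Decision Tree errors reported earlier in this article.

```python
import csv
import tempfile
from dataclasses import dataclass, asdict, fields
from pathlib import Path

@dataclass
class Experiment:
    change: str             # the single change being tested
    train_rmse: float
    test_rmse: float
    leaderboard: str = ""   # filled in after a submission, if one was made

def log_experiment(path, exp):
    """Append one experiment to a CSV file, writing the header on first use."""
    path = Path(path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(Experiment)])
        if is_new:
            writer.writeheader()
        writer.writerow(asdict(exp))

# Example: write one row to a throwaway directory and show the file
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "experiments.csv"
    log_experiment(log_file, Experiment("Decision tree, max_leaf_nodes=4", 0.728172, 0.725077))
    print(log_file.read_text())
```

Logging one row per change keeps the discipline of testing a single hypothesis at a time visible in the record itself.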

I hope that you found the article useful. Happy Kaggling!

Code and acknowledgements

You can view the full code here.

Thanks to Samir Madhavan, Saptarishi Datta and Divya Joseph for your valuable feedback.

Connect with me

LinkedIn: https://www.linkedin.com/in/anandkumarravi/

GitHub: https://github.com/Anandravi87

Twitter: https://twitter.com/Anand_1187

