Sooner or Later You Will Face This Problem

Jose Fumo
Towards Data Science
8 min read · Jul 19, 2017


source: kaggle

A few days ago the Mercedes-Benz competition hosted on Kaggle ended, and it was one of the hardest I have ever experienced (and I'm not alone). In this post I will explain why and share some lessons we can take from it. The post is meant to be short: I've written it as a way to share the experience my team had working on this competition, to explain why the problem presented to us is still considered a hard one, and to list some of the problems you too will likely face when solving a machine learning problem, whether on a side project or at a company.

What Made the Mercedes-Benz Competition So Hard?

First, let me introduce you to the task presented by Daimler. When this competition was launched, it brought with it this question:

Can you cut the time a Mercedes-Benz spends on the test bench?

And there’s a bit more interesting information about the task:

“In this competition, Daimler is challenging Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench. Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. Winning algorithms will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.”

The way we worked on this problem was to treat it as a regression problem: given the features, build models to predict y (the test time).

In a data science project there are many things that can go wrong, but here are the common problems many competitors ran into. There are still open discussions about the best approaches to tackle them, and below is why this competition was a legendary one:

  • Small dataset and many features
  • Outliers
  • Sensitive metrics
  • Inconsistent CV and LB

(1). Small dataset and many features

Whenever you want to solve a machine learning problem, most of the time the ideal first step is to look for the data (data gathering or collection) that will help you train your models. But sometimes the specific data you want isn't available, or there isn't enough of it, and this is one of the problems the industry still faces today. In the Mercedes-Benz dataset the training data consisted of 4209 examples and more than 300 features (8 categorical and the rest numeric), and you might guess that it is challenging to build powerful machine learning models from so little data.

But what happens if we don’t have enough data?

When we don't have enough data, our model is very likely to have high variance, which means overfitting becomes much harder to avoid.

The sadness of overfitting

You never want to overfit. Overfitting happens when our model learns specific details of our data, including noisy data points (e.g. outliers), and ends up failing to generalize. Another way to think of overfitting is that it happens when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfitted has poor predictive performance, as it overreacts to minor fluctuations in the training data. The dataset presented to Kagglers led many competitors to overfit, and that's why there was a big shake-up between the public and private leaderboards (I'm not going into details on that; ask me in the comments if you really want to understand the difference between the public and private leaderboards).
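
As a rough illustration of that gap (on synthetic data, not the competition set), an unconstrained decision tree fitted to a small, wide dataset scores almost perfectly on the rows it has seen and much worse on held-out rows:

```python
# Minimal sketch on synthetic data: few rows, many features, only one of which
# actually matters. A fully grown tree memorizes the training rows (train R2
# near 1.0) but generalizes poorly to the held-out split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.rand(300, 100)                  # small dataset, many features
y = X[:, 0] + 0.5 * rng.randn(300)      # only the first feature is informative

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor().fit(X_tr, y_tr)   # unconstrained, free to memorize

print("train R2:", r2_score(y_tr, model.predict(X_tr)))   # ~1.0
print("test  R2:", r2_score(y_te, model.predict(X_te)))   # far lower
```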

(2). Outliers

Noise in general becomes an issue, whether in your target variable or in some of the features. With small datasets, outliers become much more dangerous and harder to deal with. Outliers are interesting: depending on the context, they either deserve special attention or should be completely ignored. If the dataset contains a fair amount of outliers, it's important to either use a modeling algorithm that is robust to outliers or filter the outliers out. In the Mercedes-Benz Greener Manufacturing competition there were some outliers in the target variable (y): most of the y values fell between 70 and 120, so finding values like 237 served as a warning to us.
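
As a hedged sketch of the "filter the outliers out" option, assuming the competition's train.csv with its y column, you can inspect the target and drop the extreme values; the cut-off of 150 below is purely illustrative, not the one we actually used:

```python
# Sketch of filtering extreme target values before training.
# The threshold of 150 is an illustrative choice, not our actual setting.
import pandas as pd

train = pd.read_csv("train.csv")      # assumes the Kaggle competition file
print(train["y"].describe())          # most test times fall roughly in 70-120

mask = train["y"] < 150               # drop the handful of extreme times
train_filtered = train[mask]
print(f"removed {len(train) - len(train_filtered)} suspected outliers")
```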

(3). Sensitive metrics

Model evaluation is used for quantifying the quality of predictions, and whenever you train a machine learning model you ideally want to choose a scoring metric (for model evaluation) that tells you whether the changes you make are improving the performance/predictive power of your model. For classification problems it is common to use accuracy, negative log loss and F1; for regression problems on Kaggle it is common to see MSE (mean squared error), MAE (mean absolute error), or some variation of these metrics. They have varying levels of robustness, but for this contest the organizers chose R², which was a very interesting choice. Because the metric was so easy to overfit, it made traditional Kaggle techniques like meta-ensembling or deep learning fruitless. Maybe I should write a post on evaluation metrics and loss functions? You tell me.

For the Mercedes-Benz Greener Manufacturing competition the scoring metric was R², as opposed to metrics traditionally used in Kaggle contests such as RMSE or MAE. In comparison, R² is a highly non-robust metric that is sensitive to outliers and extremely prone to overfitting. Because of that ease of overfitting, the leaders of the public leaderboard fell hard on the private leaderboard. A simple model with few features is the best way to generalize under the chosen metric, but many Kagglers are vain and would rather see a higher ranking on the public leaderboard. Besides the training data, our predictions had to be made on unseen data (the test set), where there were also some outliers, so if for a given unseen instance our model predicts y = 133 while the true value is 219, R² penalizes our model a lot (remember, R² is a very sensitive metric). While removing outliers from the train set lowered the score on the public LB, it was the appropriate action to take, since it still gave the best score relative to others.
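
To make that concrete, here is a toy demonstration (made-up numbers) of how a single badly-missed outlier, like the 219-vs-133 case above, drags R² down:

```python
# Toy illustration of R2 sensitivity: five well-predicted points, then one
# extreme true value that the model misses badly.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([92.0, 101.0, 78.0, 110.0, 96.0])
y_pred = np.array([90.0, 103.0, 80.0, 108.0, 95.0])
print("R2 without the outlier:", r2_score(y_true, y_pred))      # ~0.97

y_true_o = np.append(y_true, 219.0)   # one extreme test time in the unseen data
y_pred_o = np.append(y_pred, 133.0)   # the model predicts far below it
print("R2 with the outlier:   ", r2_score(y_true_o, y_pred_o))  # drops sharply
```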

(4). Inconsistent CV

CV = Cross Validation.

The problem of inconsistent CV happens when you don't build a proper validation setup for your model, which can lead you to think you are doing great when you are not. Cross validation is one of the best ways (if not the best) I know to evaluate machine learning models, but with small datasets, many features and outliers, even cross validation can fail (particularly when using R²).

If you are not familiar with cross validation, this quick outline might help you.

Evaluation procedure using K-fold cross-validation (e.g. 5-fold):
  1. Split the dataset into K equal partitions (or “folds”).
  2. Use fold 1 as the testing set and the union of the other folds as the training set.
  3. Calculate testing accuracy.
  4. Repeat steps 2 and 3 K times, using a different fold as the testing set each time.
  5. Use the average testing accuracy as the estimate of out-of-sample accuracy.
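
For reference, here is that procedure written with scikit-learn. The Ridge model, K = 5 and the random data are placeholders, and since this is a regression task the scoring is R² rather than accuracy:

```python
# K-fold cross-validation sketch: split into K folds, hold one fold out at a
# time, score on it, and average the per-fold scores.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(200, 30)                    # placeholder data
y = X[:, 0] * 10 + rng.randn(200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)           # step 1
scores = cross_val_score(Ridge(), X, y, cv=kf, scoring="r2")   # steps 2-4
print("fold scores:", scores)
print("estimated out-of-sample R2:", scores.mean())            # step 5
```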

Reasonable things to do

Here I will try to propose some solutions you can use if you face some of the problems mentioned above. Some of them my team used, others we didn’t.

  • Stick to simple models: in this competition the first-choice algorithms were decision-tree-based models, but after teaming up with Steven he told me that simple linear models were also scoring well and sometimes outperformed more complex models like Random Forest and xgboost, which surprised me.
  • Perform feature selection: this was maybe the most valuable thing to do in this competition. The dataset contained more than 300 features, and plotting a correlation matrix for a subset of them (with so many features, we did this 15 features at a time) helped us realize that many of them were redundant. That is bad: if we feed redundant information to our model, it keeps learning the same thing again and again, which doesn't help, so whenever possible we want to remove redundant features (see the sketch after this list).
Correlation matrix, done using seaborn (for plotting) and pandas (for data representation)
  • Use ensemble models: ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. In particular, if your models do well on different things (e.g. one is good at classifying negative examples and another at classifying positive examples), combining them will often yield better results. Our final model was a combination of my models (most of them xgboost) and Steven's models (RF, SVR, GBM, ElasticNet, Lasso and others). Without going into more details, what I can tell you is that combining our models gave us an improvement (a simple sketch follows below).
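
Below is a hedged sketch of two of these ideas put together: pruning highly correlated (redundant) features with a pandas correlation matrix (seaborn only draws the heatmap), then blending a linear model with a tree booster by simple averaging. The 0.95 correlation cut-off, the model choices and the 50/50 weights are illustrative, not our actual setup.

```python
# Correlation-based feature pruning plus a simple two-model blend.
# Assumes the Kaggle train.csv with an ID column and the target y.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import ElasticNet
from xgboost import XGBRegressor

train = pd.read_csv("train.csv")
X = train.drop(columns=["ID", "y"]).select_dtypes(include=np.number)
y = train["y"]

corr = X.corr().abs()
sns.heatmap(corr.iloc[:15, :15])          # quick look at a 15-feature slice
plt.show()

# drop one of each pair of features correlated above 0.95 (illustrative cut-off)
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]
X_pruned = X.drop(columns=redundant)

# blend by averaging the predictions of a linear model and a tree booster
linear = ElasticNet(alpha=0.1).fit(X_pruned, y)
booster = XGBRegressor(n_estimators=200, max_depth=2).fit(X_pruned, y)
blend = 0.5 * linear.predict(X_pruned) + 0.5 * booster.predict(X_pruned)
```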

Final Notes

This post was not intended to be our solution write-up for the Mercedes-Benz Greener Manufacturing competition, but to raise some of the problems you might encounter while researching a new problem or working on a data-driven project. Finding good, large data is what we always want; ideally we would like datasets with at least 50K training examples. This post doesn't describe the complete solution, and there are some details I omitted.

About my team (🏎 Benz 🏎)

In this competition Steven and I ended up in position 91 out of 3835 teams.

Kaggle profile: Steven Nguyen

My special thanks to Steven Nguyen for his support both in the Mercedes-Benz competition and while I was writing this post.

Further Readings

Let me know what you think about this, and whether there is a topic you would love to see me write about.

And as always, if you enjoyed the writing, leave your claps 👏 to recommend this article so that others can see it.

