
10 Steps to Pass the Data Science Take-Home Task

Make sure to include the following to impress your interviewers

Note: Although this article is specifically tailored towards tasks on tabular data, most of these steps will also apply to non-tabular data (i.e. NLP or CV tasks).

Photo by Rich Tervet on Unsplash

1. Data Analysis & visualisation 📈

This is pretty obvious, it being the first step that Data Scientists go through when presented with a shiny new dataset. There is, unfortunately, no template for this step, so just make sure your interviewers can follow the thought process you went through while doing the Data Science challenge. Clean, well-designed data visualisations will help impress your readers.

2. Data transformations and feature engineering 🏗

This includes:

  • Handling categorical variables using e.g. one-hot, target, label, count or percent encoding. It’s a good idea to use sklearn.pipeline.Pipeline to combine transformations into one pipeline. Remember to apply the same encoder or pipeline that was fitted on the training data to the test dataset (see the sketch after this list)!
  • Re-sampling, e.g. using imblearn, if your dataset is imbalanced (e.g. many more y==1 than y==0).
  • Handling dates by transforming them to numerical variables. Be careful when choosing the reference date.
  • Dataset-specific feature engineering techniques. Depending on the dataset at hand, create the features that you think might have predictive power.
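As a minimal sketch of such a pipeline – assuming a pandas DataFrame with hypothetical column names – the transformations are fitted on the training data only and the fitted object is then re-used on the test set:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training/test frames with hypothetical columns
X_train = pd.DataFrame({
    "channel": ["web", "app", "web"],
    "age": [34, 51, 29],
    "signup_date": pd.to_datetime(["2020-01-03", "2020-02-11", "2020-03-20"]),
})
X_test = X_train.copy()

# Handle dates by converting them to a numeric offset from a reference date
reference = pd.Timestamp("2020-01-01")
for df in (X_train, X_test):
    df["days_since_ref"] = (df["signup_date"] - reference).dt.days
    df.drop(columns="signup_date", inplace=True)

# Combine the encodings into a single transformer
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),
    ("num", StandardScaler(), ["age", "days_since_ref"]),
])

# Fit on the training data only, then re-use the fitted object on the test set
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
```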

3. Training a state-of-the-art predictive model

Gradient boosting methods (or possibly random forest) are generally considered the state-of-the-art classifiers when it comes to tabular data. Currently, three variations of gradient boosting are particularly widely used, implemented in the following Python libraries:

  1. xgboost
  2. lightgbm
  3. catboost

Although results differ from dataset to dataset, on a high level and in my experience, catboost tends to perform better when the dataset has many categorical features, lightgbm is generally slightly less accurate but much faster to train, and xgboost usually performs well but is slower than lightgbm (although blazing fast compared to some other methods).

Remember to also choose the correct metric for your model. AUC is a good default, since it is threshold-independent and takes both sensitivity and specificity into account. For more specific situations, other metrics may be needed, depending on the relative cost of false positives and false negatives.
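As a rough sketch of this step – with a synthetic dataset and placeholder hyperparameters standing in for the real ones – a gradient boosting classifier can be evaluated on cross-validated AUC like this:

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the take-home dataset
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42)

model = LGBMClassifier(n_estimators=300, learning_rate=0.05, random_state=42)

# Report AUC rather than accuracy, which can be misleading on imbalanced data
auc_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Mean AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")
```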

4. Training a basic predictive model

If you have time, it may be a good idea to train a basic model such as linear regression or logistic regression and justify the difference in performance as a reasonable trade-off for the complexity that a more advanced model introduces.
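A sketch of such a baseline, on the same kind of synthetic data as above, could be as simple as a scaled logistic regression evaluated with the same metric:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42)

# A scaled logistic regression serves as the simple benchmark
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline_auc = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
print(f"Baseline AUC: {baseline_auc.mean():.3f}")
```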

5. Use cross-validation

This may seem obvious, but keep it in mind when reporting the performance of the model and when trying to improve it. Particularly when making small, incremental improvements to the model, it is useful to have a sufficient number of folds and repeats to reduce the fluctuation in the performance metric to a minimum.

You may also want to create an explicit cross-validation loop using e.g. RepeatedKFold if you use target encoding, as it gives you full control over what happens in the in-bag dataset vs. the out-of-bag dataset (see the sketch below).
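A minimal sketch of such a loop – with made-up data, a hypothetical categorical column and a hand-rolled target encoding fitted inside each fold – might look like this:

```python
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedKFold

# Toy data: one categorical column whose levels have different event rates
rng = np.random.default_rng(0)
df = pd.DataFrame({"segment": rng.choice(list("ABCD"), size=1000)})
probs = df["segment"].map({"A": 0.1, "B": 0.3, "C": 0.5, "D": 0.7})
y = pd.Series((rng.random(1000) < probs).astype(int))

scores = []
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
for train_idx, test_idx in cv.split(df):
    train, test = df.iloc[train_idx].copy(), df.iloc[test_idx].copy()
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Target encoding is fitted on the in-bag data only, then applied out-of-bag
    means = y_train.groupby(train["segment"]).mean()
    train["segment_te"] = train["segment"].map(means)
    test["segment_te"] = test["segment"].map(means).fillna(y_train.mean())

    model = LGBMClassifier(n_estimators=100, random_state=42)
    model.fit(train[["segment_te"]], y_train)
    scores.append(roc_auc_score(y_test, model.predict_proba(test[["segment_te"]])[:, 1]))

print(f"AUC: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```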

If you want to take cross-validation further, you can repeat the procedure with different settings to see how many folds you actually need to get an accurate estimate of the model performance.

6. Optimising the machine learning model

Optimal performance of the model can be achieved by finding the best hyperparameters for the dataset in question. Although rules of thumb exist, features of the dataset (for example how much interaction there is between the variables) determine what the optimal hyperparameters are. The following three methods are generally used for hyperparameter tuning:

  1. Grid search: Conceptually simple but very resource-intensive, particularly when optimising algorithms with a large number of hyperparameters (such as xgboost)
  2. Random search: Can often perform better than grid search but may not reach the best solution
  3. Bayesian Hyperparameter Optimisation: Currently a best practice

As the current best practice when it comes to hyperparameter optimisation, the third option is recommended. Besides optimising the model’s own hyperparameters, it has an important advantage: additional variables can be treated as hyperparameters too. For example, besides the hyperparameters used by xgboost, the method can be used to determine

  • Which variables to one-hot encode and which to target-encode, based on cardinality (this parameter is named one_hot_max_size in catboost)
  • How much smoothing to add to target encoding
  • The minimal amount of records required in a category before target-encoding it

Bayesian Optimisation is typically organised in three steps:

  1. Defining the objective function. This includes hyperparameter-dependent transformations and cross-validating the results
  2. Defining the hyperparameter search space, i.e. the distributions to sample from.
  3. Running the optimisation over a specified number of iterations (e.g. 1000) – see the sketch after this list
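As a rough sketch of these three steps – using hyperopt for the optimisation with an illustrative search space and evaluation budget – this could look as follows:

```python
import numpy as np
from hyperopt import Trials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Step 1: objective function -- cross-validated AUC for a given set of hyperparameters
def objective(params):
    model = XGBClassifier(
        n_estimators=200,
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        subsample=params["subsample"],
        eval_metric="logloss",
    )
    auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    return 1.0 - auc  # hyperopt minimises, so return 1 - AUC

# Step 2: search space, i.e. the distributions to sample from
space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
}

# Step 3: run the optimisation for a fixed number of iterations
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=Trials())
print(best)
```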

7. Explaining the machine learning model

Three methodologies for model explainability are generally used, represented by the Python packages Eli5, LIME and SHAP. Currently SHAP is considered by many as best practice for models built on tabular data (and beyond). SHAP has two important advantages:

  • Consistency: the attributed feature importances can be compared between models
  • Local accuracy: the feature attributions for a prediction sum up to the model’s output for that prediction

Global feature importance

The first method that is used to explain a model is typically feature importance. The xgboost package has a dedicated method that can be used to plot the importance of each variable. Unlike parametric methods, which produce a coefficient for each predictor, gradient boosting offers multiple ways of calculating the importances. There are three options in xgboost:

  • Weight: given by number of times a variable occurs in the trees
  • Gain: average gain in accuracy due to splits using a particular variable
  • Cover: given by the number of observations affected by the predictor

Although the default option in plot_importance is ‘weight’, gain is often thought to be a better measurement and is used below.
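A minimal sketch on synthetic data – the model and features are placeholders – of plotting gain-based importance:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from xgboost import XGBClassifier, plot_importance

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
model = XGBClassifier(n_estimators=100, eval_metric="logloss").fit(X, y)

# 'weight' is the default; switch to 'gain' for a more informative ranking
plot_importance(model, importance_type="gain", show_values=False)
plt.show()
```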

Local feature importance

It can often be useful to interpret predictions for individual observations. This is helpful both for debugging the model and for explaining the prediction for individual customers to the stakeholders. Stakeholders in Marketing, for instance, may feel more confident about communicating to customers when they know why a customer has been scored in a particular way. The local force plot in the shap library offers a nice visualisation (see below). The red factors push the predicted score up and the blue ones pull it down.
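A sketch of producing such a force plot for a single observation with shap – again on synthetic data, with the model standing in for the real one – could look like this (the interactive plot renders in a notebook):

```python
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
model = XGBClassifier(n_estimators=100, eval_metric="logloss").fit(X, y)

# TreeExplainer computes exact SHAP values for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Force plot for the first observation: red features push the score up, blue pull it down
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0, :], X[0, :])
```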

8. Use AWS or GCP – if possible ☁️

If the job requires cloud computing and you have time, it might be a good idea to include in your submission scripts to run the code on a cloud VM.

This is particularly the case if the dataset you received is large and training takes a long time. However, even if that is not the case, you can still use a VM for resource-intensive jobs such as hyperparameter optimisation (described above).

9. Document code and use Git – if possible

Git may not be a requirement in all Data Science jobs. If you are applying to a software company, however, it wouldn’t hurt to show that you can use git and are disciplined when it comes to version control.

10. Include unit tests – if possible

Particularly if the company you are applying to is in tech, you may get bonus points if you include unit tests as well – if time allows. The Python pytest library would be a good choice. Even if you can’t manage to cover all your code with unit tests, it will be an advantage to show the company that you are able to write resilient, production-ready code.
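For example, a tiny pytest test for a hypothetical date-handling helper (defined inline here for brevity; in a real submission it would be imported from your own module) might look like this:

```python
# test_features.py -- run with `pytest`
import pandas as pd

# Hypothetical helper that would normally live in the submission's own module
def days_since_reference(dates: pd.Series, reference: str) -> pd.Series:
    return (dates - pd.Timestamp(reference)).dt.days

def test_days_since_reference():
    dates = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-11"]))
    assert days_since_reference(dates, "2020-01-01").tolist() == [0, 10]
```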


Conclusion

Depending on how much time you have, you might not be able to cover all of these. Getting as many in as possible should help you impress your interviewers, though. Here’s a checklist of the 10 points:

  1. ✅ Show some insightful analysis
  2. ✅ Engineer some clever features
  3. ✅ Train a complex model
  4. ✅ Train a simple model
  5. ✅ Don’t forget Cross-Validation!
  6. ✅ Optimise model hyperparameters
  7. ✅ Explain the model
  8. ✅ Use AWS or GCP – if possible
  9. ✅ Document code and use Git – if possible
  10. ✅ Include unit tests – if possible
  11. 🎉 Negotiate the salary 💰

Good luck! 🤞

