The most valuable resource if you’re a Data Scientist — or wannabe

Félix Revert
Towards Data Science
3 min read · Apr 16, 2018

I’ve been working as a Data Scientist for a couple of years now, and I have been teaching data science both as a private tutor and as a university lecturer. This article reflects my own view. Please note I am not paid to promote any website.

Where to start learning and keep improving?

There’s a plethora of online resources that provide high-quality, hands-on data science content: coursera.org, fast.ai, the scikit-learn documentation and Quora, to cite some of the websites I use a lot. But there’s one that stands out from the crowd: kaggle.com

Let me explain why Kaggle is so cool, and why you should absolutely learn on this website.

What is kaggle.com?

Kaggle is an online platform for data science competitions. It also brings together one of the largest communities of data scientists in the world. The competitions, hosted by trusted companies and organisations, range from fraud detection to natural language processing to image processing, among others. And of course it’s free.

Currently 16 open competitions, on image processing, sales forecasting…

Learn from the best

The Kaggle community shares a lot of insights on how to become a practitioner in data science, which is gold for learners.

Kaggle provides for each competition:

  • a dataset with explanations. The data is usually real-world data, which is great since you want to be good at data science in the real world
  • the definition of the metric (or measure) used to assess an algorithm’s performance. This is key to understanding what specific goal we want to achieve with machine learning. Depending on what your metric is, a simple algorithm can sometimes be sufficient. The choice of metric largely determines the complexity of the problem
  • rules and prizes, which do not really matter unless you’re an experienced data scientist and have hours to spend on one challenge :)
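
To illustrate the point about metrics, here is a minimal sketch in Python. Everything in it is made up for illustration: the fraud-detection labels, the two sets of predictions, and the helper names accuracy and f1 are hypothetical and not taken from any Kaggle competition. It suggests how the choice of metric can change which approach looks best: a trivial "always predict legitimate" baseline wins on accuracy, yet scores zero on F1.

```python
# Toy labels, invented for illustration: 1 = fraud, 0 = legitimate.
# 95 legitimate transactions followed by 5 frauds.
y_true = [0] * 95 + [1] * 5

# Baseline: always predict "legitimate".
pred_baseline = [0] * 100

# Hypothetical model: raises 6 false alarms and catches 4 of the 5 frauds.
pred_model = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1


def accuracy(truth, pred):
    """Share of predictions that match the true label."""
    correct = sum(1 for t, p in zip(truth, pred) if t == p)
    return correct / len(truth)


def f1(truth, pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    # Convention used here: return 0.0 when there are no positives at all.
    return 2 * tp / denominator if denominator else 0.0


for name, pred in [("baseline", pred_baseline), ("model", pred_model)]:
    print(name, "accuracy:", accuracy(y_true, pred), "F1:", round(f1(y_true, pred), 2))
```

On this toy data the baseline reaches 0.95 accuracy against 0.93 for the model, while F1 is 0 for the baseline and about 0.53 for the model. If the competition scored on accuracy, doing nothing would look hard to beat. Under F1, you actually have to find the frauds.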

Now come the most important resources:

  • the Discussion section, where data scientists share questions about the data or the metric and ask for advice or help to improve a model. It is a great way to broaden your view of how a data scientist would solve a given problem. Sort the topics by upvotes to get the most valuable contributions
  • the Kernels section (my go-to resource), where data scientists share not only the code of the algorithms that earn them a high score in the competition, but also their code to explore and visualize datasets. As with discussions, you can sort kernels by upvotes. The most upvoted kernels provide explanations and visualizations. They can be mind-blowing

Kernels are scripts, usually in Python or R, showcasing a data scientist’s skill at performing well in a competition and/or exploring and visualizing datasets

An example of data visualization from a Kernel, created by a data scientist on Kaggle for a competition about road traffic in NYC. The data and the code are available to everyone for free

Below are some of the kernels most appreciated by the community. You should check them out! And don’t forget: good artists copy, great artists steal.

If you liked this article, consider giving it at least 50 👏 :)

Product Manager @Doctolib after 5 years as data scientist. Loves when ML joins Products 🤖👨‍💻