
10 tips I wish I had known before starting a Data Journey

After more than 6 years working with data, here's what I would have loved to know before starting this amazing journey


Bradley Efron signing my copy of the book "Computer Age Statistical Inference" at Stanford University. Photo by the author

Lots of things are said about Data Science, Machine Learning, Artificial Intelligence and Big Data: from the "new oil" framing, to the "sexiest job of the 21st century", to hyped headlines like "Forty percent of ‘AI startups’ in Europe don’t actually use AI, claims report".

With the rise of a new industry, promises and expectations are constantly buzzing. Lots of people are studying the field, but the information is quite scattered and not homogenised. This post is a brief summary of the 10 things I would have loved to know about data before I started:

  1. Context: How did we end up here?
  2. It’s all about dimensions
  3. Making sense over dimensions -> Training an algorithm
  4. Clever Data over complex algorithms
  5. There is NOT a Machine Learning problem waiting for you
  6. Build end-to-end, iterate fast and communicate
  7. Open Source, this is NOT a religion
  8. Learn every day and deal with impostor syndrome
  9. Be humble
  10. Future, professions and topics

This talk was the keynote that kicked off the 3rd edition of the AI Saturdays Madrid chapter. AI Saturdays is an amazing initiative to democratise access to Artificial Intelligence by bringing people together to learn by coding, with the aim of creating projects.

Stay tuned for the 4th edition! https://www.saturdays.ai/city-madrid.html. Photo by the author

Let’s start!


1.- Context: How did we end up here?

At some point, it looked like all of a sudden most companies had a strong interest in "Big Data". From all that noise, a sentence (usually attributed to Dan Ariely) came up: "Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it"

Three things evolved to create this "Big Data" opportunity:

  1. Data: the digitalisation of society and the ability to gather and store all this knowledge. Think of data as the raw material
  2. Computation: computational power evolved following Moore’s law, which describes the doubling of the number of transistors on a chip roughly every two years. This enables the ability to process all this knowledge. Think of it as the tools
  3. Algorithms: most of the algorithms we use today have been around since the ’80s (and even before), but now we finally have the data and the compute to put them to work on all this knowledge. Think of them as the recipe

2.- It’s all about dimensions

Think of data as dimensions, like points on a graph. A dimension could be something as simple as an Excel column. In the following example you can see three people represented by a set of variables (dimensions). These variables can be represented on a graph:

In this case we have 3 dimensions: distance_loom, weekly_hours and churn_ai. The objective here is to separate the different users with an algorithm.

Once you represent the information in dimensions, you will understand much better what an algorithm does.
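To make this concrete, here is a minimal sketch in Python: the column names are the illustrative ones from the example above, and the values are invented.

```python
# A minimal sketch: three people described by three variables (dimensions).
# Column names follow the example above; the values are made up.
import pandas as pd

users = pd.DataFrame(
    {
        "distance_loom": [1.2, 5.4, 0.3],  # dimension 1
        "weekly_hours": [10, 2, 25],       # dimension 2
        "churn_ai": [0, 1, 0],             # dimension 3
    },
    index=["user_1", "user_2", "user_3"],
)

# Each row is a point in a 3-dimensional space; each column is one axis.
print(users)
```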

3.- Making sense over dimensions -> Training an algorithm

Once you see data as a collection of dimensions, think of an algorithm as a tool that tries to make sense of those dimensions. In the following example, the objective is to differentiate the blue from the red users with the available data (columns / dimensions). Different algorithms will extract their logic in different ways:

  • Logistic regression: in a linear way
  • Support Vector Machine (SVM): in a non-linear way
  • Random Forest (RF): splitting each dimension in a "tree"-like way (see the sketch below)
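As a hedged illustration, here is how those three styles of model could be fitted with scikit-learn; the "blue vs red" users are simulated here, not the ones from the figure.

```python
# Sketch: three algorithms extracting their logic in different ways.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Simulated "blue vs red" users in 2 dimensions.
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

models = {
    "Logistic (linear boundary)": LogisticRegression(),
    "SVM (non-linear boundary)": SVC(),
    "Random Forest (tree-style splits)": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: training accuracy = {model.score(X, y):.2f}")
```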

The function the algorithms optimise to differentiate these points (in our case, the separation between users) is what we call the objective function: a function that guides the algorithm and tells it whether it is right or wrong:

The final objective is to get the Real Value and the Predicted Value as close as possible, without overfitting. The algorithm takes the available dimensions (variables) and applies its own logic to approach this objective.
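For instance, log loss is one common objective function for classification. This minimal sketch shows how it rewards predictions that are close to the real values:

```python
# Sketch: an objective function tells us how wrong the predictions are.
import numpy as np
from sklearn.metrics import log_loss

y_real = np.array([0, 0, 1, 1])             # real values (blue = 0, red = 1)
good_pred = np.array([0.1, 0.2, 0.8, 0.9])  # close to the real values
bad_pred = np.array([0.9, 0.8, 0.2, 0.1])   # far from the real values

print("good predictions, low loss:", log_loss(y_real, good_pred))
print("bad predictions, high loss:", log_loss(y_real, bad_pred))
```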

4.- Clever data over complex algorithms

There is a bias towards complexity in Machine Learning, period. Once you are working in the field, you may be tempted to mix different models in some esoteric way, which only adds more and more complexity. Usually, the simpler the better. To improve your algorithm’s performance, there are two clear levers:

  1. Feature Engineering: creating new variables from the current information.
  2. Clever data: adding new data related to your problem.

Almost always, clever data will win. When you are facing an ML problem, think about which variables will help you improve your model.
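Here is a hedged sketch of both levers; every column name below is hypothetical.

```python
# Sketch: feature engineering vs clever data, on invented example columns.
import pandas as pd

orders = pd.DataFrame(
    {"user": ["a", "b"], "total_spent": [120.0, 30.0], "n_orders": [10, 2]}
)

# 1. Feature engineering: a new variable from the current information.
orders["avg_ticket"] = orders["total_spent"] / orders["n_orders"]

# 2. Clever data: new, related data merged in from another source
#    (a hypothetical customer-support table).
support = pd.DataFrame({"user": ["a", "b"], "n_complaints": [0, 3]})
orders = orders.merge(support, on="user")

print(orders)
```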

5.- There is NOT a Machine Learning problem waiting for you

Period. When you arrive at a company, you may expect to find a Machine Learning problem ready for you, just like in the courses. Reality couldn’t be further from that. There are three types of situations:

  1. Companies that are still working on their "big data" setup, where you could be their first data hire, which for a first job is a painful situation. This is easy to detect if you hear "yes, we have lots of data in .csv files"
  2. Companies with a good tech stack, where you will have access to a database and can find an ML problem. You’ll have to judge where and when to apply ML
  3. A third type, where you are welcomed by a great data team. This case is difficult to find

You should have two things in mind when looking for a Machine Learning problem:

  1. Work on questions worth answering
  2. Leverage your solution with human work

6.- Build end-to-end, iterate fast and communicate

The key to having an impact is to iterate end-to-end through the algorithm process as fast as you can. Your variables could be misleading or unrealistic for your problem (a leak), and some of them could be unfeasible to calculate in production. Ask yourself the question: does my prediction make sense?

These are the kind of things that you realise when you iterate until the end of the process.
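One concrete example of a leak you only catch by going end-to-end: fitting preprocessing on all the data lets test information bleed into training. A minimal sketch of keeping every step inside the validation split, using scikit-learn on simulated data:

```python
# Sketch: a Pipeline keeps preprocessing inside each cross-validation
# split, so the scaler never sees the held-out data (no leak).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```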

Looking for a job? Do things, share things. Have a GitHub repo, participate in hackathons, mix with your community and… don’t sound like a Furby.

7.- Open Source, this is not a religion

From licensed software to open knowledge. Thank you, Open Source. Nowadays, knowledge is open to everyone and companies can no longer keep it captive. It wasn’t always that way. For years there has been a recurring discussion about which is better:

Old enterprise software? Open-source R or Python? The debate has now shifted to which framework to use. Think in an "Open Source way": embrace knowledge and be ready to learn something new. New, unfruitful discussions like these will surely arise; do not waste your time on them.

8.- Learn every day and deal with impostor syndrome

This is a marathon rather than a sprint. On the left you have what was called a Data Scientist back in the day: a unicorn (🦄). On the right, a more realistic diagram from David Silver’s RL course (link).

In my humble opinion, it is nearly impossible to stay up to date and be an "A.I. expert". You can be an expert in one field, but not in all of them. This can generate some sort of impostor syndrome. Do not worry: it happens to almost every Data Scientist I have talked to. In this profession you have to learn every day.

9.- Be humble

This is Trevor Hastie, a Stanford professor and one of the most recognised statisticians in the world. The story behind the picture is that I knocked on his door asking him to autograph a book (yes, nerd), and we had a great conversation about statistics, Deep Learning and R.

Photo by the author.

He, one of the best in the field, was extremely humble and kind to me. A very important lesson for the journey.

Full story here 👇

10.- Future, professions and topics

The boom of the Data Science profession could be compared to the "Doctor" profession being born all of a sudden. Data Scientist is a generic term that is becoming more and more specialised. Two profiles stand out:

  1. Machine Learning Engineer: cloud + algorithms
  2. Applied Data Scientist: spotting the ML opportunities

One of the most interesting topics in my opinion is Causal Inference. We are creating amazing things that predict, build insights and help us, but how do they make decisions? Can we conclude causality from our analysis? Break through the correlation in order to understand why things happen.
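A hedged sketch of why this matters: a hidden confounder can make two variables strongly correlated even when neither causes the other (all numbers here are simulated).

```python
# Sketch: correlation without causation, via a hidden confounder z.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)            # hidden common cause
x = z + 0.5 * rng.normal(size=10_000)  # driven by z
y = z + 0.5 * rng.normal(size=10_000)  # also driven by z

# x and y correlate strongly (~0.8), yet neither one causes the other.
print("corr(x, y) =", round(np.corrcoef(x, y)[0, 1], 2))
```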

Have fun in your Data Journey!


Full Presentation

Do not hesitate to contact me at: [email protected]

