A Few DON’Ts for Data Scientists

Many Data Scientists focus on algorithms and mathematics but fail to learn an important skill – communication.

Photo by Markus Spiske on Unsplash

In this post, I present a few habits to avoid when doing Data Science. They are not related to coding or mathematics but to a commonly underappreciated skill – communication. As a Data Scientist you need to report the results of experiments on a regular basis, propose further work, etc. This is where the gap between Junior and Senior Data Scientists is the greatest.

To make this post fun, I present these habits through the story of Tom, an aspiring Data Scientist. Hopefully, avoiding them will make Data Science more pleasant for you and your coworkers.

Here are a few links that might interest you:

- Labeling and Data Engineering for Conversational AI and Analytics
- Data Science for Business Leaders [Course]
- Intro to Machine Learning with PyTorch [Course]
- Become a Growth Product Manager [Course]
- Deep Learning (Adaptive Computation and ML series) [Ebook]
- Free skill tests for Data Scientists & Machine Learning Engineers

Some of the links above are affiliate links and if you go through them to make a purchase I’ll earn a commission. Keep in mind that I link courses because of their quality and not because of the commission I receive from your purchases.

1. Don’t Report Metrics Too Soon

Tom, an aspiring Data Scientist, is excited about the new project that was assigned to him, and he has spent a week working on it. He is expected to give a progress report at the weekly meeting and he would like to impress his coworkers with his latest findings. He hurries and adds a few lines of code to output the metric so that he can report the results.

While developing a concept in a short amount of time is not bad, reporting results too soon is. If you haven’t finished Exploratory Data Analysis, reviewed the most important features of the model and tried to predict a few basic samples, don’t report the metrics, because there is a high chance that they are wrong.

When people hear metrics, they don’t expect them to come from a faulty experiment, so they try to justify them with real-world scenarios. The meeting drags on and wastes everybody’s time.

2. Don’t Complain About the Data Quality

Tom finds a mistake in his dataset and he goes back to the first step – dataset extraction. This time he is more careful with the process and he finds new mistakes in the data. Tom starts complaining about data quality.

You will often find inconsistencies and mistakes in the data. Report them to the team, but don’t complain about them: complaining about the data quality is not a productive habit.

What if I told you that the data is never perfect (at least, I haven’t worked with a perfect dataset yet)? You are working on a dataset from the real world, not from a Kaggle competition.

3. Don’t Go Too Deep When Discussing Data Science

Image from MakeMake

Communication skills are what clearly separate a Junior from a Senior Data Scientist. Data Science is not just about finding patterns, training models, statistics, etc. A large part of the job is reporting the results of experiments, proposing further work and making Data Science accessible to non-technical people.

Discussing Data Science with your coworkers can be difficult, as many people don’t understand it. Even some Software Developers talk about it as black magic, but who would blame them for not being a "Jack of all trades"? When you intend to run an experiment with a novel Neural Network architecture, don’t go too deep into details when presenting it. Don’t get me wrong: if you are having a discussion with an expert in the field, go into great detail, but many people would just like to hear a high-level overview. Imagine how you would feel if you had to listen to the details of a new Jira feature 🙂

When you understand a concept well, you can explain it in simple terms and analogies. Search for "Explain Like I’m 5" to get a feeling for it.

4. Don’t Say the Project Is Finished Too Soon

Image from MemeGenerator

Tom followed the first tip above and his latest experiment returned great results. From his perspective "the hardest work" has been done, so he reports that in a few days the project will be finished.

While it is true that Tom has a (potentially) great Machine Learning model on his laptop, the reality is that the model is far from production. Many questions arise at this point (hopefully Tom already discussed them in the planning phase):

  • What will be the business impact of the model?
  • Will the model make real-time or offline predictions?
  • Are the features of the model available in the production environment?
  • When will the model be retrained with the new data?
  • How will you version the model?
  • I could go on and on…

Be clear about what you mean by "finishing the project". The fact that your part is nearly finished doesn’t mean that the whole project is finished. Otherwise, coworkers could get the impression that the model will be in production next week.

5. Don’t Be One Step Behind

Think ahead from MEME

After one month, Tom and the backend team managed to put the model in production. The model has been running for some time and the management team would like to see a report on its performance. Tom estimates he won’t need much time to create a performance report. He swiftly takes the latest data from the database and starts working on it.

Before extracting the dataset for the report, Tom should have thought about how he would present it later. It may seem great to extract the latest data, e.g. one month of data from 15th Nov to 15th Dec, but this won’t seem great to the management team, which usually looks at performance metrics from the start of the month/quarter.
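
As a minimal sketch of aligning the extraction window with the calendar, the snippet below computes the bounds of the last complete month instead of taking a rolling window. The function name previous_full_month and the year 2023 are assumptions made for illustration; the right reporting period depends on what your management team actually tracks.

```python
from datetime import date, timedelta


def previous_full_month(today: date) -> tuple[date, date]:
    """Return the first and last day of the calendar month before `today`.

    Hypothetical helper for illustration, not part of any library.
    """
    first_of_this_month = today.replace(day=1)
    # Stepping back one day from the 1st lands on the last day of the previous month.
    last_of_previous = first_of_this_month - timedelta(days=1)
    first_of_previous = last_of_previous.replace(day=1)
    return first_of_previous, last_of_previous


# Running the report on 15th Dec (assumed year: 2023) covers all of November,
# not the rolling window from 15th Nov to 15th Dec.
start, end = previous_full_month(date(2023, 12, 15))
print(start, end)  # 2023-11-01 2023-11-30
```

The same idea extends to quarters; the point is to pick the window your audience already uses before extracting any data.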

Another habit that Tom is guilty of is using the < operator in the code. He intended to include up to 100 sales for each customer. The condition n_sales < 100 stops at 99 sales, not 100, which is fine from a numerical standpoint. The problem arises when the management team asks why there are no customers with 100 sales – this has to be a bug, right? Use n_sales <= 100 to avoid this problem.
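
The difference between the two comparisons is easy to see in a small sketch. The customer names and counts below are made up for illustration, and LIMIT is a hypothetical name for the threshold of 100 mentioned above.

```python
# Hypothetical per-customer sales counts (illustrative data, not from a real dataset).
n_sales_by_customer = {"alice": 99, "bob": 100, "carol": 101}

LIMIT = 100

# Strict comparison: a customer with exactly 100 sales is silently dropped.
exclusive = [name for name, n_sales in n_sales_by_customer.items() if n_sales < LIMIT]

# Inclusive comparison: the round number that people expect to see is kept.
inclusive = [name for name, n_sales in n_sales_by_customer.items() if n_sales <= LIMIT]

print(exclusive)  # ['alice']
print(inclusive)  # ['alice', 'bob']
```

Both filters are numerically defensible; the inclusive one simply matches what a reader of the report assumes "100 sales" means.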

Don’t Be One Step Behind. Think one step ahead!

Conclusion

We went through a few habits to avoid when working on Data Science projects. I’ve noticed these miscommunications throughout my career by observing myself and others.

Do you notice a frequent miscommunication while working in the field or with Data Scientists? Let me know in the comments.

Before you go

Follow me on Twitter, where I regularly tweet about Data Science and Machine Learning.

Photo by Courtney Hedger on Unsplash
