
Change Your Mindset to Become a Better Data Scientist

An example of changing the mindset to solve data science problems better.


Photo by Ross Findon on Unsplash

I have taught various data science topics to different groups of scientists, most of them not data scientists. At the beginning of each class, I asked a simple question to break the ice. Interestingly, most non-data scientists, unlike data scientists, found it a hard question. I am going to share the question with you and show you why many non-data scientists could not answer it. I hope it helps you change your mindset and become a better data scientist.

The Question

Here is the question that I call "Crazy Boss Puzzle."

Imagine a situation like this. I am working on a data science project, and I need to build a predictive model for a specific purpose. After months of research, study, data cleaning, model development, and testing, I develop a model that has 99% accuracy on the test dataset (a blind test). Happily, I present my model to my boss, and he tests it with another blind test. Again, my model shows 99% accuracy. I am sure that my boss will get excited and congratulate me. But all of a sudden, he gets mad and throws my model away. He tells me that he will give me one more chance to come up with a better model. A week later, I come back to my boss’s office with a new model. The new model has 95% accuracy on the test dataset. After I present the new model, my boss gets excited and congratulates me. My boss is not CRAZY, and there is a good reason for his behavior.

Now the question is:

Can you give me an example in which a model with 95% accuracy is much better than a model with 99% accuracy?

The Answer

I am sure you agree that there are many good and simple answers to this question (since many readers of my articles are data scientists). To give you an example, think about a "Credit Card Fraud Detection" model. This model needs to classify transactions as VALID or FRAUDULENT. More than 99% of credit card transactions are valid. A naive model that returns VALID for every transaction can therefore reach more than 99% accuracy without detecting a single fraud. The goal of these kinds of projects, however, is to detect the FRAUDULENT transactions, so accuracy alone is not a good measure of how good or bad the model is. A model that detects all of the fraudulent transactions, at the price of a few false positives (i.e., valid transactions incorrectly flagged as fraudulent), is much better than the naive one even if its accuracy is lower.

In fact, any predictive analytics project that works with imbalanced data fits this story well and would be a valid answer.
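To make the fraud detection example concrete, here is a minimal sketch in plain Python. The numbers are hypothetical (10,000 transactions, 100 of them fraudulent, and a second model that raises 500 false alarms) and were chosen only to reproduce the 99% and 95% figures from the puzzle. The helper functions accuracy and recall are written here for illustration and do not come from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive cases that the model flagged as positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical, imbalanced data: 1 = FRAUDULENT, 0 = VALID.
y_true = [1] * 100 + [0] * 9900

# Naive model: answers VALID for every transaction.
naive_pred = [0] * 10000

# Second model (assumed): catches all 100 frauds but also
# flags 500 valid transactions by mistake.
careful_pred = [1] * 100 + [1] * 500 + [0] * 9400

naive_accuracy = accuracy(y_true, naive_pred)
naive_recall = recall(y_true, naive_pred)
careful_accuracy = accuracy(y_true, careful_pred)
careful_recall = recall(y_true, careful_pred)

print(f"naive:   accuracy={naive_accuracy:.0%}, fraud recall={naive_recall:.0%}")
print(f"careful: accuracy={careful_accuracy:.0%}, fraud recall={careful_recall:.0%}")
```

Under these assumptions, the naive model reaches 99% accuracy while catching none of the frauds, and the second model drops to 95% accuracy while catching all of them. That is the kind of situation in which the boss in the puzzle has every reason to prefer the 95% model.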

How People Miss the Answer

I found two main reasons why many of the non-data scientists I asked missed the answer.

  1. Thinking about regression problems.
  2. More accuracy means less error.

I will explain both reasons in more detail.

Thinking about regression problems

After talking to some people in my classes, I found out that most of them tried to think of a regression problem when I was talking about predictive analytics problems. Their minds were looking for regression problems that fit my story. Although it is not impossible to find a problem like that, the answer is much easier to find among classification problems. It seems like a simple observation, but on several occasions I have found that scientists (mostly non-data scientists) think first of regression problems when it comes to data science and predictive analytics. Part of the reason could be that quantitative analysis attracts more interest in the sciences than qualitative analysis. Regression is a quantitative method, whereas classification can resemble qualitative analysis methods.

More accuracy means less error

Another reason for missing the answer to my Crazy Boss Puzzle is associating more accuracy with less error. When you make that association, it seems unrealistic that your boss blames you for a model with less error and cheers you for a model with more error. It is crucial to understand that accuracy is not the only measure of error, especially in classification problems. Error can be subjective and depends on the situation, so its definition might change from one project to another. For example, in a credit card fraud detection problem, the error of not detecting a fraud carries much more weight than the error of flagging a valid transaction.
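One way to express this idea is to attach a cost to each type of error and compare models by total cost instead of by accuracy. The sketch below is only an illustration: the cost values (100 for a missed fraud, 1 for a false alarm), the data, and the function name weighted_cost are assumptions made for this example, not figures from a real project.

```python
def weighted_cost(y_true, y_pred, cost_missed_fraud, cost_false_alarm):
    """Total cost of a model's mistakes, with each error type weighted separately."""
    missed_frauds = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    false_alarms = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return missed_frauds * cost_missed_fraud + false_alarms * cost_false_alarm


# Hypothetical data: 1 = FRAUDULENT, 0 = VALID (100 frauds in 10,000 transactions).
y_true = [1] * 100 + [0] * 9900

# Model A: 99% accuracy, flags nothing, so it misses all 100 frauds.
model_a_pred = [0] * 10000

# Model B: 95% accuracy, catches every fraud but raises 500 false alarms.
model_b_pred = [1] * 100 + [1] * 500 + [0] * 9400

# Assumed costs: a missed fraud is 100 times worse than a false alarm.
model_a_cost = weighted_cost(y_true, model_a_pred, cost_missed_fraud=100, cost_false_alarm=1)
model_b_cost = weighted_cost(y_true, model_b_pred, cost_missed_fraud=100, cost_false_alarm=1)

print("Model A (99% accuracy) cost:", model_a_cost)
print("Model B (95% accuracy) cost:", model_b_cost)
```

With these assumed weights, the more accurate model turns out to be far more expensive (a total cost of 10,000 versus 500). Different cost ratios would give different totals, which is exactly why the definition of error has to come from the purpose of the project.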

Summary

For scientists interested in becoming good data scientists, I suggest changing the mindset in addition to studying and acquiring skills. In this article, I suggested two changes of mindset. First, classification problems are as important as regression problems. Second, there are many ways to measure error, and the definition of error and the measure of a model’s success might change depending on the purpose of the project.



