Why You Are *Not* a (Data) Scientist

It is high time we clearly distinguish between engineering and scientific research in the data science domain

Shyam Sundar Dhanabalan
Towards Data Science

Is Bill Gates a scientist? How about Linus Torvalds? Didn’t they build something unique? They actually did more than that! They impacted the whole world with what they “invented” in their own ways. If we don’t call them scientists, then why are we calling junior graduates “Data Scientists”?

Image created from screenshots of Wikipedia articles

By now, you already know that I have a fundamental problem with the term “Data Scientist” and how it obscures the nuances and intricacies required to fulfill this job.

Let us look at how Data Science is defined:

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data. — Wikipedia

And what is Computer Science?

Computer science is the study of algorithmic processes, computational machines and computation itself. As a discipline, computer science spans a range of topics from theoretical studies of algorithms, computation and information to the practical issues of implementing computational systems in hardware and software. — Wikipedia

In my view, very broadly, data science deals with the “what” while computer science deals with the “how” in terms of computation. But we don’t come across many “Computer Scientist” profiles. Rather, we call them Software Developers, Full Stack devs, Database engineers, etc.

Don’t get me wrong. I myself come from a scientific background and I work in the field of data. I built and invented numerical algorithms for more than a decade, well before the hype around data science picked up. I do understand that data science needs the rigor of a scientist: it needs to be fact-based, follow a rigorous process, and be coupled with extreme skepticism about one’s own results. But in most of what we do, there is very little unexplored territory to cover. In reality, we are following what we would otherwise call a proper Software Development Life Cycle (SDLC)!

Source: https://commons.wikimedia.org/wiki/File:General_System_SDLC_%26_Software_SDLC.gif

When scientific rigor and fact-based decisions are applied in software engineering, user research, etc., we don’t jump the gun and call those people scientists, do we? Maybe some do, but it is not the norm.

Though we all think of data science as the hype, in my view it is the role of “data scientist” that has been hyped up and has become very abstract in recent years. The term “scientist”, which used to be applied to people with in-depth knowledge and an acute understanding of scientific principles, is now given to junior candidates with less than a year of experience in their job.

Besides, there are a number of roles within the data domain that are as important as advanced data science, or more so. The term “data science” itself is very vague and adds to the confusion in many role definitions.

The hype around the role of Data Scientist is causing misinformed aspirants to flock towards those titles without understanding the real challenges of business. There are far too many applicants who assume they are scientists and use that assumption to justify a very high salary. But in reality, whenever I set up a team, the data scientist (AI/ML) member is usually the last to be hired, since we need to have the data organized first. From my experience, it is now far harder to find a data architect than a data scientist! As companies realize that Data Science is garbage in, garbage out, they are rediscovering the importance of managing data. With increasing demand in that area, along with shrinking supply as people flock to Data Scientist positions, the salary expectation of a good data engineer is now comparable to or higher than that of a good data scientist.

We all know data science aspirants doing wonders with Kaggle datasets and universities publishing breakthrough algorithms. But the real challenge and creativity come through when you are banging your head to extract meaningful value from scarce data under the associated constraints. The complexity and the ensuing research lie not just in algorithms, but also in data wrangling and in breaking the problem down into smaller, fundamental ones.

The best scientific research starts with enough groundwork before we write a single line of Python code. As a good scientist, you should be steering away from potential pitfalls and failures through evidence-based reasoning about complexity and feasibility. I seldom see data science folks do thorough research and come back with a proposal for how to solve the problem, or even help re-interpret the problem statement. Only a minuscule fraction of practicing data scientists are able to come up with a response to “But how can you contribute when you have absolutely no data to start with?” This solutioning mindset and willingness to do fundamental research on the problem domain and statement is severely lacking in the current era.

If you are given a technical problem to implement and you are merely applying existing processes, code, or solutions in terms of machine learning, I would rather call you an “ML Developer” or “Data Science Developer”. If you are a scientist, tell me the problem you are solving, convince me that it is unique or difficult, and captivate me with your story of how you are solving it.

Real-world data scientists need to be aware of their “impact on business” and articulate how data science can add value. The gap between the hard-core math or computer science specialist and the business analyst is one that needs to be actively bridged. The expectation of a data scientist in a business entity is to build solutions and POCs that cater to business pain points. We are already past the hype of building cool algorithms that are technically complex but of questionable feasibility and limited business value.

If you are aspiring to be a data scientist, ask yourself the question “Why do I need to be a data scientist?” Very often, the job of a data analyst or ML Developer is what you are actually seeking, but in the hyped-up world, everything leads to data scientist. If you can’t build a simple regression algorithm without referring to the internet, you have already failed the expectation of being a scientist. But even if you pass that test, you need to be able to arrive at suitable methods to apply in real-world scenarios. If you can’t do that and you are not a junior candidate, you have failed the evaluation again, because you need someone else to construct the problem for you.
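To make the “simple regression” bar concrete, here is a minimal sketch of what I would expect a candidate to be able to write from memory: ordinary least squares for a single feature, using the closed-form slope and intercept. The function name `fit_simple_regression` and the toy data are purely illustrative assumptions, not taken from any library or real project.

```python
from statistics import fmean


def fit_simple_regression(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Hypothetical helper for illustration only; returns (slope, intercept).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    x_mean = fmean(xs)
    y_mean = fmean(ys)

    # Sum of squared deviations of x (proportional to the variance of x).
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    # Sum of cross-deviations (proportional to the covariance of x and y).
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Toy data lying exactly on the line y = 2x + 1 (made up for this example).
slope, intercept = fit_simple_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Writing this is the easy part. Deciding whether a straight line is even the right model for the business problem in front of you is where the real evaluation begins.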

The fault is not with the aspirants who want to be Data Scientists. Rather, they are victims of the industry’s current state of maturity, with a growing number of job openings carrying the title of Data Scientist. Most hiring managers, myself included, are caught in this dynamic too. However, things are changing, and very soon, with most data science engineering automated, the real “science” roles will emerge. Meanwhile, let us keep our feet on the ground and seek out positions based on what we can do rather than what the title says.

So, if you want to be a data scientist, know the math and stats, understand the fundamentals, be resourceful and delivery-focused, know what the business needs, and ultimately know why you want to be called a scientist instead of an engineer or a developer!

If you are already deep in the realm of data science, or are a very good data science engineer with a keen sense of business value generation, reach out to me. I am always looking for capable candidates to join my team. :)

Author info

Shyam heads the Data team for Yara SmallHolder solutions, where his team is responsible for Data Management & Engineering, BI, Product Analytics, Marketing Analytics, Market Intelligence, Strategic Insights and Data Science. He has set up fully functional teams several times in both the technology and data domains. He is also a practicing data scientist himself and has experience in data science strategy consulting for large corporates.

Please feel free to connect and reach out for a chat on LinkedIn.
